On March 20th, 2022, between 07:56 UTC and 09:43 UTC, PagerDuty experienced an operational incident that severely delayed incident and event notifications in the US Service Region.
During this time, notifications from incidents in the PagerDuty platform did not make it to their respective destinations via SMS, phone, push, or email. No events were lost or dropped during this time, and all notifications were eventually sent after a maximum delay of approximately two hours. This issue did not impact the viewing or updating of incidents in the Web UI, Mobile UI or REST API.
Our data streaming platform, which we use to publish events from our MySQL databases to our Kafka clusters, encountered a novel failure condition that required extended investigation and human intervention to correct. As a result of this failure, events destined for downstream microservices could not flow through the pipeline. While we continued to accept new incoming events and allow interaction with customer-facing web applications, processing stopped on downstream services, including the services responsible for sending out notifications. Events ingested during this time remained safely queued until our processing pipeline could act on them.
Our engineers remediated the failure by switching the upstream servers for the data streaming platform, after which our team observed service restoration. The backlog of events was then processed in First In, First Out (FIFO) order, which may have led to unexpected escalations and repeated notifications.
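To illustrate why a FIFO backlog produces late notifications rather than lost ones, here is a minimal sketch in Python. It is not PagerDuty's implementation: the `Event` type, the `drain_backlog` function, and the minute-based timestamps are hypothetical, and the sketch assumes for simplicity that every queued event is processed at the instant service resumes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    event_id: int
    ingested_at: int  # minutes after the stall began (hypothetical unit)


def drain_backlog(backlog, resumed_at):
    """Process queued events oldest-first and record each notification's delay."""
    deliveries = []
    while backlog:
        event = backlog.popleft()  # FIFO: the oldest queued event is handled first
        # Simplifying assumption: processing is instantaneous once service resumes.
        deliveries.append((event.event_id, resumed_at - event.ingested_at))
    return deliveries


# Three illustrative events ingested while downstream processing was stalled.
backlog = deque([Event(1, 0), Event(2, 30), Event(3, 90)])
deliveries = drain_backlog(backlog, resumed_at=107)
```

Every event in the sketch is eventually delivered, but the earliest one arrives with the longest delay. A notification that fires long after its triggering event can look unexpected to the recipient.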
Following this incident, our teams conducted a thorough postmortem investigation, which identified the series of events that led to this failure. Our engineering teams have worked diligently to address these findings and to ensure we are adequately guarded against this type of failure going forward. The corrective actions included the following:
We sincerely apologize for the delayed and unexpected notifications you or your teams experienced, and for the impact this incident had on you. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to email@example.com.