On April 25th, 2022, between 02:44 UTC and 03:30 UTC, PagerDuty experienced an operational incident that delayed incident and event notifications in the US Service Region.
During this time, notifications for incidents on the PagerDuty platform did not reach their destinations via SMS, phone, push, or email. No events were lost or dropped, and all notifications were eventually sent after a maximum delay of approximately 35 minutes. This issue did not affect viewing or updating incidents in the Web UI, Mobile UI, or REST API.
Our data streaming platform, which we use to publish events from our MySQL databases to our Kafka clusters, encountered a failure condition that required human intervention to correct. A previously implemented mitigation designed to correct this specific issue failed to do so automatically. As a result, events destined for downstream microservices could not flow through the pipeline. While we continued to accept new incoming events and allow interaction with customer-facing web applications, processing stopped on downstream services, including those responsible for sending notifications. Events ingested during this time remained safely queued until the processing pipeline could act on them.
Our engineers remediated the failure by switching the data streaming platform to different upstream servers, which restored service. The backlog of queued events was then processed in First In, First Out (FIFO) order, with a maximum delay of about 40 minutes.
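The recovery behavior described above can be illustrated with a toy model: ingestion keeps accepting events while downstream processing is stopped, events queue safely in arrival order, and the backlog drains first-in, first-out once the pipeline is restored. This is a minimal sketch with hypothetical names, not PagerDuty's actual pipeline code:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    """Toy model of a FIFO event pipeline with a stop/resume failure mode."""
    healthy: bool = True
    backlog: deque = field(default_factory=deque)
    delivered: list = field(default_factory=list)

    def ingest(self, event):
        # Ingestion keeps accepting events even while downstream is down.
        self.backlog.append(event)
        if self.healthy:
            self.drain()

    def drain(self):
        # Deliver queued events in arrival (FIFO) order.
        while self.backlog:
            self.delivered.append(self.backlog.popleft())

    def resume(self):
        # Models remediation: once service is restored, the backlog drains.
        self.healthy = True
        self.drain()

p = Pipeline()
p.ingest("e1")                   # delivered immediately while healthy
p.healthy = False                # failure: downstream processing stops
p.ingest("e2"); p.ingest("e3")   # still accepted, queued safely
p.resume()                       # intervention restores service
print(p.delivered)               # FIFO order preserved: ['e1', 'e2', 'e3']
```

The key property, as in the incident, is that no events are lost: everything accepted during the outage is delivered, merely late.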
This issue also delayed incident timeline updates on approximately 2% of accounts for up to 90 minutes, due to an unrelated data migration process that was running before and during the main failure.
Following this incident, our teams conducted a thorough post-mortem investigation, which identified the series of events that led to this failure. Our engineering teams have addressed these findings and implemented corrective actions to guard against this class of failure going forward.
We sincerely apologize for the delayed notifications and the impact this incident had on you and your teams; we understand how vital our platform is to our customers. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to email@example.com.