On January 6th, 2023, between 21:20 UTC and 23:44 UTC, PagerDuty experienced a global operational incident that affected our incident and event notification system. During this time, notifications from incidents in the PagerDuty platform did not reach their respective destinations via SMS, phone, push, or email.
We failed to deliver a small percentage of notifications in the US Service Region and have already contacted affected customers. In the EU Service Region, notifications were delayed, in most cases, up to 10 minutes, with a small percentage being delayed up to 1 hour and 40 minutes. All notification events in the EU Service Region were processed by the end of the incident.
This issue did not impact the viewing or updating of incidents in the Web UI, Mobile UI, or REST API.
The data streaming platform used to publish content from our MySQL clusters to Kafka infrastructure encountered a failure mode during a series of changes intended to increase the operational resilience of the platform. As a result of this failure, events destined for downstream microservices could not flow successfully through the pipeline. While we continued to accept new incoming events and allow interaction with customer-facing web applications, processing stopped on downstream services, including the services responsible for sending out notifications. This resulted in customers not receiving timely SMS, phone, push, or email notifications. Events ingested during this time remained safely queued until our processing pipeline could act on them.
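The key property described above is that ingestion is decoupled from downstream processing: events are accepted and durably queued even while the consumers are stalled. A minimal sketch of that behavior (class and field names are illustrative, not PagerDuty's actual services):

```python
from collections import deque

class NotificationPipeline:
    """Toy model of a decoupled ingest/process pipeline.

    Ingestion always succeeds; downstream processing drains the
    queue only when it is healthy, so a downstream failure causes
    events to accumulate rather than be lost.
    """

    def __init__(self):
        self.queue = deque()           # durable buffer between stages
        self.downstream_healthy = True # hypothetical health flag
        self.sent = []                 # notifications actually delivered

    def ingest(self, event):
        # Ingestion does not depend on downstream health:
        # events are always accepted and queued.
        self.queue.append(event)

    def process(self):
        # Downstream processing drains the queue only when healthy;
        # during a failure, events simply remain queued.
        while self.downstream_healthy and self.queue:
            self.sent.append(self.queue.popleft())
```

In this model, a stalled downstream stage leaves every ingested event sitting in the queue, which matches the incident behavior: nothing was dropped at ingest, but nothing moved forward until the failure was remediated.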
Our engineers remediated the failure by taking corrective action against the affected service. After this action, queued notifications began processing correctly and our team observed service restoration. The backlog of events was then processed, which may have led to unexpected escalations and repeated notifications as our service processed events in First In, First Out (FIFO) order. When the queue began processing events in the US Service Region, some exceeded our 2-hour maximum delay window and were not delivered.
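The backlog drain described above can be sketched as a FIFO pass that delivers events still within the maximum delay window and drops those that have aged out. This is a simplified illustration under assumed names (`drain_backlog`, `enqueued_at`), not PagerDuty's actual implementation:

```python
from collections import deque

MAX_DELAY_SECONDS = 2 * 60 * 60  # 2-hour maximum delivery window


def drain_backlog(queue, now, max_delay=MAX_DELAY_SECONDS):
    """Process queued notification events in FIFO order.

    Events older than max_delay are dropped rather than delivered,
    mirroring the max-delay cutoff applied during the backlog drain.
    """
    delivered, expired = [], []
    while queue:
        event = queue.popleft()  # FIFO: oldest event first
        if now - event["enqueued_at"] > max_delay:
            expired.append(event)    # too stale to deliver
        else:
            delivered.append(event)  # hand off to SMS/phone/push/email
    return delivered, expired
```

For example, draining a queue at `now=8100` that holds one event enqueued at `t=0` and another at `t=8000` would expire the first (age 8100s, beyond the 7200s window) and deliver the second. Processing strictly in FIFO order is why customers could see stale escalations and repeated notifications before current events were handled.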
Our engineering team fully understands the nature of this failure and is working to apply mitigations to improve the resilience of this layer of our platform, including updates to incident runbooks to help diagnose similar issues more quickly. We apologize for the inconvenience that this has caused. For any questions, comments, or concerns, please contact us at email@example.com.