Delayed Notifications
Incident Report for PagerDuty
Postmortem

Summary

On January 6th, 2023, between 21:20 UTC and 23:44 UTC, PagerDuty experienced a global operational incident that affected our incidents and events notification system. During this time, notifications for incidents in the PagerDuty platform were delayed or, in some cases, not delivered to their destinations via SMS, phone, push, or email.

We failed to deliver a small percentage of notifications in the US Service Region and have already contacted affected customers. In the EU Service Region, notifications were delayed, in most cases, up to 10 minutes, with a small percentage being delayed up to 1 hour and 40 minutes. All notification events in the EU Service Region were processed by the end of the incident. 

This issue did not impact the viewing or updating of incidents in the Web UI, Mobile UI or REST API. 

What Happened

The data streaming platform used to publish content from our MySQL clusters to Kafka infrastructure encountered a failure mode during a series of changes intended to increase the operational resilience of the platform. As a result of this failure, events destined for downstream micro-services could not flow successfully through the pipeline. While we continued to accept new incoming events and allow interaction with customer-facing web applications, processing stopped on downstream services, including the services responsible for sending out notifications. This resulted in customers not receiving timely SMS, phone, push, or email notifications. Events ingested during this time remained safely queued until our processing pipeline could act on them. 

Our engineers were able to remediate the failure by taking corrective action against the affected service. After this action, queued notifications began processing correctly and our team observed service restoration. A backlog of events was then processed, which may have led to unexpected escalations and repeated notifications because our service processed events in a First In, First Out (FIFO) manner. When the queue began processing events in the US Service Region, some notifications exceeded our 2-hour maximum delay window and were not delivered.
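
The backlog behavior described above can be illustrated with a minimal sketch. It is not PagerDuty's implementation: the `Notification` type, the `drain_backlog` function, and the timestamps are hypothetical, and only the FIFO ordering and the 2-hour maximum delay come from this report.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: the 2-hour limit is taken from the report; how it is
# actually enforced inside the platform is not described there.
MAX_DELAY = timedelta(hours=2)


@dataclass
class Notification:
    id: str
    created_at: datetime


def drain_backlog(queue, now, max_delay=MAX_DELAY):
    """Process a backlog oldest-first and split it into delivered and expired ids."""
    delivered, expired = [], []
    while queue:
        item = queue.popleft()  # FIFO: the oldest queued item is handled first
        if now - item.created_at > max_delay:
            expired.append(item.id)  # too old to be useful, so it is not sent
        else:
            delivered.append(item.id)
    return delivered, expired


# Hypothetical backlog: one item queued at 21:20 UTC, one at 22:30 UTC,
# drained at 23:44 UTC.
utc = timezone.utc
backlog = deque([
    Notification("n1", datetime(2023, 1, 6, 21, 20, tzinfo=utc)),
    Notification("n2", datetime(2023, 1, 6, 22, 30, tzinfo=utc)),
])
delivered, expired = drain_backlog(backlog, datetime(2023, 1, 6, 23, 44, tzinfo=utc))
```

In this sketch the older item has waited 2 hours and 24 minutes and is dropped, while the newer one is still inside the window and is delivered late. Because items are handled strictly in arrival order, a recipient can receive several stale notifications in quick succession once processing resumes.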

Next Steps

Our engineering team fully understands the nature of this failure and is working to apply mitigations to improve the resilience of this layer of our platform, including updates to incident runbooks to help diagnose similar issues more quickly. We apologize for the inconvenience that this has caused. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Jan 13, 2023 - 23:01 UTC

Resolved
We have resolved the incident where some PagerDuty customers in US and EU service regions experienced notification delays. Unfortunately, not all notifications were able to be processed. We will be reaching out to affected customers individually. There is no ongoing impact on customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Jan 07, 2023 - 00:03 UTC
Update
We recently experienced another brief notification delay in the EU region. We have deployed a fix and are monitoring recovery in the EU region.
Posted Jan 06, 2023 - 23:27 UTC
Update
Notifications are fully functional in both EU and US service regions. We are currently working towards processing a set number of blocked notifications.
Posted Jan 06, 2023 - 23:03 UTC
Monitoring
We are seeing improvement in the notification delays and continue to monitor. We are currently deploying a fix in both the EU and US service regions. We expect a full resolution in approximately 15 minutes and will provide an update within that time.
Posted Jan 06, 2023 - 22:49 UTC
Investigating
We are investigating potential issues where some customers might be experiencing notification delays. On confirmation, we will update you with further impact and severity within 15 minutes.
Posted Jan 06, 2023 - 22:39 UTC
This incident affected: Notification Delivery (Notification Delivery (US), Notification Delivery (EU)).