Notification deliverability issues
Incident Report for PagerDuty
Postmortem

Summary

On April 25th, 2022, between 02:44 UTC and 03:30 UTC, PagerDuty experienced an operational incident that delayed incident and event notifications in the US Service Region.

During this time, notifications from incidents in the PagerDuty platform were not delivered to their respective destinations via SMS, phone, push, or email. No events were lost or dropped, and all notifications were eventually sent after a maximum delay of approximately 35 minutes. This issue did not impact the viewing or updating of incidents in the Web UI, Mobile UI, or REST API.

What Happened

Our data streaming platform, which we use to publish events from our MySQL databases to our Kafka clusters, encountered a failure condition that required human intervention to correct. A previously implemented mitigation designed to correct this specific condition failed to do so automatically. As a result, events destined for downstream microservices could not flow through the pipeline. While we continued to accept new incoming events and allow interaction with customer-facing web applications, processing stopped on downstream services, including the services responsible for sending out notifications. Events ingested during this time remained safely queued until our processing pipeline could act on them.

Our engineers remediated the failure by switching the upstream servers for the data streaming platform. This action had the desired effect, and our team observed service restoration. The backlog of events was then processed in First In, First Out (FIFO) order, with a maximum delay of about 40 minutes.
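To illustrate what FIFO processing of a backlog means for delay, here is a minimal sketch. It assumes each queued event records the time it was enqueued and that the backlog drains instantaneously once processing resumes. The function name, the data layout, and the times are hypothetical and do not describe our actual pipeline.

```python
from collections import deque
from typing import Deque, List, Tuple


def drain_backlog(
    backlog: Deque[Tuple[str, float]], resumed_at: float
) -> List[Tuple[str, float]]:
    """Drain (event_id, enqueued_at) pairs oldest-first.

    Returns (event_id, delay) pairs in delivery order. Simplifying
    assumption: every event is delivered at `resumed_at`.
    """
    delivered = []
    while backlog:
        # popleft() removes the oldest entry, which gives FIFO order.
        event_id, enqueued_at = backlog.popleft()
        delivered.append((event_id, resumed_at - enqueued_at))
    return delivered


# Illustrative times in minutes since the start of a hypothetical outage.
queued = deque([("evt-1", 0.0), ("evt-2", 5.0), ("evt-3", 9.0)])
result = drain_backlog(queued, resumed_at=12.0)
```

Under these assumptions, the event queued earliest is delivered first and carries the longest delay, which is why the maximum delay is set by the oldest event in the backlog.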

This issue also delayed the incident timeline feature for approximately 2% of accounts, for up to 90 minutes, because an unrelated data migration process was running before and during the main failure.

What We Are Doing About This

Following this incident, our teams conducted a thorough post-mortem investigation which identified a series of events that led to a failure of this nature. Our engineering teams have worked diligently to address these findings and ensure we've adequately guarded against this manner of failure from now on. The corrective actions included the following:

  • We've implemented further code changes to our service container entry point to automate the correction of this failure mode.
  • We are investigating alternative approaches to how we load-balance this service to ensure it isn’t possible to enter this state in the future.
  • We’re submitting an upstream patch to the codebase used by our data streaming platform to make it more resilient to this failure mode.
  • We’ve implemented an additional step to the back pressure mechanism for migration scripts so that future migrations are paused in the event of a data streaming outage.
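As a rough illustration of the last item, the sketch below shows one way a migration script could pause while the data streaming pipeline is unhealthy. It is a minimal example under stated assumptions, not our implementation: `run_migration`, `apply_batch`, `pipeline_healthy`, and the polling interval are all hypothetical names and values.

```python
import time
from typing import Callable, Iterable, List


def run_migration(
    batches: Iterable[List[int]],
    apply_batch: Callable[[List[int]], None],
    pipeline_healthy: Callable[[], bool],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply migration batches, pausing while the pipeline is unhealthy.

    Returns the number of batches applied. `pipeline_healthy` is a
    hypothetical health probe supplied by the caller.
    """
    applied = 0
    for batch in batches:
        # Back pressure: wait instead of generating more change events
        # for a pipeline that cannot currently process them.
        while not pipeline_healthy():
            sleep(poll_seconds)
        apply_batch(batch)
        applied += 1
    return applied
```

Checking health before every batch keeps the amount of migration traffic added during an outage to at most the one batch already in flight.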

We sincerely apologize for the delayed notifications and for the impact this incident had on you and your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted May 05, 2022 - 16:43 UTC

Resolved
Following the remediation of the notification delivery issue, we have now also confirmed full recovery of incident timeline functionality for all customers. All our systems are now back to a normal state.
Posted Apr 25, 2022 - 04:16 UTC
Update
While the notification delivery issue has been fixed, we are continuing to monitor the impact of the fix on the incident timeline functionality. We will share further updates as they become available.
Posted Apr 25, 2022 - 04:05 UTC
Update
We have confirmed full recovery of our notification delivery functionality. Incident timeline functionality has returned to normal for most customers. We are continuing to monitor the outcome of the deployed fix.
Posted Apr 25, 2022 - 03:38 UTC
Monitoring
We have deployed a fix for the issue and are monitoring the state of our systems. An additional impact of the incident was that incident timelines for new incidents were temporarily unavailable; this functionality is also in the process of recovering.
Posted Apr 25, 2022 - 03:26 UTC
Investigating
PagerDuty is currently experiencing an issue with notification deliverability. This is impacting customers in our US service region. We are currently investigating the cause and scope of the issue.
Posted Apr 25, 2022 - 03:09 UTC
This incident affected: Web Application (Web Application (US)) and Notification Delivery (Notification Delivery (US)).