Failed Notifications, Issues around viewing log entries and incident timelines
Incident Report for PagerDuty
Postmortem

Summary

On March 20th, 2022, between 07:56 UTC and 09:43 UTC, PagerDuty experienced an operational incident that severely delayed incident and event notifications in the US Service Region.

During this time, notifications for incidents in the PagerDuty platform were not delivered to their destinations via SMS, phone, push, or email. No events were lost or dropped, and all notifications were eventually sent after a maximum delay of approximately two hours. This issue did not impact the viewing or updating of incidents in the Web UI, Mobile UI, or REST API.

What Happened

Our data streaming platform, which we use to publish events from our MySQL databases to our Kafka clusters, encountered a novel failure condition that required extended investigation and human intervention to correct. As a result of this failure, events destined for downstream microservices could not flow through the pipeline. While we continued to accept new incoming events and allow interaction with customer-facing web applications, processing stopped on downstream services, including the services responsible for sending out notifications. Events ingested during this time remained safely queued until our processing pipeline could act on them.
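
The behavior described above, where ingestion continues while downstream processing is stalled, can be illustrated with a minimal sketch. This is not PagerDuty's implementation: the `Pipeline` class, its `relay_healthy` flag, and the event names are hypothetical stand-ins for the relay between the databases and the downstream services.

```python
from collections import deque


class Pipeline:
    """Toy model: ingestion always succeeds; the relay to downstream can stall."""

    def __init__(self):
        self.pending = deque()      # hypothetical durable queue of accepted events
        self.delivered = []         # events that reached downstream services
        self.relay_healthy = True   # hypothetical health flag for the relay

    def ingest(self, event):
        # Ingestion does not depend on the relay, so events are never rejected.
        self.pending.append(event)

    def relay(self):
        # While the relay is stalled, events stay queued instead of being dropped.
        if not self.relay_healthy:
            return 0
        moved = 0
        while self.pending:
            self.delivered.append(self.pending.popleft())
            moved += 1
        return moved


pipeline = Pipeline()
pipeline.relay_healthy = False            # simulate the stalled relay
for name in ["event-1", "event-2", "event-3"]:
    pipeline.ingest(name)
stalled_count = pipeline.relay()          # nothing moves while stalled
pipeline.relay_healthy = True             # relay restored
recovered_count = pipeline.relay()        # backlog drains in arrival order
```

In this model, the stall delays delivery but does not lose events, which matches the outcome reported in the summary.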

Our engineers remediated the failure by switching the upstream servers for the data streaming platform, after which our team observed service restoration. The backlog of events was then processed in First In, First Out (FIFO) order, which may have led to unexpected escalations and repeated notifications.
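
As a rough illustration of why a FIFO backlog drain can produce unexpected escalations, the sketch below processes queued events oldest-first and flags any whose delay exceeds an escalation timeout. The 30-minute timeout, the event names, and the per-event timestamps are assumptions chosen for illustration; only the 07:56 and 09:43 UTC boundaries come from this report, and the actual escalation logic may differ.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

ESCALATION_TIMEOUT = timedelta(minutes=30)  # hypothetical policy value


def drain_backlog(backlog, processing_started_at):
    """Process queued events oldest-first and report the delay of each."""
    results = []
    while backlog:
        event_id, created_at = backlog.popleft()
        delay = processing_started_at - created_at
        # Assumption: an event older than the timeout is treated as
        # unacknowledged, so it escalates as soon as it is processed.
        escalates = delay >= ESCALATION_TIMEOUT
        results.append((event_id, delay, escalates))
    return results


utc = timezone.utc
backlog = deque([
    ("event-a", datetime(2022, 3, 20, 7, 56, tzinfo=utc)),  # illustrative
    ("event-b", datetime(2022, 3, 20, 9, 30, tzinfo=utc)),  # illustrative
])
results = drain_backlog(backlog, datetime(2022, 3, 20, 9, 43, tzinfo=utc))
```

Under these assumptions, the event queued at the start of the incident is processed first and has already exceeded the timeout, while the event queued shortly before recovery has not.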

What We Are Doing About This

Following this incident, our teams conducted a thorough postmortem investigation, which identified the series of events that led to this failure. Our engineering teams have worked diligently to address these findings and to guard against this type of failure in the future. The corrective actions included the following:

  • We've implemented a code change to our service container entry point to automate the correction of this failure mode.
  • We will open an issue with the upstream repository maintainers for the affected service to ensure the community knows about this distinct failure mode.

We sincerely apologize for the delayed and unexpected notifications you or your teams experienced, and for the impact this incident had on you. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Mar 28, 2022 - 20:34 UTC

Resolved
We have resolved an incident where PagerDuty customers in the US service region experienced issues with notifications, log entries, and incident timelines. There is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Mar 20, 2022 - 10:20 UTC
Update
We are continuing to monitor improvement in an incident where notifications were failing to be sent out. As part of this recovery, stale notifications are being sent out, which may result in users receiving delayed notifications.
Posted Mar 20, 2022 - 10:03 UTC
Monitoring
We are noticing signs of recovery and are continuing to monitor the incident.
Posted Mar 20, 2022 - 09:49 UTC
Update
We are continuing to investigate this incident. We will provide further updates within 15 minutes.
Posted Mar 20, 2022 - 09:27 UTC
Update
We are continuing to investigate this incident. We will provide further updates within 15 minutes.
Posted Mar 20, 2022 - 09:11 UTC
Update
We are continuing to investigate this incident. We will provide further updates within 15 minutes.
Posted Mar 20, 2022 - 08:56 UTC
Update
We are continuing to investigate an incident where PagerDuty customers in the US service region are experiencing issues with receiving notifications, adding responders, and viewing log entry and incident timeline information. We will provide further updates within 15 minutes.
Posted Mar 20, 2022 - 08:41 UTC
Update
We are continuing to investigate an incident where PagerDuty customers in the US service region are experiencing issues with receiving notifications. We will provide further updates within 15 minutes.
Posted Mar 20, 2022 - 08:25 UTC
Investigating
We are investigating an incident where some PagerDuty customers in the US service region are experiencing issues with receiving notifications. We will provide further updates within 15 minutes.
Posted Mar 20, 2022 - 08:11 UTC
This incident affected: Web Application (Web Application (US)), Notification Delivery (Notification Delivery (US)), and REST API (REST API (US)).