Some incidents not resolving or escalating, responder requests not sending
Incident Report for PagerDuty
Postmortem

Summary

Between 18:55 UTC and 22:23 UTC on Monday, April 18, PagerDuty experienced an incident that caused delays in sending responder notifications, delays in resolving incidents, and delays in escalating incidents across our US service region. During this time, other notifications continued to work. PagerDuty’s APIs, as well as web and mobile apps, were otherwise functional and remained available.

What Happened

An internal cleanup action on expired accounts put extreme pressure on one of PagerDuty’s job queueing systems. This system also processes customer-facing actions such as incident escalations, incident resolutions, and responder requests. The system attempts to provide fairness between accounts, but the large number of jobs from a large number of accounts resulted in slow queries to our databases and slow decisions about which job to process next.
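To illustrate what fairness between accounts can mean in a job queue, here is a minimal in-memory sketch of round-robin scheduling across accounts. It is not PagerDuty's implementation: the class name `FairJobQueue` and its methods are hypothetical, and a real system would keep this state in a database, which is where the slow queries described above would arise.

```python
from collections import OrderedDict, deque


class FairJobQueue:
    """Hypothetical round-robin scheduler: one job per account per turn."""

    def __init__(self):
        # account_id -> FIFO of pending jobs; dict order is the rotation order.
        self._queues = OrderedDict()

    def enqueue(self, account_id, job):
        self._queues.setdefault(account_id, deque()).append(job)

    def next_job(self):
        """Return (account_id, job) for the account at the front, or None if empty."""
        if not self._queues:
            return None
        account_id, jobs = next(iter(self._queues.items()))
        job = jobs.popleft()
        if jobs:
            # The account still has work: send it to the back of the rotation.
            self._queues.move_to_end(account_id)
        else:
            del self._queues[account_id]
        return account_id, job
```

Under this policy no single account can starve the others, but a newly enqueued job still waits behind one job from every account ahead of it in the rotation. The wait therefore grows with the number of accounts that have pending work, which is the kind of load described above.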

PagerDuty became aware of this issue at about 20:30 UTC and initiated an incident response. Responders manually removed the internal cleanup jobs, and we began to see recovery at about 21:45 UTC. All queued work was finished by 22:23 UTC, and all delayed actions were processed during the recovery period: customers will have seen their delayed notifications delivered and their pending incident escalations and resolutions processed.

What We Are Doing About This

We have moved the work that overwhelmed the affected job queueing system to another system that is more appropriate for internal processes. We are reducing our dependence on this system by moving other work away from it and replacing this part of our architecture with more scalable solutions.

We are also investigating ways to improve scheduling performance, specifically for situations where a large number of jobs from a large number of accounts need to be processed, and we are looking at tooling to help quickly remove jobs if necessary.
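As a rough idea of what job-removal tooling could look like, the sketch below separates pending jobs of one kind from the rest. Everything in it is an assumption made for illustration: the `Job` type, its `kind` label, and the `purge_jobs` helper are hypothetical and do not describe PagerDuty's internal tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    account_id: str
    kind: str  # hypothetical label, e.g. "account_cleanup" or "escalation"


def purge_jobs(pending, kind):
    """Split pending jobs into (kept, removed) by kind, preserving order."""
    kept = [job for job in pending if job.kind != kind]
    removed = [job for job in pending if job.kind == kind]
    return kept, removed
```

Returning the removed jobs instead of discarding them lets an operator review what was pulled from the queue and re-run it later on a more suitable system.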

Finally, we have re-examined our monitoring and have made some fixes that reduce time to detection of delays in processing work through this system.
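One common signal for detecting processing delays is the age of the oldest job still waiting in a queue. The sketch below shows that idea only. The function names and the 60-second threshold are assumptions, not a description of the monitoring fixes PagerDuty made.

```python
def oldest_job_age(enqueued_at, now):
    """Seconds the oldest pending job has waited; 0.0 when nothing is pending.

    `enqueued_at` holds the enqueue time of each pending job, in seconds.
    """
    return max((now - t for t in enqueued_at), default=0.0)


def is_delayed(enqueued_at, now, threshold_seconds=60.0):
    """True when the oldest pending job has waited longer than the threshold."""
    return oldest_job_age(enqueued_at, now) > threshold_seconds
```

Alerting on job age rather than queue length catches the case where the queue is short but nothing is being picked up, as well as the case where it is simply long.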

We sincerely apologize for the impact these delays in responder notifications and incident actions had on you and your teams. We understand how vital our platform is for our customers, and we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Apr 29, 2022 - 21:37 UTC

Resolved
We have now fully recovered. All incident functionality has returned to normal.
Posted Apr 18, 2022 - 22:23 UTC
Monitoring
Remedial measures have been implemented and we are now seeing signs of recovery. Some previously failed responder request notifications will be sent late. We’re continuing to monitor.
Posted Apr 18, 2022 - 21:49 UTC
Identified
We have identified the cause of the issue and are in the process of pursuing remediation.
Posted Apr 18, 2022 - 21:15 UTC
Update
We now understand that some responder notifications have also been failing to be sent out. We are continuing to investigate.
Posted Apr 18, 2022 - 21:08 UTC
Investigating
We are currently experiencing an issue whereby some incidents are not getting resolved and escalated as expected. We are investigating.
Posted Apr 18, 2022 - 20:55 UTC
This incident affected: Web Application (Web Application (US), Web Application (EU)).