Between 18:55 UTC and 22:23 UTC on Monday, April 18, PagerDuty experienced an incident that delayed responder notifications, incident resolutions, and incident escalations across our US service region. During this time, other notifications continued to work. PagerDuty’s APIs, as well as web and mobile apps, were otherwise functional and remained available.
An internal cleanup action on expired accounts put extreme pressure on one of PagerDuty’s job queueing systems. This system also processes customer-centric actions such as incident escalations, resolutions, and responder requests. The system attempts to provide fairness between accounts, but the large number of jobs from a large number of accounts resulted in slow queries to our databases and slow decisions about which job to process next.
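As an illustration of what fairness between accounts can mean in a job queue, the following is a minimal sketch of round-robin scheduling across accounts. It is not PagerDuty's implementation: the `FairJobQueue` class, its method names, and the in-memory data structures are all hypothetical, and the sketch does not model the database queries that slowed down during the incident.

```python
from collections import OrderedDict, deque


class FairJobQueue:
    """Hypothetical round-robin queue: each account gets one job per turn."""

    def __init__(self):
        # account_id -> FIFO of that account's pending jobs, kept in rotation order
        self._queues = OrderedDict()

    def enqueue(self, account_id, job):
        """Add a job to the end of the given account's own FIFO."""
        self._queues.setdefault(account_id, deque()).append(job)

    def dequeue(self):
        """Return (account_id, job) for the next account in rotation, or None."""
        if not self._queues:
            return None
        # Take the account at the front of the rotation.
        account_id, jobs = self._queues.popitem(last=False)
        job = jobs.popleft()
        if jobs:
            # The account still has work: move it to the back of the rotation.
            self._queues[account_id] = jobs
        return account_id, job
```

With this rotation, an account that enqueues thousands of jobs still yields a turn to every other account after each job it runs. The cost of choosing the next job is what matters at scale: here it is constant time in memory, whereas a scheduler that has to query a database across many accounts to make the same decision can become the bottleneck.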
PagerDuty became aware of this issue at about 20:30 UTC and initiated an incident response. Responders were able to manually remove the internal cleanup jobs, and we started seeing recovery at about 21:45 UTC. All queued work was finished by 22:23 UTC. All delayed actions were processed during the recovery period, so customers will have noticed their delayed notifications being delivered and their pending incident escalation and resolution actions being processed during that time.
We have moved the work that overwhelmed the affected job queueing system to another system that is more appropriate for internal processes. We are reducing our dependence on this system by moving other work away from it and replacing this part of our architecture with more scalable solutions.
We are also investigating ways to improve scheduling performance, specifically for situations where a large number of jobs from a large number of accounts need to be processed. In addition, we are looking at tooling to help us quickly remove jobs if necessary.
Finally, we have re-examined our monitoring and have made fixes that reduce the time it takes to detect delays in processing work through this system.
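One common way to detect processing delays in a queue is to track the age of the oldest job still waiting and alert when it crosses a threshold. The sketch below illustrates that general idea only. The function names, the use of Unix timestamps, and the 60-second default threshold are assumptions made for the example and do not describe PagerDuty's actual monitoring.

```python
import time


def oldest_job_age_seconds(enqueued_at_timestamps, now=None):
    """Age in seconds of the oldest waiting job, or 0.0 if nothing is waiting.

    enqueued_at_timestamps is a list of Unix timestamps (hypothetical input),
    one per job that is still queued.
    """
    if now is None:
        now = time.time()
    if not enqueued_at_timestamps:
        return 0.0
    return max(0.0, now - min(enqueued_at_timestamps))


def should_alert(enqueued_at_timestamps, threshold_seconds=60.0, now=None):
    """True when the oldest waiting job has been queued longer than the threshold."""
    return oldest_job_age_seconds(enqueued_at_timestamps, now) > threshold_seconds
```

A signal based on the oldest job's age reflects the delay a customer actually experiences, so it can fire even when queue length or throughput alone still look normal.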
We sincerely apologize for the impact these delays in responder notifications and incident actions had on you and your teams. We understand how vital our platform is for our customers, and we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to firstname.lastname@example.org.