On December 3, 2020, between 20:25 UTC and 21:55 UTC, PagerDuty experienced a major incident that caused global events to be accepted but not processed within SLA.
The Global Events service, which processes all incoming global events, restarted repeatedly because of an unusual traffic pattern combined with a change deployed the day before. Incoming events were still ingested throughout, but the repeated restarts caused a backlog of unprocessed events to build up. The change did not initially appear related to the incident, but the investigation ultimately identified the correlation. A fix that temporarily disabled the problematic functionality was deployed immediately afterward. The service then returned to a stable state and worked through the entire backlog of events.
We are prioritizing multiple action items to prevent an incident like this from happening again.
We’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to firstname.lastname@example.org.