Event Investigation Issue Affecting Global Routing Keys
Incident Report for PagerDuty
Postmortem

Summary

On December 3, 2020, between 20:25 UTC and 21:55 UTC, PagerDuty experienced a major incident in which global events (events sent using global routing keys) were accepted but not processed within SLA.

What Happened

The Global Events service, which processes all incoming global events, restarted repeatedly because an unusual traffic pattern interacted with a change deployed the day before. Incoming events were still ingested throughout, but the repeated restarts caused a backlog of events to build up. The change did not initially appear related to the incident, but the investigation ultimately identified the correlation. A fix that temporarily disabled the problematic functionality was deployed immediately afterward. The service then returned to a stable state and worked through the entire backlog of events.

What We Are Doing About This

We are prioritizing multiple action items to prevent an incident like this from happening again:

  • Permanent Solution: The temporary fix has already been replaced with a permanent change that safely re-enables the disabled functionality.
  • Error Visibility: We are improving our logging capabilities to reduce the Mean-Time-To-Resolution (MTTR) for incidents like this.
  • Monitoring Improvements: We are adding monitoring to the Global Events service to further reduce the Mean-Time-To-Acknowledge (MTTA).
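
For illustration only, a monitoring check of this general kind could be sketched as follows. It is not taken from the Global Events service. It flags a backlog once the oldest unprocessed event has waited longer than a threshold, which is the "accepted but not processed within SLA" condition described above. The function names and the 300-second threshold are hypothetical, since the report does not state the actual SLA or how the backlog is stored.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# Hypothetical threshold; the report does not state the real processing SLA.
PROCESSING_SLA = timedelta(seconds=300)


def oldest_event_age(
    enqueued_at: Iterable[datetime], now: datetime
) -> Optional[timedelta]:
    """Return how long the oldest unprocessed event has waited.

    Returns None when there are no unprocessed events.
    """
    oldest = min(enqueued_at, default=None)
    if oldest is None:
        return None
    return now - oldest


def backlog_breaches_sla(
    enqueued_at: Iterable[datetime],
    now: datetime,
    sla: timedelta = PROCESSING_SLA,
) -> bool:
    """Return True if any unprocessed event has waited longer than the SLA."""
    age = oldest_event_age(enqueued_at, now)
    return age is not None and age > sla


if __name__ == "__main__":
    # Made-up timestamps, purely to exercise the check.
    now = datetime.now(timezone.utc)
    pending = [now - timedelta(seconds=20), now - timedelta(seconds=400)]
    print(backlog_breaches_sla(pending, now))
```

Alerting on the age of the oldest pending event, rather than only on ingestion errors, is one way to notice a backlog while events are still being accepted successfully.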

We’d like to apologize for the impact this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Dec 09, 2020 - 15:18 UTC

Resolved
We are fully recovered and event ingestion is operating as expected.
Posted Dec 03, 2020 - 21:58 UTC
Monitoring
We are continuing to monitor progress towards full recovery.
Posted Dec 03, 2020 - 21:26 UTC
Identified
We are seeing signs of recovery and will continue to monitor progress.
Posted Dec 03, 2020 - 21:07 UTC
Investigating
We are currently experiencing issues with event ingestion for events using global routing keys. Investigation is ongoing.
Posted Dec 03, 2020 - 20:54 UTC
This incident affected: Events API.