Global events failing to create alerts and incidents
Incident Report for PagerDuty
Postmortem

Summary

On Monday, May 2nd, 2022, between 15:25 UTC and 16:50 UTC, PagerDuty experienced an incident that disrupted the processing of global events.

During this time period, all global events – events sent to Global Rulesets and Event Orchestrations – were accepted by our system, but most were dropped before reaching the intended customers.

What Happened

A deployment caused traffic sent to Global Rulesets and Event Orchestrations to be treated as test data, which in turn caused these global events to be dropped before they were processed further in the pipeline. The deployment was intended to copy a specific subset of events as test data, but because of a separate recent change to our service's configuration structure, global events were affected instead.

During the time of impact, 75% of global events were dropped before reaching our internal upstream services. These events failed silently: systems sending the events received a successful response, but the events did not trigger alerts, incidents, or notifications, and were therefore not visible in the web UI, mobile UI, or REST API.

Our engineers remediated the failure by redeploying a previous version of the codebase. This action restored the processing path for global events and returned our system to a healthy state. However, we could not recover all of the global events that were impacted. We were able to reprocess 26% of the dropped global events as suppressed events; no notifications were sent for them, but they are visible for historical purposes.
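
As a rough worked calculation from the two figures in this report (75% of global events dropped, 26% of the dropped events reprocessed), the overall shares can be estimated as below. This assumes both percentages are measured by event count over the same impact window, which the report does not state explicitly. The function and variable names are illustrative only.

```python
# Figures taken from the report. Assumption: both percentages refer to
# event counts over the same impact window.
DROPPED_SHARE = 0.75           # share of global events dropped
REPROCESSED_OF_DROPPED = 0.26  # share of dropped events later reprocessed


def impact_breakdown(dropped, reprocessed):
    """Return shares of all global events as
    (processed_normally, recovered_as_suppressed, not_recovered)."""
    processed = 1.0 - dropped
    recovered = dropped * reprocessed
    lost = dropped - recovered
    return processed, recovered, lost


processed, recovered, lost = impact_breakdown(DROPPED_SHARE, REPROCESSED_OF_DROPPED)
print(f"processed normally:      {processed:.1%}")
print(f"recovered as suppressed: {recovered:.1%}")
print(f"not recovered:           {lost:.1%}")
```

Under that assumption, roughly a fifth of all global events sent during the window were later recovered as suppressed events, and a little over half could not be recovered.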

What We Are Doing About This

Following this incident, our teams conducted a postmortem investigation, which identified the factors that contributed to this incident, as well as what we can do to ensure similar incidents do not happen again. The corrective actions included the following:

  • We've added monitoring to notify us of cases where traffic is being dropped before reaching upstream services.
  • We've implemented deployment strategies (automated canary analysis) to reduce the likelihood of bad deployments impacting our customers.
  • We've decreased the time it takes to roll back our services.
  • We've improved our internal service configurations to make them more straightforward and less error-prone.
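
The first two items above, drop monitoring and canary analysis, can be illustrated with a minimal sketch. This is not PagerDuty's actual implementation: the `WindowCounts` type, the function names, and all thresholds are hypothetical, chosen only to show the idea of comparing events accepted at ingestion with events forwarded to the next stage.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Hypothetical per-window counters for one pipeline stage."""
    accepted: int   # events acknowledged to the sender at ingestion
    forwarded: int  # events handed on to the next stage


def drop_ratio(counts: WindowCounts) -> float:
    """Fraction of accepted events that were never forwarded."""
    if counts.accepted == 0:
        return 0.0
    dropped = max(counts.accepted - counts.forwarded, 0)
    return dropped / counts.accepted


def should_alert(counts: WindowCounts, threshold: float = 0.05,
                 min_events: int = 100) -> bool:
    """Alert when the drop ratio exceeds the threshold.

    Windows with very little traffic are ignored to avoid noisy alerts.
    Both default values are illustrative assumptions.
    """
    if counts.accepted < min_events:
        return False
    return drop_ratio(counts) > threshold


def canary_regressed(baseline: WindowCounts, canary: WindowCounts,
                     max_increase: float = 0.02) -> bool:
    """Flag a canary whose drop ratio is noticeably worse than the baseline."""
    return drop_ratio(canary) - drop_ratio(baseline) > max_increase


# Example: a window resembling this incident, with 75% of events dropped.
incident_window = WindowCounts(accepted=1000, forwarded=250)
print(should_alert(incident_window))
```

A check of this kind targets silent failures specifically: it keeps working when senders receive successful responses, because it compares internal counters rather than relying on error responses.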

We sincerely apologize for these failed events, the consequently unsent notifications, and the impact this had on you and your teams. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted May 09, 2022 - 21:47 UTC

Resolved
This incident has been resolved.
Posted May 02, 2022 - 17:03 UTC
Monitoring
We have fixed the issue and all new events are now being processed normally. We are continuing to monitor.
Posted May 02, 2022 - 16:59 UTC
Identified
We’ve been experiencing an issue whereby all events sent to global rulesets and orchestrations have been failing to create alerts and incidents since approximately 15:25 UTC. We are working to remediate the issue.
Posted May 02, 2022 - 16:51 UTC
This incident affected: Events API (Events API (US), Events API (EU)).