Issues with alert severity
Incident Report for PagerDuty
Postmortem

Summary

On Monday, November 8th, from 22:30 to 23:00 UTC, and on Tuesday, November 9th, from 22:00 to 23:30 UTC, PagerDuty experienced an incident that caused alert severities to be set to “critical” for a small percentage of event requests, regardless of the severity the request specified.

What Happened?

Prior to the incident, PagerDuty engineers prepared a release candidate that refactored request parameter validation for an internal endpoint that handles events. On Monday, out of caution, the release candidate was deployed to a canary fleet that served approximately 6% of event traffic. Though the release candidate had been tested against many scenarios in multiple test environments, one request parameter, namely `severity`, was missed. As a result, events processed by the canary fleet defaulted to a severity of “critical,” even when requests to our public endpoints had the parameter set.
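The report does not describe the code involved, so the following is only a minimal sketch of one way this class of bug can arise: a refactored validator that keeps an allow-list of fields and silently drops anything not on it. Every name here (`validate`, `build_event`, the allow-lists, the sample field names) is hypothetical and is not taken from PagerDuty's implementation; only the “critical” default comes from the description above.

```python
from typing import Any, Dict, Set

# Fallback severity, matching the default described in the report.
DEFAULT_SEVERITY = "critical"

# Hypothetical allow-lists: the refactored one omits "severity".
ALLOWED_FIELDS_REFACTORED: Set[str] = {"summary", "source"}
ALLOWED_FIELDS_FIXED: Set[str] = ALLOWED_FIELDS_REFACTORED | {"severity"}


def validate(params: Dict[str, Any], allowed: Set[str]) -> Dict[str, Any]:
    """Keep only allow-listed fields; anything else is dropped silently."""
    return {key: value for key, value in params.items() if key in allowed}


def build_event(validated: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in the default severity when the validated request has none."""
    event = dict(validated)
    event.setdefault("severity", DEFAULT_SEVERITY)
    return event


# Illustrative request that explicitly asks for a non-critical severity.
request = {"summary": "disk usage at 85%", "source": "db-01", "severity": "warning"}

buggy_event = build_event(validate(request, ALLOWED_FIELDS_REFACTORED))
fixed_event = build_event(validate(request, ALLOWED_FIELDS_FIXED))
```

In this sketch `buggy_event` ends up with the default severity because the caller's value never reaches `build_event`, while `fixed_event` keeps the value the caller sent. A round-trip test that sends each accepted parameter with a non-default value and asserts it survives validation would catch this kind of omission.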

After shutting down the Monday canary fleet, our engineering teams did not find any anomalies and deployed the faulty release candidate to a canary fleet again on Tuesday, this time with slightly more traffic and for a longer period. Shortly after the canary was shut down on Tuesday, our Support team initiated a major incident to investigate the incorrect alert severity behavior reported by customers. Our engineering teams identified the release candidate canary as the primary cause, and confirmed that the issue was isolated to requests processed on that fleet.

What Are We Doing About This?

We have identified deficiencies in the automated testing and telemetry around the affected endpoint, and will be working to improve them.
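The report does not say what form the improved telemetry will take. As one hedged illustration, a canary check could compare the share of “critical” events on the canary fleet against the baseline fleet and flag a large difference. The function names, the sample data, and the 10-percentage-point threshold below are assumptions chosen for the example.

```python
from collections import Counter
from typing import Iterable


def severity_share(severities: Iterable[str], level: str) -> float:
    """Return the fraction of events with the given severity (0.0 if empty)."""
    counts = Counter(severities)
    total = sum(counts.values())
    return counts[level] / total if total else 0.0


def canary_drifted(
    baseline: Iterable[str],
    canary: Iterable[str],
    level: str = "critical",
    max_delta: float = 0.10,  # assumed threshold, not from the report
) -> bool:
    """Flag the canary when its share of `level` differs too much from baseline."""
    delta = abs(severity_share(canary, level) - severity_share(baseline, level))
    return delta > max_delta


# Illustrative samples: a mixed baseline, and a canary where every event is critical.
baseline_sample = ["critical"] * 20 + ["warning"] * 50 + ["info"] * 30
canary_sample = ["critical"] * 100

drift_detected = canary_drifted(baseline_sample, canary_sample)
```

A check like this looks at the output of the fleet, not at which inputs were tested, so it can surface a shifted severity distribution while the canary is still running, even when the test suite missed the parameter.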

We sincerely apologize for any inconvenience this may have caused. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Nov 17, 2021 - 00:24 UTC

Resolved
This incident has been resolved.
Posted Nov 10, 2021 - 01:30 UTC
Update
We are no longer seeing incorrect status escalations (from warning to critical).
Posted Nov 10, 2021 - 01:30 UTC
Update
We are engaging additional teams to assist with the issue and will continue to share updates as our investigation proceeds.
Posted Nov 10, 2021 - 01:04 UTC
Update
We are continuing our investigation of incorrect alert severity and are evaluating several potential sources of the issue.
Posted Nov 10, 2021 - 00:30 UTC
Investigating
We are investigating reports of incidents being incorrectly escalated in severity. We will update when we have more information.
Posted Nov 10, 2021 - 00:00 UTC
This incident affected: Notification Delivery (Notification Delivery (US), Notification Delivery (EU)) and Events API (Events API (US), Events API (EU)).