On Monday, November 8th 22:30 – 23:00 UTC and Tuesday, November 9th 22:00 – 23:30 UTC, PagerDuty experienced an incident that caused alert severities to be set to “critical” for a small percentage of event requests.
Prior to the incident, PagerDuty engineers prepared a release candidate that refactored request parameter validation for an internal endpoint that handles events. On Monday, out of caution, the release candidate was deployed to a canary fleet that served approximately 6% of event traffic. Though the release candidate had been tested against many scenarios in multiple test environments, validation of one parameter – namely `severity` – was missed. As a result, events processed by the canary fleet defaulted to a severity of “critical,” even when requests to our public endpoints had the parameter set.
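To illustrate the failure mode, the following is a minimal, hypothetical sketch (not PagerDuty's actual code): if a refactored validator copies only the fields it recognizes and `severity` is omitted from that list, the downstream handler never sees the client-supplied value and falls back to its default.

```python
# Hypothetical sketch of the failure mode; field names and defaults are assumptions.
ACCEPTED_FIELDS = ["summary", "source", "component", "group", "class"]  # "severity" missing

def validate_event(raw_request: dict) -> dict:
    """Return only the recognized fields from the incoming event payload."""
    return {field: raw_request[field] for field in ACCEPTED_FIELDS if field in raw_request}

def process_event(event: dict) -> dict:
    # Downstream code applies a default when the field is absent.
    event.setdefault("severity", "critical")
    return event

incoming = {"summary": "Disk usage at 91%", "source": "db-42", "severity": "warning"}
print(process_event(validate_event(incoming))["severity"])  # prints "critical", not "warning"
```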
After shutting down the Monday canary fleet, our engineering teams found no anomalies and proceeded to deploy the faulty release candidate to a canary fleet again on Tuesday, this time with slightly more traffic and for a longer period. Shortly after the canary was shut down on Tuesday, our Support team initiated a major incident to investigate the incorrect alert severity behavior reported by customers. Our engineering teams identified the release candidate canary as the primary cause and confirmed that the issue was isolated to requests processed on that fleet.
We have identified deficiencies in the automated testing and telemetry around the affected endpoint, and will be working to improve them.
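As an example of the kind of coverage gap involved, the sketch below shows a hypothetical pass-through regression test (not PagerDuty's test suite; the `handle_event` stand-in and field names are assumptions). A check of this kind asserts that a client-supplied value such as `severity` survives validation and processing, and that the default applies only when the field is absent.

```python
# Hypothetical regression test sketch; in practice this would exercise the
# endpoint's real validate/process code rather than the stand-in below.
import unittest

def handle_event(raw_request: dict) -> dict:
    # Stand-in for the validation/processing pipeline.
    accepted = ["summary", "source", "component", "group", "class", "severity"]
    event = {f: raw_request[f] for f in accepted if f in raw_request}
    event.setdefault("severity", "critical")
    return event

class EventFieldPassthroughTest(unittest.TestCase):
    def test_supplied_severity_is_preserved(self):
        payload = {"summary": "Disk usage at 91%", "source": "db-42", "severity": "warning"}
        self.assertEqual(handle_event(payload)["severity"], "warning")

    def test_default_applies_only_when_severity_is_absent(self):
        payload = {"summary": "Disk usage at 91%", "source": "db-42"}
        self.assertEqual(handle_event(payload)["severity"], "critical")

if __name__ == "__main__":
    unittest.main()
```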
We sincerely apologize for any inconvenience this may have caused. For any questions, comments, or concerns, please reach out to support@pagerduty.com.