On July 19th, between 4:36 PM UTC and 5:50 PM UTC, PagerDuty experienced delays processing API events in both the US and EU regions, with events from the Microsoft Azure Alerts integration delayed for the entire duration of the incident. The incident was caused by a configuration change to the Azure integration that triggered failsafes on our side. Those failsafes, in turn, caused slowdowns in event processing for inbound, API-borne events. In response, our on-call responders reverted the change made to the Azure integration, which resulted in a full recovery.
As part of improving the operational efficiency of the event ingestion pipeline at PagerDuty, on June 29th we made changes to the service that transforms incoming events from any integration, sent in an integration-specific format, into a common PagerDuty format. The changes concerned how integration-specific transformation configurations are separately packaged and deployed on the backend so that they can be executed when events for those integrations are processed. As it turned out, the changes introduced an incompatibility with a few of our integrations. However, the incompatibility wouldn’t take effect until changes were made to the integrations themselves.
On the day of the incident, configuration changes were made to the Azure integration, which activated the latent issue in the service and resulted in failures when processing Azure events. Those failures then triggered failsafes in the pipeline, which caused slowdowns in event processing for all inbound, API-borne events.
After being alerted to the errors by our monitoring tools, our on-call responders reverted the changes to the service and redeployed the Azure configurations. This resulted in full service restoration. The Azure events that had failed earlier were also successfully reprocessed.
Following the incident, our teams conducted a thorough investigation into the factors leading up to it and identified several action items to ensure incidents like this one don't happen in the future. The action items include the following:
We apologize for our delays in processing these events and the impact on you and your teams. As always, we stand by our commitment to providing the industry's most reliable and resilient platform. If you have any questions, please reach out to firstname.lastname@example.org.