Delayed incident creation
Incident Report for PagerDuty
Postmortem

Summary

On July 19th, between 4:36 PM UTC and 5:50 PM UTC, PagerDuty experienced delays processing API events in both the US and EU regions, with events from the Microsoft Azure Alerts integration delayed for the entire duration of the incident. The incident was caused by an Azure configuration change that triggered failsafes on our side. Those failsafes, in turn, caused slowdowns in event processing for inbound, API-borne events. In response, our on-call responders reverted the change made to the Azure integration, which resulted in a full recovery.

What Happened

As part of improving the operational efficiency of PagerDuty's event ingestion pipeline, on June 29th we made changes to the service that transforms incoming events from any integration, sent in that integration's specific format, into a common PagerDuty format. The changes altered how integration-specific transformation configurations are packaged and deployed on the backend so that they can be executed when events for those integrations are processed. As it turned out, the changes introduced an incompatibility with a few of our integrations. However, the incompatibility would not take effect until changes were made to the integrations themselves.
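
To make the mechanism concrete, the following is a minimal, illustrative sketch of the kind of integration-specific transformation this service performs, assuming an Azure Monitor-style alert payload and a target shape resembling the PagerDuty Events API format. The field names and mapping are simplifications for illustration only, not our production code.

    # Illustrative sketch only: the payload fields and mapping below are
    # assumptions, not PagerDuty's internal transformation code.

    def transform_azure_alert(azure_payload: dict) -> dict:
        """Map an (assumed) Azure Monitor common-alert-schema payload into a
        common event shape resembling the PagerDuty Events API format."""
        essentials = azure_payload.get("data", {}).get("essentials", {})

        severity_map = {"Sev0": "critical", "Sev1": "error",
                        "Sev2": "warning", "Sev3": "info", "Sev4": "info"}

        return {
            "event_action": "trigger" if essentials.get("monitorCondition") == "Fired" else "resolve",
            "payload": {
                "summary": essentials.get("alertRule", "Azure alert"),
                "source": (essentials.get("alertTargetIDs") or ["azure"])[0],
                "severity": severity_map.get(essentials.get("severity"), "error"),
                "timestamp": essentials.get("firedDateTime"),
                "custom_details": azure_payload,
            },
        }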

On the day of the incident, configuration changes were made to the Azure integration. This activated the latent issue in the transformation service, resulting in failures when processing Azure events. Those failures then triggered failsafes in the pipeline, which caused slowdowns in event processing for all inbound, API-borne events.

After being alerted to the errors by our monitoring tools, our on-call responders reverted the changes to the service and redeployed the Azure configurations. This restored full service, and the earlier failed Azure events were successfully reprocessed.

What We Are Doing About This

Following the incident, our teams conducted a thorough investigation into the factors that led to it and identified several action items to help ensure incidents like this don't happen again. The action items include the following:

  • Fixing the test infrastructure that should have caught the problem in our pre-production environment
  • Enhancing the test suites on the service that executes event transformations
  • Introducing integration-specific circuit breakers in the pipeline to contain the impact to just the broken integration (see the illustrative sketch after this list)
  • Exploring options for canarying and gradually rolling out integration-specific configurations after they are deployed to production
  • Adding monitors to shorten detection and response times for similar failures
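
To illustrate the circuit-breaker idea referenced above, here is a minimal sketch of a per-integration breaker that would trip only for the failing integration. The class, thresholds, and method names are assumptions for illustration and do not represent our actual pipeline code.

    # Illustrative sketch only: a per-integration circuit breaker; names and
    # thresholds are assumptions, not PagerDuty's actual pipeline code.

    import time
    from collections import defaultdict

    class IntegrationCircuitBreaker:
        """Trips for a single integration after repeated transform failures,
        so one broken integration cannot slow down the whole pipeline."""

        def __init__(self, failure_threshold: int = 5, reset_after_s: float = 60.0):
            self.failure_threshold = failure_threshold
            self.reset_after_s = reset_after_s
            self.failures = defaultdict(int)   # integration -> consecutive failures
            self.opened_at = {}                # integration -> time the breaker tripped

        def allow(self, integration: str) -> bool:
            opened = self.opened_at.get(integration)
            if opened is None:
                return True
            if time.monotonic() - opened >= self.reset_after_s:
                # Cool-down elapsed: close the breaker and try again.
                del self.opened_at[integration]
                self.failures[integration] = 0
                return True
            return False

        def record_success(self, integration: str) -> None:
            self.failures[integration] = 0

        def record_failure(self, integration: str) -> None:
            self.failures[integration] += 1
            if self.failures[integration] >= self.failure_threshold:
                self.opened_at[integration] = time.monotonic()

In such a design, events arriving for a tripped integration would be set aside for later reprocessing rather than retried inline, so a single broken integration cannot back up the shared pipeline.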

We apologize for the delays in processing these events and the impact on you and your teams. As always, we stand by our commitment to providing the industry's most reliable and resilient platform. If you have any questions, please reach out to support@pagerduty.com.

Posted Jul 26, 2022 - 20:12 UTC

Resolved
This incident has been resolved. Incidents are no longer delayed.
Posted Jul 19, 2022 - 18:03 UTC
Monitoring
We have deployed remediation measures and are currently monitoring.
Posted Jul 19, 2022 - 17:55 UTC
Identified
We have identified the issue and are pursuing remediation strategies.
Posted Jul 19, 2022 - 17:46 UTC
Update
We are continuing to investigate this issue.
Posted Jul 19, 2022 - 17:22 UTC
Investigating
We are investigating potential issues with incident creation affecting some US and EU accounts.
Posted Jul 19, 2022 - 17:03 UTC
This incident affected: Events API (US) and Events API (EU).