Issue with Event Processing
Incident Report for PagerDuty
Postmortem

Summary

On May 23rd from 18:36 UTC to 19:04 UTC, PagerDuty experienced a degradation in its ability to process event data for inbound integrations. As a result, incident creation and notifications were delayed during this time. The mobile app, our REST API, and our Web UI were all unaffected.

What Happened?

This degradation was caused by a bad code deploy. Events processed under a specific configuration, one that did not appear in our automated tests or staging environment, failed under the new code. This drove error rates in our event processing systems above normal, which triggered auto-remediation measures that slowed down all event processing for a period of time.
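For illustration, the kind of auto-remediation described above can be thought of as an error-rate throttle. The sketch below is a simplified assumption of how such a safeguard might work; the class, thresholds, and method names are hypothetical and are not PagerDuty's published implementation.

# Illustrative sketch only: all names and thresholds are hypothetical
# assumptions, not PagerDuty's actual auto-remediation logic.
import time


class ErrorRateThrottle:
    """Slow down event processing when the observed error rate spikes."""

    def __init__(self, error_threshold=0.05, slowdown_seconds=0.5):
        self.error_threshold = error_threshold    # e.g. 5% errors over the window
        self.slowdown_seconds = slowdown_seconds  # delay applied while degraded
        self.successes = 0
        self.errors = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.successes += 1
        else:
            self.errors += 1

    def error_rate(self) -> float:
        total = self.successes + self.errors
        return self.errors / total if total else 0.0

    def before_processing(self) -> None:
        # Once errors exceed the threshold, *all* event processing is slowed,
        # which matches the behavior described in this incident.
        if self.error_rate() > self.error_threshold:
            time.sleep(self.slowdown_seconds)

A safeguard like this protects downstream systems from a flood of failing work, but it also explains why processing of healthy events was delayed while the bad deploy was live.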

This issue was resolved by deploying a last-known-good version of our code and reprocessing all traffic that failed to process under the faulty code. No events were dropped.
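The reprocessing step amounts to replaying the failed events through the restored build. The sketch below is only an assumption of what such tooling might look like; failed_event_store and ingest_queue are hypothetical names, as the actual internal tooling has not been described.

# Illustrative sketch only: "failed_event_store" and "ingest_queue" are
# hypothetical names for internal components.
def reprocess_failed_events(failed_event_store, ingest_queue):
    """Re-enqueue every event that failed under the faulty deploy.

    Because failed events are retained rather than discarded, rolling back to
    a last-known-good build and replaying this store means events are delayed,
    not dropped.
    """
    replayed = 0
    for event in failed_event_store.drain():  # yields and removes stored failures
        ingest_queue.enqueue(event)           # processed again by the good build
        replayed += 1
    return replayed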

What Are We Doing About This?

We are working to re-introduce safeguards that automatically fail code deployments which raise error rates, even when those deployments pass all other automated testing.
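A common shape for such a safeguard is a deploy gate that compares the new code's error rate against the current baseline and rolls back on a significant regression. The sketch below is an assumed example of that pattern; the function name, metric inputs, and thresholds are illustrative, not PagerDuty's actual deployment pipeline.

# Illustrative sketch only: thresholds and names are assumptions.
def should_roll_back(baseline_error_rate: float,
                     canary_error_rate: float,
                     max_ratio: float = 2.0,
                     min_absolute: float = 0.01) -> bool:
    """Fail the deployment if the new code's error rate is meaningfully worse.

    The deploy is rejected when the new code's error rate is both above a
    small absolute floor and more than `max_ratio` times the baseline, even
    if it passed all other automated tests.
    """
    if canary_error_rate < min_absolute:
        return False
    return canary_error_rate > max_ratio * baseline_error_rate


# Example: baseline 0.2% errors, new code 3% errors -> roll back.
assert should_roll_back(0.002, 0.03) is True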

We have also added automated test coverage to address the specific configuration that we discovered as part of the root cause of this incident.

Posted May 29, 2019 - 20:09 UTC

Resolved
This issue has been resolved and event ingestion is back to normal. All queued events have now been successfully processed.
Posted May 23, 2019 - 19:04 UTC
Monitoring
We have deployed a solution and are currently monitoring event ingestion. We are re-enqueuing any failed events to be processed.
Posted May 23, 2019 - 18:56 UTC
Identified
We are currently experiencing an issue with event processing through our Events API and with email integrations. Our engineering team is actively investigating solutions.
Posted May 23, 2019 - 18:50 UTC
This incident affected: Events API.