On May 23rd from 18:36 UTC to 19:04 UTC, PagerDuty experienced a degradation in its ability to process event data for inbound integrations. As a result, incident creation and notifications were delayed during this time. The mobile app, our REST API, and our Web UI were all unaffected.
This degradation was caused by a bad code deploy. A specific event configuration, one that did not come up in our automated tests or staging environment, produced errors under the new code. This drove error rates in our event processing systems well above normal, triggering auto-remediation measures that slowed down all event processing for a period of time.
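The auto-remediation behavior described above resembles an error-rate circuit breaker: once the rolling error rate crosses a threshold, the system throttles intake. The sketch below is purely illustrative; the class name, window size, and threshold are assumptions, not details of PagerDuty's actual system.

```python
from collections import deque

class ErrorRateBreaker:
    """Hypothetical sketch of error-rate-triggered throttling."""

    def __init__(self, window: int = 100, threshold: float = 0.2):
        self.results = deque(maxlen=window)  # rolling window of outcomes
        self.threshold = threshold           # error fraction that trips throttling

    def record(self, success: bool) -> None:
        self.results.append(success)

    @property
    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_throttle(self) -> bool:
        # Throttle all processing once the rolling error rate exceeds
        # the threshold -- the side effect seen during this incident,
        # where remediation slowed even healthy traffic.
        return self.error_rate > self.threshold
```

Note the trade-off this incident surfaced: a breaker keyed on aggregate error rate protects downstream systems, but it slows every event, not just those hitting the buggy path.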
This issue was resolved by deploying a last-known-good version of our code and reprocessing all traffic that failed to process under the faulty code. No events were dropped.
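The recovery pattern above, retaining failed events rather than dropping them, then replaying them once good code is running, can be sketched as a simple dead-letter replay loop. All names here are illustrative assumptions; the real pipeline would use a durable store, not an in-memory list.

```python
failed_events = []  # stand-in for a durable dead-letter store

def process(event, handler) -> None:
    """Run one event; on failure, retain it for later replay."""
    try:
        handler(event)
    except Exception:
        failed_events.append(event)  # never drop -- keep for reprocessing

def replay_failed(handler) -> None:
    """Re-run every retained event through a known-good handler."""
    retained, failed_events[:] = failed_events[:], []
    for event in retained:
        process(event, handler)
```

Because `process` is reused inside `replay_failed`, any event that fails again is simply retained again rather than lost.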
We will be working to re-introduce measures to automatically fail code deployments that raise error rates, even if they pass all other automated testing.
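A deploy gate like the one described above can be as simple as comparing a canary's error rate against the current baseline and rejecting the rollout on regression, even when all tests pass. The function name and tolerance below are hypothetical, shown only to make the idea concrete.

```python
def deploy_allowed(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_increase: float = 0.01) -> bool:
    """Hypothetical gate: fail a deploy whose canary error rate
    exceeds the baseline by more than an allowed margin."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate + max_increase
```

A gate like this complements, rather than replaces, automated tests: it catches configurations (like the one in this incident) that only appear in production traffic.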
We have also added automated test coverage to address the specific configuration that we discovered as part of the root cause of this incident.