On November 1, from 18:10 to 19:44 UTC, PagerDuty experienced a major incident that degraded event ingestion, event processing, and Web UI and REST API requests in the US service region. At 18:10 UTC, a deployment was made to one of the services responsible for processing events. Our monitoring systems proactively notified our engineers of a problem, and they began investigating. Between 18:10 and 18:40 UTC, our Events API returned an elevated rate of HTTP 429 and 500 responses, and the parts of the Web UI and REST API that require event details also returned HTTP 500 errors. At 18:41 UTC, a revert deployment completed and error rates gradually returned to normal. From 18:41 to 19:44 UTC, a backlog of events that had been sent to a dead letter queue was reprocessed.
The incident was caused by a change to traffic mirroring in the event-processing service, which uncovered a bug in another service responsible for storing the events. Because of a missing validation check, invalid requests to store events generated by the traffic mirroring resulted in HTTP 500 responses. Consequently, the smart health checks in place caused the storage service to restart its allocations, which impacted Web UI/API and Events API calls. This, in turn, slowed the processing of notifications and incidents.
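The distinction between a 4xx and a 5xx response is what made this bug consequential: health checks typically treat 500s as a sign the service itself is unhealthy, while a 400 signals a bad request and leaves the service alone. The following is a minimal, hypothetical sketch (the handler, field names, and storage call are illustrative, not PagerDuty's actual code) of how a missing up-front validation check lets invalid input surface as a 500 instead of a 400:

```python
# Hypothetical sketch: without up-front validation, a malformed mirrored
# request fails deep in the storage path and surfaces as an HTTP 500;
# validating first rejects it as an HTTP 400 client error instead.

REQUIRED_FIELDS = {"routing_key", "event_action", "payload"}

def _persist(routing_key: str, action: str, payload: dict) -> None:
    pass  # stand-in for the real storage write

def store_event_unvalidated(request: dict) -> int:
    try:
        # Missing fields raise KeyError inside the storage path,
        # which the handler reports as a server failure.
        _persist(request["routing_key"], request["event_action"], request["payload"])
        return 200
    except KeyError:
        return 500  # invalid input misreported as a service failure

def store_event_validated(request: dict) -> int:
    missing = REQUIRED_FIELDS - request.keys()
    if missing:
        return 400  # client error: reject invalid input before persistence
    _persist(request["routing_key"], request["event_action"], request["payload"])
    return 200
```

Under this sketch, the unvalidated path converts every invalid mirrored request into a 500, inflating the service's error rate until health checks restart it, which is the cascade described above.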
To resolve immediate customer impact, at 18:40 UTC our on-call responders reverted the problematic change to the event-processing service, restoring event processing to full capacity. The active incident resolved immediately, fully restoring normal functionality for new incoming events and the Events API, as well as for the Web UI and REST API. However, there was a backlog of events in a dead letter queue that had yet to be retried. That backlog was successfully processed by 19:44 UTC.
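The dead letter queue is what allowed the 64 minutes of failed events to be recovered rather than lost: events that could not be processed during the incident were parked and replayed once the revert was in place. Here is a minimal sketch of that replay pattern, assuming hypothetical `process` and `replay_dead_letters` functions (not PagerDuty's actual implementation):

```python
from collections import deque

def process(event: dict) -> bool:
    """Stand-in for the real event-processing call; True on success."""
    return True

def replay_dead_letters(dlq: deque) -> int:
    """Drain the dead letter queue once, re-enqueueing events that still fail.

    Returns the number of events successfully replayed. Events that fail
    again are appended back to the queue for a later retry pass, so one
    poison event cannot block the rest of the backlog.
    """
    replayed = 0
    for _ in range(len(dlq)):
        event = dlq.popleft()
        if process(event):
            replayed += 1
        else:
            dlq.append(event)  # keep for a later retry pass
    return replayed
```

This kind of replay preserves the original arrival order of the backlog, which is why recovery here completed as a single drain between 18:41 and 19:44 UTC rather than requiring customers to resend events.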
Between 18:10 and 19:44 UTC, event ingestion was impacted as follows:
Following this incident, our teams conducted a thorough post-mortem investigation, which identified several contributing factors. We are committed to addressing each of those factors and preventing similar impact to the service we provide to our customers. The actions we are taking are these:
We sincerely apologize for the impact these delayed notifications had on you or your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to email@example.com.