Incident affecting Web UI, REST API, and Events API
Incident Report for PagerDuty
Postmortem

Summary

On November 1, 2022, from 18:10 to 19:44 UTC, PagerDuty experienced a major incident that degraded event ingestion, event processing, and Web UI and REST API requests in the US service region. At 18:10 UTC, a change was deployed to one of the services responsible for processing events. Our system monitors proactively notified our engineers of a problem, and they began investigating. Between 18:10 and 18:40 UTC, our Events API returned a higher rate of 429 and 500 HTTP responses, and parts of the Web UI and REST API that require event details also returned HTTP 500 errors. By 18:41 UTC, a revert deployment had completed and error rates gradually returned to normal. From 18:41 to 19:44 UTC, a backlog of events that had been sent to a dead letter queue was reprocessed.

What Happened

The incident was caused by a change to traffic mirroring in the event-processing service, which uncovered a bug in another service responsible for storing the events. The mirrored traffic produced invalid requests to store events, and because the storage service was missing a validation check, it answered those requests with HTTP 500 responses. Consequently, the smart health checks in place caused the storage service to restart its allocations, which impacted Web UI, REST API, and Events API calls. This, in turn, slowed the processing of notifications and incidents.
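The failure mode described above can be illustrated with a minimal sketch. It is not PagerDuty's code: the field names, the `Response` type, and both handler functions are hypothetical, and the only point is to show how a missing validation check turns a malformed request into a server error instead of a client error.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping

# Hypothetical schema: the real event-storage request format is not public.
REQUIRED_FIELDS = ("event_id", "payload")


@dataclass
class Response:
    status: int
    body: str


def store_event_unvalidated(request: Mapping[str, Any], storage: Dict[str, Any]) -> Response:
    """Handler with no validation: a malformed request fails inside the
    handler and surfaces as an HTTP 500."""
    try:
        storage[request["event_id"]] = request["payload"]
        return Response(201, "stored")
    except Exception as exc:  # stands in for a generic error boundary
        return Response(500, f"internal error: {exc!r}")


def store_event_validated(request: Mapping[str, Any], storage: Dict[str, Any]) -> Response:
    """Handler that checks the request first and rejects bad input with a 400."""
    missing = [field for field in REQUIRED_FIELDS if field not in request]
    if missing:
        return Response(400, "missing fields: " + ", ".join(missing))
    storage[request["event_id"]] = request["payload"]
    return Response(201, "stored")
```

With the validating handler, an invalid mirrored request is reported as the caller's error, so monitoring that reacts to 5XX responses would not treat it as a fault in the storage service.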

At 18:40 UTC, to resolve the immediate customer impact, our on-call responders reverted the problematic change that had been made to the event-processing service, thereby restoring event processing to its full capacity. The active incident resolved immediately, fully restoring normal functionality for new incoming events and the Events API, as well as for the Web UI and REST API. However, a backlog of events remained in a dead letter queue and had yet to be retried. That backlog was successfully processed by 19:44 UTC.
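As a rough illustration of what retrying a dead letter queue involves, the sketch below drains an in-memory queue with a bounded number of attempts per event. The function name, the `max_attempts` parameter, and the in-memory `deque` are assumptions made for the example; the report does not describe PagerDuty's actual queueing technology or retry policy.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Tuple


def drain_dead_letter_queue(
    dlq: Deque[Any],
    process: Callable[[Any], None],
    max_attempts: int = 3,
) -> Tuple[List[Any], List[Any]]:
    """Retry every dead-lettered event in arrival order.

    Returns (processed, still_failed). An event that keeps failing after
    max_attempts tries is set aside instead of blocking the rest of the queue.
    """
    processed: List[Any] = []
    still_failed: List[Any] = []
    while dlq:
        event = dlq.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                process(event)
                processed.append(event)
                break
            except Exception:
                if attempt == max_attempts:
                    still_failed.append(event)
    return processed, still_failed
```

Bounding the attempts keeps a single persistently failing event from delaying the rest of the backlog.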

Between 18:10 and 19:44 UTC, event ingestion was impacted as follows: 

  • Approximately 4% of Events API requests were returned with HTTP 5XX error responses.
  • 4.8% of notifications were delivered outside of SLA.

What We Are Doing About It

Following this incident, our teams conducted a thorough post-mortem investigation, which identified several contributing factors. We are committed to addressing each of those factors so that similar problems do not affect the service we provide to our customers. We are taking the following actions:

  1. We are revising the existing approach to health checks in the event-storage service to ensure they do not create a negative impact on the service’s clients.
  2. We are making further code changes in the event-storage service to improve request validation, making the service more resilient to this failure mode.
  3. We are creating new test cases to catch this kind of escaped defect before release, so that the resulting failure state cannot be reached.
  4. We have completed a tooling improvement which allows reprocessing of events in the dead letter queue at a faster rate.
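The report does not say how the health checks are being revised. One possible direction, shown purely as an assumption, is a check that counts only server-side faults and requires a sustained error rate over a minimum number of samples before recommending a restart. The class name, thresholds, and window size below are all hypothetical.

```python
from collections import deque


class HealthTracker:
    """Sliding-window health signal that ignores client errors (4XX)."""

    def __init__(self, window: int = 100, min_samples: int = 20,
                 max_error_rate: float = 0.5) -> None:
        # Each entry is True when the recorded response was a server fault.
        self.results = deque(maxlen=window)
        self.min_samples = min_samples
        self.max_error_rate = max_error_rate

    def record(self, status: int) -> None:
        # Only 5XX responses count against the service's health.
        self.results.append(status >= 500)

    def should_restart(self) -> bool:
        # Too few samples is not enough evidence to restart anything.
        if len(self.results) < self.min_samples:
            return False
        return sum(self.results) / len(self.results) > self.max_error_rate
```

Under this sketch, a burst of invalid requests that are rejected with 400 responses never pushes the service toward a restart, and a short spike of 500s has to persist across the window before it does.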

We sincerely apologize for the impact these delayed notifications had on you or your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Nov 11, 2022 - 22:16 UTC

Resolved
We have now fully resolved an incident where PagerDuty customers in the US service region experienced issues with the Events API, the web UI, and the REST API. All previously backlogged events have been processed, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Nov 01, 2022 - 19:43 UTC
Update
New events sent to the Events API are now being processed normally, and we are continuing to work through the backlog of events that were accepted during the time of impact but not yet processed. Customers may receive notifications for old events as backlogged events continue to be processed. We will provide another update within 30 minutes regarding progress on the backlog.
Posted Nov 01, 2022 - 19:28 UTC
Monitoring
We are seeing recovery and are monitoring improvement in an incident affecting the web UI, REST API, and Events API. We have deployed a fix, and the web UI and REST API are now behaving normally. Events accepted through the Events API are continuing to be processed, and we expect systems to continue to improve. Customers will need to resend any requests that received 500 errors. We will provide an update within 30 minutes regarding the processing of the backlog of events.
Posted Nov 01, 2022 - 18:59 UTC
Identified
We are investigating an incident where many PagerDuty customers in the US service region are experiencing issues with the web UI, the REST API, and the Events API. Impacted customers may see slow load times and 500 errors in the UI, and 500 and 429 HTTP responses to API requests. We will provide further updates within 20 minutes.
Posted Nov 01, 2022 - 18:43 UTC
Investigating
We are investigating potential issues with the Events API within PagerDuty's US service region. On confirmation, we will update with further impact and severity within 15 minutes.
Posted Nov 01, 2022 - 18:22 UTC
This incident affected: Events API (Events API (US)), REST API (REST API (US)), and Web Application (Web Application (US)).