Outage on Web, REST, and Events APIs
Incident Report for PagerDuty
Postmortem

Summary 

On October 20, 2022, from approximately 10:40 UTC until 10:42 UTC, some customers could not connect and received 500 errors from the web and mobile sites as well as the REST and Events APIs. Webhooks and notifications were also delayed, and event ingestion was throttled for some customers. We continued to process inbound emails without interruption throughout the incident.

What Happened 

The outage was precipitated by network connectivity issues between our service region and one of the disaster recovery regions, starting at 10:28 UTC and lasting until 10:44 UTC. Leading indicators of degradation in our system did not appear until 10:33 UTC, with the effects limited to internal systems. Customer impact began at 10:40 UTC. At that time, critical path services with hard dependencies on the service region restarted and could not start up correctly due to network inaccessibility. These services, as well as all impacted internal services, began self-healing as network connectivity was restored. Alerts notified responders, and a major incident was automatically triggered at 10:44 UTC. We observed recovery starting at 10:42 UTC, which coincided with the end of failed customer events. Some services took several minutes to recover fully, and clean-up actions and throttle removals continued until as late as 11:30 UTC.

What are we doing about this?

We have identified several latent issues that negatively impacted reliability in the face of network connectivity problems and are working to remove them. We also plan to improve our network connectivity detection. Finally, we will reproduce this scenario in a non-production environment to verify future fault tolerance. We apologize for the inconvenience that this has caused. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Oct 28, 2022 - 23:26 UTC

Resolved
We have identified and resolved an incident where some PagerDuty customers in the US service region experienced issues with delays in notifications, UI and APIs. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Oct 20, 2022 - 11:30 UTC
Update
We are continuing to work on a fix for this issue.
Posted Oct 20, 2022 - 11:17 UTC
Identified
We are still investigating an incident where some PagerDuty customers in the US service region are experiencing issues with delays in notifications, UI and APIs. We will provide further updates within 20 minutes.
Posted Oct 20, 2022 - 11:16 UTC
Investigating
We are investigating potential issues with delays in notifications, UI and APIs within PagerDuty. On confirmation, we will update with further impact and severity within 15 minutes.
Posted Oct 20, 2022 - 11:01 UTC
This incident affected: REST API (REST API (US)), Web Application (Web Application (US)), and Notification Delivery (Notification Delivery (US)).