500 Errors on Website
Incident Report for PagerDuty


On October 12, 2021, between 17:38 UTC and 19:32 UTC, we experienced site-wide performance degradation and elevated 5xx errors that prevented users from accessing PagerDuty. Notification delivery and event ingestion were unaffected during this incident.

What Happened?

On October 12, 2021, starting at 17:38 UTC, response times spiked for requests from one of our integrations. These long-running requests began consuming our available connections. As a result, users accessing PagerDuty via the web UI began experiencing slow load times or 500 error pages. The incident response team did not initially suspect the integration as the root cause and instead rolled back a deployment that had completed just before the slow response times and elevated 5xx errors were detected. Soon after the rollback completed, response times and error rates returned to normal, which coincided with a drop in requests received by the responsible integration. However, response times increased again shortly thereafter. We then identified the integration and the specific account that was the true source of the traffic causing the problem. Using available controls, we rate-limited requests to that integration’s endpoint for the customer in question. Once the integration was blocked, response times returned to normal.
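
The mitigation described above, limiting one integration's endpoint for a single account, can be illustrated with a per-key token bucket. This is a minimal sketch and not the control we actually used; the class names, the `(account_id, integration)` key, and the rate and capacity values are all hypothetical.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def allow(self, now):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerKeyRateLimiter:
    """Keeps one bucket per (account, integration) pair (hypothetical key)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, account_id, integration):
        key = (account_id, integration)
        now = self._clock()
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, now)
            self._buckets[key] = bucket
        return bucket.allow(now)
```

Because each account and integration pair has its own bucket, a single noisy sender is throttled without affecting requests from other accounts.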

What Are We Doing About This?

During the incident, we discovered that request connections were being held open far longer than they should have been because of generous queuing and timeout configuration. Our infrastructure team will review these settings and adjust them to prevent outages caused by connection strain.
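
To illustrate why tighter queuing and timeout settings help, here is a minimal sketch of a connection budget that caps in-flight requests and rejects callers that wait too long for a slot, rather than letting them queue indefinitely. It is not our actual configuration; `ConnectionBudget`, `Overloaded`, and the limits shown are hypothetical.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when no slot frees up within the allowed wait."""


class ConnectionBudget:
    def __init__(self, max_in_flight, max_wait_seconds):
        # Each in-flight request holds one slot until it finishes.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._max_wait = max_wait_seconds

    @contextmanager
    def slot(self):
        # Wait briefly for a slot; fail fast instead of queuing for a long time.
        if not self._slots.acquire(timeout=self._max_wait):
            raise Overloaded("no slot free within the wait budget")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical limits: at most 50 concurrent requests, wait at most 0.5 s.
webhook_budget = ConnectionBudget(max_in_flight=50, max_wait_seconds=0.5)
```

A handler would wrap its work in `with webhook_budget.slot():` and return an error response on `Overloaded`. Giving one endpoint its own budget also keeps its slow requests from consuming capacity that other endpoints depend on.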

For the endpoint that services integration webhook requests, we are adding telemetry so that we can identify this kind of problem sooner. We are also adding optimizations to better isolate long-running requests and prevent them from impacting other connections.

Finally, while the incident response team did post updates, there was a delay in the initial update of the status page. We are reviewing our processes to ensure our customers receive information in a timely manner. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Oct 29, 2021 - 06:21 UTC

We have fully recovered. All systems are running properly.
Posted Oct 12, 2021 - 18:41 UTC
We are starting to see signs of recovery.
Posted Oct 12, 2021 - 18:32 UTC
We are experiencing an issue which is causing 500 errors on the website. Mitigation is in progress.
Posted Oct 12, 2021 - 17:58 UTC
This incident affected: Web Application (Web Application (US)).