Increased error rates in Web
Incident Report for PagerDuty
Postmortem

Summary

Between 15:05 and 15:44 UTC on Tuesday, April 26th, 2022, PagerDuty experienced an incident in the US service region that caused an average of 7% of requests to our website to fail over 7 minutes of traffic, with a peak failure rate of 24%. Affected customers would have seen Web UI error pages for the duration of the incident. The only impacted area was the Web UI in the US service region. The Web UI in the EU service region, as well as the REST APIs, Events API, mobile apps, and notification delivery across both the US and EU service regions, were unaffected during this time. No customer data was lost during this incident.

What Happened

At 15:05 UTC, unusual traffic began on the system that serves website traffic in the US service region. An issue with an integration, combined with this unusual traffic pattern, degraded the performance of website requests and made one specific endpoint particularly slow. A separate issue with retries in the Web UI then sent excessive traffic to the slow endpoint. Together, these created a resource bottleneck that increased the error rate of web requests.
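The report does not say how the retry issue in the Web UI was or will be addressed. As an illustration only, one common way to keep client retries from piling onto a slow endpoint is to cap the number of retries and space them out with exponential backoff and random jitter. The sketch below assumes that approach; the names `backoff_delays` and `call_with_retries`, and the default values, are hypothetical and are not taken from PagerDuty's code.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Return one "full jitter" delay (in seconds) per allowed retry.

    The ceiling doubles with each retry and is capped at `cap`; the actual
    delay is a random fraction of that ceiling, so clients do not retry in
    lockstep. All defaults here are illustrative assumptions.
    """
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays


def call_with_retries(request, max_retries=3, sleep=time.sleep, rng=random.random):
    """Call request(); after each failure wait a jittered delay, then retry.

    At most max_retries + 1 attempts are made. The final attempt is not
    wrapped, so its exception propagates to the caller.
    """
    for delay in backoff_delays(max_retries, rng=rng):
        try:
            return request()
        except Exception:
            sleep(delay)
    return request()
```

With a bounded retry count, a single failing page load generates at most `max_retries + 1` requests, rather than an open-ended stream against an endpoint that is already slow.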

PagerDuty became aware of this issue at about 15:09 UTC and initiated an incident response. Responders were able to stop the source of the unusual traffic, and we started seeing recovery at about 15:33 UTC. Customers will have noticed increased error rates on our Web UI during the incident period. Over the full duration, 2% of requests failed, with the greatest impact from 15:25 to 15:26 UTC, during which 24% of requests failed.

What Are We Doing About It

We are investigating ways of improving performance on the system that serves website traffic, to prevent situations where unusual traffic can cause issues for all customers in a service region. We have added a new limit to the specific endpoint involved in this incident to prevent unusual requests from affecting customers. We are also investigating changes to integrations and the Web UI to avoid creating the unusual request patterns that triggered the failures.
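The report does not specify what kind of limit was added to the endpoint. As a hedged illustration of how a per-endpoint limit can keep unusual traffic from consuming shared resources, the sketch below uses a token bucket, which is one common rate-limiting technique and not necessarily the one PagerDuty chose. The class name `TokenBucket` and its parameters are hypothetical.

```python
class TokenBucket:
    """Minimal token-bucket limiter (illustrative assumption, not PagerDuty's code).

    The bucket holds up to `capacity` tokens and refills at `rate` tokens
    per second. Each allowed request spends one token; when the bucket is
    empty, requests are rejected until it refills.
    """

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now                # timestamp of the last refill

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a real service the caller would pass a monotonic clock reading such as `time.monotonic()` for `now`, and would typically keep one bucket per client or per integration, so that a single noisy source is throttled without affecting other customers.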

We sincerely apologize for the impact this incident had on you and your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Jul 19, 2022 - 18:57 UTC

Resolved
We have now recovered and are monitoring the situation internally.
Posted Apr 26, 2022 - 15:44 UTC
Investigating
We are seeing increased error rates in our web application. We are investigating and will provide an update in the next 15 minutes.
Posted Apr 26, 2022 - 15:36 UTC
This incident affected: Web Application (Web Application (US), Web Application (EU)).