500 Errors on Website
Incident Report for PagerDuty
Postmortem

Summary

On July 15, 2021, from 18:43 UTC to 21:22 UTC, PagerDuty experienced a degradation in its ability to process web requests to pagerduty.com. As a result, customers experienced delays while using the web portal.

What Happened

We received a very large increase in traffic to non-existent paths within the PagerDuty web application. This caused one of our web routing services to become resource constrained and to restart intermittently because it was failing infrastructure health and resource usage checks. As a result, web requests from our customers were processed more slowly, and a small percentage of requests for pages in the web application received HTTP 500 responses.

We provisioned more capacity to handle the increase in traffic, which resolved the instability in the platform and restored nominal performance.

What Are We Doing About This?

We are instituting more meaningful load metric checking and alerting for our routing services, along with improved monitoring to identify and proactively block abnormal traffic before it impacts our systems’ ability to service normal requests.
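
As an illustration only, and not a description of PagerDuty's actual tooling, the kind of check described above could be sketched as a sliding-window counter of HTTP 404 responses per client. The class name, the client identifier, and the thresholds below are all hypothetical and chosen for the example.

```python
from collections import defaultdict, deque


class NotFoundRateMonitor:
    """Tracks HTTP 404 responses per client over a sliding time window.

    Hypothetical sketch: names and thresholds are assumptions for illustration.
    """

    def __init__(self, window_seconds=60.0, max_not_found=100):
        self.window_seconds = window_seconds
        self.max_not_found = max_not_found
        # client_id -> timestamps (in seconds) of that client's recent 404s
        self._events = defaultdict(deque)

    def record(self, client_id, status_code, now):
        """Record one response and return True if the client looks abnormal."""
        events = self._events[client_id]
        if status_code == 404:
            events.append(now)
        # Drop timestamps that have fallen out of the window.
        cutoff = now - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        # Flag the client once its 404 count in the window exceeds the limit.
        return len(events) > self.max_not_found


monitor = NotFoundRateMonitor(window_seconds=10, max_not_found=3)
flags = [monitor.record("client-a", 404, now=t) for t in (0, 1, 2, 3)]
later = monitor.record("client-a", 200, now=20)
```

In this sketch a client is flagged only after it exceeds the limit within the window, and the flag clears once the old 404s age out. A real deployment would also need to decide what action a flag triggers (alerting, rate limiting, or blocking) and how clients are identified.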

Posted Oct 13, 2021 - 00:20 UTC

Resolved
This incident has been resolved.
Posted Jul 15, 2021 - 23:38 UTC
Monitoring
We have recovered but are actively monitoring service metrics as we continue to investigate ways to prevent a recurrence of this issue.
Posted Jul 15, 2021 - 21:18 UTC
Update
We are still investigating the issue. Currently, there should be no impact.
Posted Jul 15, 2021 - 20:47 UTC
Investigating
We are experiencing an issue causing 500 errors on the website. We are investigating potential fixes.
Posted Jul 15, 2021 - 20:21 UTC
This incident affected: Web Application (Web Application (US)).