On July 15th, from 18:43 UTC to 21:22 UTC, PagerDuty experienced a degradation in its ability to process web requests to pagerduty.com. As a result, customers experienced delays while using the web portal.
We received a very large increase in traffic to non-existent paths within the PagerDuty web application. This caused one of our web routing services to become resource constrained and to restart intermittently after failing infrastructure health and resource usage checks. As a result, processing of web requests from our customers slowed down, and a small percentage of requests for pages in the web application received HTTP 500 responses.
We provisioned more capacity to handle the increase in traffic, which resolved the instability in the platform and restored nominal performance.
We are instituting more meaningful load metric checking and alerting for our routing services, along with improved monitoring to identify and proactively block abnormal traffic before it impacts our systems’ ability to service normal requests.
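As an illustration only, and not a description of PagerDuty's actual tooling, the kind of abnormal-traffic check described above could be sketched as a sliding-window monitor that flags a client when most of its recent requests hit non-existent paths. The class name `NotFoundRateMonitor`, the thresholds, the per-client keying, and the use of HTTP 404 as the signal for a non-existent path are all assumptions made for this sketch.

```python
from collections import defaultdict, deque


class NotFoundRateMonitor:
    """Hypothetical sketch: flag clients whose recent requests mostly return 404."""

    def __init__(self, window_seconds=60.0, min_requests=100, max_not_found_ratio=0.5):
        # All three thresholds are illustrative values, not taken from the incident.
        self.window_seconds = window_seconds
        self.min_requests = min_requests
        self.max_not_found_ratio = max_not_found_ratio
        # client_id -> deque of (timestamp, was_not_found), oldest first
        self._events = defaultdict(deque)

    def record(self, client_id, status_code, timestamp):
        """Record one completed request for a client."""
        events = self._events[client_id]
        events.append((timestamp, status_code == 404))
        self._evict(events, timestamp)

    def _evict(self, events, now):
        """Drop events that have fallen out of the sliding window."""
        cutoff = now - self.window_seconds
        while events and events[0][0] <= cutoff:
            events.popleft()

    def is_abnormal(self, client_id, now):
        """Return True if the client's recent 404 ratio exceeds the threshold."""
        events = self._events.get(client_id)
        if events is None:
            return False
        self._evict(events, now)
        total = len(events)
        # Require a minimum volume so a single mistyped URL is never flagged.
        if total < self.min_requests:
            return False
        not_found = sum(1 for _, was_not_found in events if was_not_found)
        return not_found / total > self.max_not_found_ratio


monitor = NotFoundRateMonitor(window_seconds=60, min_requests=5, max_not_found_ratio=0.5)
for i in range(10):
    monitor.record("scanner", 404, timestamp=float(i))
    monitor.record("customer", 200, timestamp=float(i))
monitor.record("customer", 404, timestamp=10.0)

print(monitor.is_abnormal("scanner", now=10.0))
print(monitor.is_abnormal("customer", now=10.0))
```

In this sketch the minimum-request threshold keeps low-volume clients from being flagged, and the time window lets a client recover once its traffic returns to normal. A real deployment would more likely enforce such a rule at the edge or in the routing layer rather than in application code.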