Increased Latency for Web and API Requests
Incident Report for PagerDuty
Postmortem

Summary

On April 19th, 2021, from 20:16 UTC to 21:44 UTC, PagerDuty experienced a major incident that impacted customer site availability. During this time, event processing and notifications were unaffected.

What Happened

At 20:16 UTC, a major incident was manually triggered in response to elevated error metrics. Because of earlier database activity and the similarity to previous incidents, the early investigation focused on database performance.

At 20:36 UTC, we determined that the database was not the cause and attempted to remediate by redeploying an older version of our core application.
At 20:38 UTC, while the redeploy was in progress, we narrowed the problem to increased load from a particular configuration that was generating slow queries.

Efforts to isolate the problematic configuration intensified until, at 21:13 UTC, we identified the source of the increased load and disabled it.

Following this, core application and database health gradually recovered.

At 21:35 UTC, we determined that all systems had recovered.

At 21:44 UTC, the incident was officially resolved.

What Are We Doing About This

We're working to reduce the time between observing a potential anomaly and associating it with specific changes in traffic.

We're also working to improve our traffic isolation, rate limiting, and load remediation capabilities.

Finally, we would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Apr 21, 2021 - 18:22 UTC

Resolved
This incident has been resolved.
Posted Apr 19, 2021 - 21:44 UTC
Monitoring
We have identified the cause of the latency issues and have implemented a fix. We are monitoring recovery.
Posted Apr 19, 2021 - 21:38 UTC
Investigating
We are experiencing increased latency for web and API requests. We are currently investigating.
Posted Apr 19, 2021 - 20:35 UTC
This incident affected: REST API (REST API (US)), Mobile Application (Mobile Application (US)), and Web Application (Web Application (US)).