On April 19th, 2021, from 20:16 UTC to 21:44 UTC, PagerDuty experienced a major incident that impacted customer site availability. During this time, event processing and notifications were unaffected.
At 20:16 UTC, a major incident was manually triggered in response to elevated errors in our metrics and graphs. Because of earlier database activity and the similarity to previous events, early analysis and discussion focused on database performance.
At 20:36 UTC, we determined that the database was a red herring and attempted to remediate by redeploying an older version of our core application.
While the redeploy was in progress, at 20:38 UTC, we roughly isolated the problem to increased load in a particular configuration that was generating slow queries.
Attempts to isolate the problematic configuration escalated until, at 21:13 UTC, we identified the cause of the increased load and disabled it.
Following this, core application and database health gradually recovered.
At 21:35 UTC, we determined that all systems had recovered.
At 21:44 UTC, the incident was officially resolved.
We're working to reduce the time between observing a potential anomaly and associating it with specific changes in traffic.
We're also working on improvements to our traffic isolation, rate limiting, and load remediation.
Finally, we would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.