Issues with web/api and events
Incident Report for PagerDuty
Postmortem

Summary

On January 12, 2021, from 22:35 to 22:55 UTC, PagerDuty experienced a degradation of its web services and a delay in event processing.

During this time, components of the PagerDuty web application related to incidents, alerts, and log entries (i.e., the incident timeline) were inaccessible or unusable, and events sent to PagerDuty via the Events API were delayed by up to 20 minutes.

What Happened

At 22:35 UTC, a change was deployed to PagerDuty web services containing a bug that was not noticed in code review and did not trigger test failures. Additional tests that might have revealed the bug were not performed, because the risk associated with the change was erroneously assessed as low. Moreover, the health check performed in the canary step of the automated deploy process was not able to detect this failure mode, so the deployment continued.

PagerDuty engineers were notified of the rise in error rate at 22:36 UTC and began investigating. At 22:45 UTC, the cause of the incident was identified, and a deployment reverting the change began.

At 22:55 UTC, the revert deployment completed, and full recovery was achieved.

What Are We Doing About This

We are working to improve our internal monitoring to facilitate swifter identification and reversion of deployments that introduce regressions. We are also working on improvements to our canary deploy process to allow detection of more diverse failure modes, to help us catch and stop regressions earlier in deployment.
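
As an illustration only, a canary gate that looks beyond a single health endpoint might compare per-route error rates on canary hosts against the rest of the fleet. The sketch below is not PagerDuty's deploy tooling: the `RouteStats` shape, the route names, and the thresholds are all hypothetical and chosen just to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteStats:
    """Request counts for one route on one group of hosts (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat a route with no traffic as having no observed errors.
        return self.errors / self.requests if self.requests else 0.0


def canary_regressions(baseline, canary, max_increase=0.05, min_requests=20):
    """Return the routes whose canary error rate exceeds the baseline
    error rate by more than max_increase (both thresholds are assumed values)."""
    regressed = []
    for route, stats in canary.items():
        if stats.requests < min_requests:
            continue  # too little canary traffic to judge this route
        base = baseline.get(route, RouteStats(0, 0))
        if stats.error_rate - base.error_rate > max_increase:
            regressed.append(route)
    return sorted(regressed)


def should_continue_deploy(baseline, canary) -> bool:
    """Continue the rollout only if no route regressed on the canary hosts."""
    return not canary_regressions(baseline, canary)
```

With a gate of this kind, a canary whose health endpoint still responds normally but whose incident-related routes return errors would be flagged, and the rollout could be halted before reaching the rest of the fleet.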

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Jan 15, 2021 - 18:29 UTC

Resolved
We have fully recovered.
Posted Jan 12, 2021 - 23:06 UTC
Monitoring
We are seeing recovery and will continue to monitor progress.
Posted Jan 12, 2021 - 23:03 UTC
Investigating
We are currently experiencing issues with our API, web UI, and events. We are investigating the full impact and cause.
Posted Jan 12, 2021 - 22:52 UTC
This incident affected: Notification Delivery (US), REST API (US), Web Application (US), and Events API (US).