On January 12, 2020, from 22:35 to 22:55 UTC, PagerDuty experienced a degradation of its web services and a delay in event processing.
During this time, components of the PagerDuty web application related to incidents, alerts, and log entries (i.e., the incident timeline) were inaccessible or unusable, and events sent to PagerDuty via the Events API were delayed by up to 20 minutes.
At 22:35 UTC, a change was deployed to PagerDuty web services containing a bug that was not caught in code review and did not trigger test failures. Additional tests that might have revealed the bug were not performed, as the risk associated with the change was erroneously assessed as low. Moreover, the health check performed in the canary step of the automated deploy process could not detect this mode of failure, so the deployment continued.
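To illustrate this class of failure (a hypothetical sketch, not PagerDuty's actual deploy tooling): a canary gate that checks only process liveness will pass a build whose bug breaks specific endpoints, while a gate that also compares the canary's request error rate against a threshold would halt the rollout.

```python
def liveness_only_gate(canary_status: dict) -> bool:
    """Pass the canary if the health endpoint responds at all.

    This is the gap described above: a bug that breaks specific web
    endpoints but leaves the health check responding sails through,
    and the deployment continues.
    """
    return canary_status.get("healthz", 500) < 500


def error_rate_gate(canary_status: dict, threshold: float = 0.01) -> bool:
    """A stricter gate: in addition to liveness, require the canary's
    request error rate to stay below a threshold, so endpoint-level
    regressions stop the deploy at the canary step."""
    healthy = canary_status.get("healthz", 500) < 500
    errors = canary_status.get("errors", 0)
    requests = max(canary_status.get("requests", 1), 1)
    return healthy and (errors / requests) <= threshold


# A canary with broken incident endpoints: the process is "up" (healthz 200),
# but 12% of requests are failing.
buggy_canary = {"healthz": 200, "requests": 1000, "errors": 120}
```

With this example canary, `liveness_only_gate` returns `True` (the deploy proceeds) while `error_rate_gate` returns `False` (the deploy halts). The field names and thresholds here are illustrative assumptions.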
PagerDuty engineers were notified at 22:36 UTC of the rise in error rates and began investigating. At 22:45 UTC, the cause of the incident was identified, and a revert of the change began deploying.
At 22:55 UTC, the revert deployment completed and full recovery was achieved.
We are working to improve our internal monitoring to enable swifter identification and reversion of deployments that introduce regressions. We are also improving our canary deploy process to detect a wider range of failure modes, helping us catch and stop regressions earlier in the deploy pipeline.
We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.