Issue affecting Web Application
Incident Report for PagerDuty
Postmortem

Summary

On April 30, 2020, between 21:26 and 21:49 UTC, a portion of requests to the PagerDuty web application and REST API received HTTP 500 (Internal Server Error) responses. Events sent to the Events API were also delayed.

The impact began with a small percentage of requests and peaked between 21:35 and 21:43 UTC, when 100% of requests received Internal Server Error responses. The issue was resolved at 21:49 UTC, and all remaining unprocessed events were processed by 22:00 UTC.

What happened

As part of an upgrade to the PagerDuty web application and its REST API, a dependency used only in unit tests was intentionally removed. At the time, the peer-reviewed change was judged sufficiently low risk to deploy to all environments simultaneously, and it began deploying to our production environment at 21:26 UTC.

However, another dependency, which PagerDuty uses for database access, automatically loads the unit testing dependency as a side effect. This caused load errors in the production environment, where the unit testing dependency was absent. The canary stage of the deployment did not detect the issue because the application health check used in that stage did not perform the actions that trigger these load errors. As a result, the change proceeded and was deployed to the entire production environment.
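The failure mode described above can be sketched in a few lines. This is an illustrative sketch only, not PagerDuty's code: Python stands in for whatever stack the application uses, and the module name and all function names are hypothetical. It shows how a health check that never touches the database path can pass while a request that does touch it hits a load error.

```python
import importlib

# Hypothetical name standing in for the test-only package that was removed;
# it is assumed NOT to be installed in this environment.
TEST_ONLY_MODULE = "hypothetical_test_only_dependency"


def shallow_health_check() -> bool:
    """Reports healthy whenever the process is up; touches no lazy code path."""
    return True


def open_database_session() -> str:
    """Stands in for a database library that loads an extra module on first use."""
    importlib.import_module(TEST_ONLY_MODULE)  # raises if the package is absent
    return "session"


def deep_health_check() -> bool:
    """Exercises the database path, so a missing dependency surfaces here."""
    try:
        open_database_session()
    except ImportError:
        return False
    return True
```

Under these assumptions, a canary gated on `shallow_health_check` would be promoted, while one gated on `deep_health_check` would be stopped.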

In response, PagerDuty engineers reverted the change that caused the issue. The revert began deploying at 21:43 UTC and completed at 21:49 UTC.

What are we doing about this?

We have already modified the application health check to include database connectivity, so that any issue affecting it will be detected during the canary stage of a deployment. We are also developing an expedited deployment procedure that will allow us to remediate production issues more quickly.
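As a rough illustration of a health check that includes database connectivity, the sketch below opens a connection and runs a trivial query before reporting healthy. It is a minimal example under stated assumptions, not the actual PagerDuty health check: it uses Python with an in-memory SQLite database as a stand-in for the real datastore, and the function names and response shape are hypothetical.

```python
import sqlite3


def _default_connect():
    # Assumption: an in-memory SQLite database stands in for the real datastore.
    return sqlite3.connect(":memory:")


def check_database(connect=_default_connect) -> bool:
    """Return True only if a connection opens and a trivial query round-trips."""
    try:
        conn = connect()
        try:
            row = conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except Exception:
        # Any failure on the database path, including a load error raised
        # while the driver initializes, marks the instance as unhealthy.
        return False
    return row == (1,)


def health_check(connect=_default_connect) -> dict:
    """Build a health payload whose status depends on the database probe."""
    db_ok = check_database(connect)
    return {"status": "ok" if db_ok else "unavailable", "database": db_ok}
```

Because the probe goes through the same connection path that real requests use, an error raised while loading or initializing the database layer turns the check unhealthy instead of going unnoticed.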

We are very sorry for any inconvenience that this incident caused. If you have any further questions, please feel free to reach out to support@pagerduty.com.

Posted May 05, 2020 - 21:50 UTC

Resolved
This incident has been resolved.
Posted Apr 30, 2020 - 22:01 UTC
Monitoring
We have resolved the issue that caused the increased error rate in the PagerDuty platform, and we are now monitoring the recovery process.
Posted Apr 30, 2020 - 21:59 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 30, 2020 - 21:50 UTC
Update
We are continuing to investigate this issue.
Posted Apr 30, 2020 - 21:49 UTC
Investigating
We are aware of increased error rates when accessing the PagerDuty platform. Our teams are working to resolve the issue.
Posted Apr 30, 2020 - 21:49 UTC
This incident affected: Events API, REST API, Web Application, and Mobile Application.