On April 30th, between 21:26 and 21:49 UTC, a portion of requests to the PagerDuty web application and REST API were met with status 500 responses. Events sent to the Events API were also delayed.
The impact began with a small percentage of requests and peaked between 21:35 and 21:43 UTC, during which 100% of requests received Internal Server Error responses. The issue was fully resolved at 21:49 UTC, and all remaining queued events were processed by 22:00 UTC.
As part of an upgrade to the PagerDuty web application and its REST API, a dependency used only in unit tests was intentionally removed. At the time, the peer-reviewed change was deemed low-risk enough to deploy to all environments simultaneously, and it began deploying to our production environment at 21:26 UTC.
However, another dependency, which PagerDuty uses for database access, had a side effect that triggered automatic loading of the unit-testing dependency. This caused load errors in the production environment, where that dependency was now absent. The issue was not detected in the canary stage of the deployment because the application health check used in that stage did not perform the actions that triggered these load errors. As a result, the change proceeded and deployed to the entire production environment.
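To make the failure mode concrete, here is a minimal sketch of this class of bug. It is illustrative only: the class and module names are hypothetical, and PagerDuty's actual stack is not shown. The point is that a library's side effect can dynamically load a module that exists in test environments but not in production, so the error only surfaces when that code path runs.

```python
import importlib


class DatabaseClient:
    """Hypothetical stand-in for the database-access dependency."""

    def connect(self):
        # Side effect: the library automatically imports an optional
        # test-support module when a connection is first made. In an
        # environment where that module was removed, this raises
        # ModuleNotFoundError, which surfaces to callers as a 500.
        importlib.import_module("test_support")  # hypothetical module name
        return "connected"


client = DatabaseClient()
try:
    client.connect()
except ModuleNotFoundError as exc:
    # Behaviour in production after the test-only dependency was removed
    print(f"load error: {exc}")
```

Because the import happens only inside `connect()`, any check that never touches the database passes cleanly, which is exactly why the canary stage did not catch it.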
In response, PagerDuty engineers reverted the change that caused the issue. The revert began deploying at 21:43 UTC and completed at 21:49 UTC.
We have already modified the application health check to include database connectivity such that any issue affecting it will be detected in canary deployment. We are also developing an expedited deployment procedure that will allow us to remediate issues in production more quickly.
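A health check of this shape can be sketched as follows. This is a simplified illustration, not PagerDuty's implementation; the class names and return format are assumptions. The key change is that the check now exercises database connectivity, so a load error like the one in this incident fails the canary before the rollout continues.

```python
class HealthyDB:
    """Hypothetical database client whose dependencies load correctly."""

    def connect(self):
        return "connected"


class BrokenDB:
    """Hypothetical client that hits a load error on first use."""

    def connect(self):
        raise ModuleNotFoundError("No module named 'test_support'")


def health_check(db_client):
    """Canary health check that verifies database connectivity rather
    than only process liveness."""
    try:
        db_client.connect()  # would trip the same load error seen in this incident
        return {"status": "ok"}
    except Exception as exc:
        # A failing check here halts the canary before full deployment
        return {"status": "fail", "reason": str(exc)}


print(health_check(HealthyDB()))
print(health_check(BrokenDB()))
```

With the old liveness-only check, both clients would have reported healthy; with connectivity included, the broken deployment is rejected at the canary stage.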
We are very sorry for any inconvenience this incident caused. If you have any further questions, please reach out to support@pagerduty.com.