On August 27, 2018, at 18:54 UTC, PagerDuty experienced a 51-minute performance degradation that delayed incident creation through HTTP (Events API) and email-based integrations, as well as notification delivery. During this time, incidents were created more slowly, and notifications for those incidents were sent later than they otherwise would have been.
In the final steps of our procedure to update our web application's data storage schema, the deployment of new application code that coincided with the schema migration failed our canary deploy tests, and the deployment was canceled. However, the schema versioning information, which tells the web application hosts which schema cache fileset to use when rebuilding data models, had already been updated. This left our application in an inconsistent state, since the versioning information referenced a schema that was no longer available. The impacted hosts experienced elevated system resource usage and other adverse effects.
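The inconsistency described above can be sketched as a simple invariant check: the recorded schema version must point at a cache fileset that actually exists on the host. The paths, file names, and layout below are illustrative assumptions, not PagerDuty's actual configuration:

```python
import os

# Hypothetical layout (illustrative only): a version pointer file names the
# schema cache fileset that application hosts use to rebuild data models.
SCHEMA_CACHE_DIR = "/var/app/schema_cache"
VERSION_FILE = "/var/app/schema_version"

def active_cache_is_present(version_file=VERSION_FILE, cache_dir=SCHEMA_CACHE_DIR):
    """Return True if the fileset named by the version pointer exists on disk.

    During the incident, the pointer was updated but the matching cache
    fileset was never generated, so a check like this would have failed.
    """
    with open(version_file) as f:
        version = f.read().strip()
    fileset = os.path.join(cache_dir, version)
    return os.path.isdir(fileset)
```

A host failing this check is in exactly the state described above: it will attempt to rebuild data models against a schema cache that does not exist.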
Due to a miscommunication during the process of reverting the deployment, the schema version information was not reverted. Furthermore, the reversion was applied in a way that triggered a deploy to the entire fleet. As a result, the effects originally limited to the canary hosts appeared fleet-wide: hosts began trying to rebuild data models without the updated schema cache, which would normally be generated during a schema migration.
Once these missteps were realized, the version information was reverted and the web application was restarted. This corrected the version mismatch between the recorded version information and the latest schema cache fileset on the hosts’ filesystems. Within minutes, PagerDuty finished working through the backlog of tasks that had accumulated in its event processing data pipeline, and our services fully recovered.
Our approach to mitigating risk in the migration process is twofold. First, we are improving our internal documentation around the process of schema migrations, including procedures for remedial actions to take when key steps in a migration do not succeed. Second, we are investigating ways to safely automate the final steps in the migration process, as well as safeguards against the condition that caused the service degradation in this incident.
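One form such a safeguard could take is treating the version-pointer update and the canary deploy as a single unit of work, so that a failed canary automatically restores the previous version with no manual revert step to miss. This is a minimal sketch under that assumption; the function names and callbacks are hypothetical, not PagerDuty's tooling:

```python
def migrate_with_rollback(read_version, write_version, run_canary_deploy, new_version):
    """Update the schema version pointer, restoring the previous value
    automatically if the canary deploy does not succeed.

    read_version/write_version/run_canary_deploy are injected callables
    (illustrative): read and write the version pointer, and run the canary
    deploy returning True on success.
    """
    previous = read_version()
    write_version(new_version)
    try:
        if not run_canary_deploy():
            raise RuntimeError("canary deploy failed")
    except Exception:
        # Automatic revert: the version pointer can never be left
        # referencing a schema that was not successfully deployed.
        write_version(previous)
        raise
    return new_version
```

Because the revert lives in the same code path as the update, a canceled deployment cannot leave the version information pointing at an unavailable schema, which is the inconsistent state that triggered this incident.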