On December 8th, 2020, from 19:50 UTC to 21:08 UTC, PagerDuty experienced a major incident that manifested as various disruptions to normal function in our web UI.
During this time, multiple web pages were inaccessible or unusable. These included the login, teams, and service directory pages.
At 19:50 UTC, a configuration change that was intended to improve security for user sessions was deployed to the web servers. However, this caused authentication to fail for requests to several distributed services. Within minutes, PagerDuty engineers were notified of the issue.
At 20:06 UTC, the configuration change was identified as the likely cause, and an emergency rollback was attempted shortly thereafter. However, the emergency rollback function was impaired by other work that had been deployed on the same day, so the configuration change had to be undone via the regular, incremental rollout.
By 20:30 UTC, the configuration change was undone on all the web servers. Logged-in web users who were active during the disruptions were required to log back in, after which pages began to load normally. Additional time was taken to continue monitoring the application.
At 21:08 UTC, PagerDuty determined that all systems had been restored.
We are working on increasing the coverage of the automated test suites for the full stack of our distributed services, and we are adding more health checks to the web application’s deployment canaries to better protect against bad deployments. Lastly, we are improving the reliability of the emergency rollback function; a working emergency rollback would have significantly shortened the incident.
We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.