Web UI pages not able to load
Incident Report for PagerDuty
Postmortem

Summary

On December 8th, 2020, from 19:50 UTC to 21:08 UTC, PagerDuty experienced a major incident, which was published as various disruptions to normal function in our web UI.

During this time, multiple web pages were inaccessible or unusable. These included the login, teams, and service directory pages.

What Happened

At 19:50 UTC, a configuration change that was intended to improve security for user sessions was deployed to the web servers. However, this caused authentication to fail for requests to several distributed services. Within minutes, PagerDuty engineers were notified of the issue.

At 20:06 UTC, the configuration change was identified as the likely cause, and an emergency rollback was attempted shortly thereafter. However, the emergency rollback function was impaired by other work that had been deployed on the same day, forcing the configuration change to be undone via the regular, incremental rollout.

By 20:30 UTC, the configuration change was undone on all the web servers. Logged-in web users who were active during the disruptions were required to log back in, after which pages began to load normally. Additional time was taken to continue monitoring the application.

At 21:08 UTC, PagerDuty determined that all systems had been restored.

What Are We Doing About This

We are working on increasing the coverage of the automated test suites for the full stack of our distributed services, and we are adding more health checks to the web application’s deployment canaries to better protect against bad deployments. Lastly, we are improving the reliability of the emergency rollback function; a working emergency rollback would have significantly shortened the incident.
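
As a purely illustrative sketch of the canary health check idea (not PagerDuty's actual deployment tooling), a rollout gate could run a set of checks against a canary host and continue only if every check passes. All names below, including `run_canary_checks`, `should_continue_rollout`, and the two sample checks, are hypothetical, and the checks are stubbed rather than making real requests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    healthy: bool


def run_canary_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check; a check that raises is recorded as unhealthy."""
    results = []
    for name, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        results.append(CheckResult(name, healthy))
    return results


def should_continue_rollout(results: List[CheckResult]) -> bool:
    """Continue only if at least one check ran and all checks passed."""
    return bool(results) and all(r.healthy for r in results)


# Hypothetical stubbed checks. A real check would issue requests to the canary.
def login_page_loads() -> bool:
    return True


def session_accepted_by_downstream_service() -> bool:
    # Simulates the failure mode described above: the web server is up,
    # but authenticated requests to a distributed service are rejected.
    return False


results = run_canary_checks({
    "login_page_loads": login_page_loads,
    "session_accepted_by_downstream_service": session_accepted_by_downstream_service,
})
failed = [r.name for r in results if not r.healthy]
proceed = should_continue_rollout(results)
```

In this sketch the rollout would halt at the canary stage, because the check that exercises an authenticated call to a downstream service fails even though the login page itself loads.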

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Dec 18, 2020 - 17:25 UTC

Resolved
We have fully recovered. Users may need to log in again in the web UI in order to see restored functionality.
Posted Dec 08, 2020 - 21:08 UTC
Monitoring
We have implemented a fix and have seen full recovery of functionality. We are continuing to monitor.
Posted Dec 08, 2020 - 20:36 UTC
Investigating
We’re currently seeing various disruptions to normal function in our web UI, including unexpected errors, inability to load pages such as the service directory, continual refreshing, and login issues. We are currently investigating.
Posted Dec 08, 2020 - 20:19 UTC
This incident affected: Web Application (Web Application (US)).