Users Unable to Log Into PagerDuty
Incident Report for PagerDuty
Postmortem

Summary

On November 18, 2019 at 20:35 UTC, PagerDuty had an incident that impacted customers using the PagerDuty Web Application, the PagerDuty Mobile Application, and the PagerDuty REST API. During this time, customers experienced longer response times or timeouts. The incident was resolved at 20:51 UTC on November 18, 2019 by reverting the change that triggered these symptoms.

Event Ingestion and Notification Delivery were not impacted during this time.

What Happened

On November 18, 2019 at 19:56 UTC, a change was made to upgrade our load balancers to the latest version in order to improve performance and pick up bug fixes. The rollout plan for this change used our canary mechanism, which upgrades only a single load balancer at a time.

When one load balancer was upgraded, the remaining, non-upgraded load balancers were affected as well. During this time, the non-upgraded load balancers were receiving traffic from the upgraded load balancer, which caused them to become unhealthy and unable to serve customer traffic.

The change was reverted at 20:51 UTC and all of the load balancers returned to a healthy state immediately.

What Are We Doing About This

First, we are changing the way that we perform maintenance on our load balancers. Going forward, we will set up a new pool of load balancers alongside our existing pool, instead of upgrading load balancers in place.
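
As a rough illustration of the parallel-pool approach, the sketch below switches traffic to a new pool only once every load balancer in it passes a health check, and otherwise leaves the existing pool serving. It is a minimal sketch under stated assumptions, not a description of our actual tooling: the `Pool` type, the `cut_over` function, the `is_healthy` callback, and the node names and version strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pool:
    """A named group of load balancers running one version (hypothetical model)."""
    name: str
    version: str
    nodes: List[str] = field(default_factory=list)


def cut_over(active: Pool, candidate: Pool,
             is_healthy: Callable[[str], bool]) -> Pool:
    """Return the pool that should serve traffic.

    The candidate pool takes over only if it has at least one node and every
    node passes the health check. Otherwise the existing pool keeps serving
    and is never modified, so there is nothing to revert.
    """
    if candidate.nodes and all(is_healthy(node) for node in candidate.nodes):
        return candidate
    return active


# Hypothetical pools: the existing one and a freshly built, upgraded one.
existing = Pool("existing", "old-version", ["lb-a", "lb-b"])
upgraded = Pool("upgraded", "new-version", ["lb-c", "lb-d"])

# The health check here is a stand-in; a real one would probe each node.
serving = cut_over(existing, upgraded, is_healthy=lambda node: True)
print(serving.name)
```

Because the existing pool is left untouched until the cutover decision, falling back amounts to continuing to route traffic to it.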

Second, we are revisiting our canary mechanism for these types of changes. The current mechanism relies too heavily on complex configuration files to roll out a change, which in this case made it hard to tell which load balancers had been upgraded.
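
One way to make rollout state easier to read is to record it explicitly rather than infer it from configuration files. The sketch below is illustrative only and assumes a hypothetical `CanaryRollout` record that maps each load balancer to the version it is running; the class, method names, node names, and version strings are not taken from our systems.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CanaryRollout:
    """Explicit record of a one-at-a-time rollout (hypothetical model)."""
    target_version: str
    versions: Dict[str, str]  # load balancer name -> version currently running

    def upgraded(self) -> List[str]:
        """Names of load balancers already on the target version."""
        return sorted(n for n, v in self.versions.items()
                      if v == self.target_version)

    def pending(self) -> List[str]:
        """Names of load balancers not yet on the target version."""
        return sorted(n for n, v in self.versions.items()
                      if v != self.target_version)

    def upgrade_next(self) -> Optional[str]:
        """Mark exactly one pending load balancer as upgraded.

        Returns its name, or None when nothing is left to upgrade.
        """
        remaining = self.pending()
        if not remaining:
            return None
        self.versions[remaining[0]] = self.target_version
        return remaining[0]


rollout = CanaryRollout(
    target_version="new-version",
    versions={"lb-1": "old-version", "lb-2": "old-version", "lb-3": "old-version"},
)
rollout.upgrade_next()
print("upgraded:", rollout.upgraded())
print("pending:", rollout.pending())
```

With state held this way, the question "which load balancers have been upgraded?" is answered by a single call rather than by reading through configuration.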

We are very sorry for the impact that this caused. We know that our customers rely on PagerDuty to be running consistently and we did not meet that promise. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Nov 23, 2019 - 01:06 UTC

Resolved
We have fully recovered. All our systems are operational, and users are now able to log into PagerDuty.
Posted Nov 18, 2019 - 22:09 UTC
Monitoring
We have deployed a fix to this issue. We're currently monitoring for full recovery. Notification Delivery and the Events API have not been impacted during the course of this incident.
Posted Nov 18, 2019 - 22:04 UTC
Investigating
We're currently experiencing an issue where users are unable to log into PagerDuty. Additionally, those who are already logged in may experience issues loading the Incidents page. We're currently investigating.
Posted Nov 18, 2019 - 21:56 UTC
This incident affected: Web Application and Mobile Application.