On November 18, 2019 at 20:35 UTC, PagerDuty had an incident that impacted customers using the PagerDuty Web Application, the PagerDuty Mobile Application, and the PagerDuty REST API. During this time, customers would have experienced longer response times or timeouts. The incident was resolved at 20:51 UTC on November 18, 2019 by reverting the change that triggered these symptoms.
Event Ingestion and Notification Delivery were not impacted during this time.
On November 18, 2019 at 19:56 UTC, a change was made to upgrade our load balancers to the latest version, to improve performance and pick up bug fixes. The rollout plan for this change leveraged a canary mechanism that upgrades only a single load balancer at a time.
When the first load balancer was upgraded, it began sending traffic to the rest of our non-upgraded load balancers. This, in turn, caused the non-upgraded load balancers to become unhealthy and unable to serve customer traffic.
The change was reverted at 20:51 UTC and all of the load balancers returned to a healthy state immediately.
First, we are changing the way that we perform maintenance on our load balancers. Going forward, we will set up a new pool of load balancers alongside our existing pool, instead of doing these upgrades in place.
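The pool-swap approach can be sketched as follows. This is a minimal illustration only, not PagerDuty's actual tooling; the `LoadBalancer`, `Pool`, and `swap_pools` names are assumptions made for the example:

```python
# Illustrative sketch of a pool-swap (blue/green) load balancer upgrade.
# All names here are hypothetical, not PagerDuty's real infrastructure.

from dataclasses import dataclass


@dataclass
class LoadBalancer:
    name: str
    version: str
    healthy: bool = True


@dataclass
class Pool:
    members: list

    def all_healthy(self) -> bool:
        return all(lb.healthy for lb in self.members)


def swap_pools(active: Pool, target_version: str) -> Pool:
    """Provision a fresh pool at the target version, verify it, then cut over.

    The old pool keeps serving traffic until the new pool passes health
    checks, so a bad upgrade never touches the live fleet, and reverting
    means simply pointing traffic back at the untouched old pool.
    """
    new_pool = Pool([
        LoadBalancer(f"lb-new-{i}", target_version)
        for i in range(len(active.members))
    ])
    if not new_pool.all_healthy():
        raise RuntimeError("new pool failed health checks; keeping old pool")
    return new_pool  # caller shifts traffic to the returned pool
```

The key property is that the existing pool is never modified, so rollback is a traffic change rather than a software downgrade.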
Second, we are revisiting our canary mechanism for these types of changes. The current mechanism relies too heavily on complex configuration files to roll out a change, which in this case made it hard to determine which load balancers had been upgraded.
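One way to make rollout state explicit, rather than inferring it from configuration files, is to track it directly in the canary loop. The sketch below is hypothetical (`canary_upgrade`, `health_check`, and the fleet model are all illustrative assumptions, not PagerDuty's actual mechanism):

```python
# Hypothetical canary upgrade loop that records rollout state explicitly,
# so "which load balancers have been upgraded?" is always answerable.

from dataclasses import dataclass


@dataclass
class LoadBalancer:
    name: str
    version: str


def canary_upgrade(fleet, target_version, health_check):
    """Upgrade one load balancer at a time, recording state as we go.

    Returns a dict mapping each load balancer's name to "upgraded",
    "pending", or "rolled-back". After a failed health check, the canary
    is reverted and the rollout stops.
    """
    state = {lb.name: "pending" for lb in fleet}
    for lb in fleet:
        previous = lb.version
        lb.version = target_version
        # Check the whole fleet, not just the canary: in this incident a
        # single upgraded node made its non-upgraded peers unhealthy.
        if all(health_check(peer) for peer in fleet):
            state[lb.name] = "upgraded"
        else:
            lb.version = previous
            state[lb.name] = "rolled-back"
            break
    return state
```

Because the returned state is a plain mapping, an operator can see at a glance which nodes ran the new version, rather than reconstructing that from layered configuration.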
We are very sorry for the impact that this caused. We know that our customers rely on PagerDuty to be running consistently and we did not meet that promise. For any questions, comments, or concerns, please contact us at email@example.com.