On February 27th, 2020, from 23:41:50 UTC to 23:57:20 UTC, PagerDuty experienced an incident causing the PagerDuty web application to be inaccessible. The REST API, notification delivery, event ingestion, and the mobile app were all unaffected during this time.
A configuration change was made to PagerDuty's front-end load balancers. When the change was initially deployed to production, the deployment appeared successful. However, the change had not properly taken effect, and it left the load balancers in a state where they could not reload any new configuration without being restarted. The load balancers continued to run normally during this period. 1.5 hours later, a separate deployment of another downstream system required the load balancers to reload their configuration, and because of the earlier change this reload did not take effect.
The effects of the earlier change were not observed during testing because alerting was missing for this particular error condition.
Once the issue was observed in production, the load balancer configuration change was reverted. The revert took effect immediately and functionality was restored.
The original configuration changes to the load balancers have since been successfully deployed, with the necessary steps taken to ensure that the same issue does not recur. Additional alerting on potentially breaking changes has also been added to both testing and production environments so that this type of error is caught during testing in the future.
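To illustrate the kind of check that can surface a reload that silently fails, here is a minimal Python sketch. It is not PagerDuty's implementation. It assumes the load balancer can report a fingerprint of the configuration it has actually loaded; that fingerprint, and the names `config_digest`, `check_reload`, and `ReloadCheck`, are hypothetical and exist only for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReloadCheck:
    ok: bool
    reason: str


def config_digest(config_text: str) -> str:
    """Return a SHA-256 fingerprint of a configuration file's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def check_reload(deployed_config: str, running_digest: Optional[str]) -> ReloadCheck:
    """Compare the deployed config with what the load balancer says it loaded.

    running_digest is an assumption: it stands for whatever fingerprint the
    running process exposes, or None if it reported nothing.
    """
    if running_digest is None:
        return ReloadCheck(False, "load balancer did not report a loaded config")
    if running_digest != config_digest(deployed_config):
        return ReloadCheck(False, "running config differs from deployed config")
    return ReloadCheck(True, "running config matches deployed config")
```

A check like this could run after every deployment in both testing and production, raising an alert whenever `ok` is false, so that a reload that does not take effect is noticed at deploy time and not when a later change depends on it.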
We know that our customers rely on PagerDuty to provide up-to-date and accurate information. We apologize for this degradation, and we will do our best to make sure that this does not happen again. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org.