On February 13th, 2023, between 18:50 UTC and 21:01 UTC, PagerDuty experienced an incident that caused delays of up to 6 minutes in the delivery of notifications and subscriber updates in the US and EU service regions, as well as bursts of errors in the Web UI, Mobile UI, and the REST API. PagerDuty experienced a second incident between 16:16 UTC and 18:57 UTC on Tuesday, February 14th, which caused delays of up to 9 minutes in the delivery of notifications and subscriber updates in both regions, along with similar bursts of errors in the Web UI, Mobile UI, and the REST API. During these periods, approximately 1% of requests to the Web UI, Mobile UI, or REST API returned 500 “Internal Server Error” responses. The PagerDuty web application was unavailable in the EU service region between 17:55 UTC and 18:26 UTC. No events were lost or dropped during these times.
The incident had multiple contributing causes, which can best be summarized as “dependencies are hard.” The fault itself was caused by an upgrade of a shared component that introduced a latent bug, one that was only triggered when services downstream of that component were redeployed.
Several of our services use a request router component to send requests to downstream services. That service runs on an internal base image that is one of our shared platform components. The base image was updated on Monday, February 13th at approximately 18:00 UTC to bring an external software library in one of the components of the request router service up to its latest minor version. Due to the combination of how the component uses this library and a change in the external library’s runtime, this update contained a breaking change. The service pins the major and minor versions, which allows us to automatically pick up security patches whenever we redeploy our Web application. The breaking change was published under the same minor version our service was pinned to, so at 18:04 UTC, when a new deploy of our Web application started, it automatically pulled the new base image. Both the team making the change and the teams responding to the incident misjudged the significance of the update.
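The pinning behavior described above can be sketched as follows. This is a minimal illustration, not our actual tooling: the version numbers and the `satisfies_pin` helper are hypothetical, and real deployments would use their package manager's or registry's own constraint syntax.

```python
def satisfies_pin(pin: str, version: str) -> bool:
    """True if `version` falls under a major.minor pin (patch releases float)."""
    return version.split(".")[: len(pin.split("."))] == pin.split(".")

# A major.minor pin automatically picks up every new patch release...
assert satisfies_pin("2.7", "2.7.3")
# ...including one that, contrary to expectations, carries a breaking change.
assert satisfies_pin("2.7", "2.7.4")
# A new minor version would not have been pulled in automatically.
assert not satisfies_pin("2.7", "2.8.0")
```

This trade-off is deliberate: floating the patch level delivers security fixes without manual intervention, at the cost of trusting that no breaking change ever ships under an existing major.minor version.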
An additional contributing factor was that our validation process did not detect the error prior to the production deployment. The bug was latent and only materialized under load and other production conditions, which made validation difficult.
Around 18:50 UTC, a few teams were paged for a spike in errors in the EU service region. A downstream service in the EU service region had recently restarted, which triggered the request router instances in our Web application to reload. However, they failed to reload and instead crashed immediately. Our container orchestration system automatically terminated the crashed request router instances and replaced them with healthy ones. Our teams began investigating the issue. As they investigated, a downstream service started deploying to both the EU and US service regions around 19:20 UTC. Our teams were paged again for another spike in errors, this time in both regions, and initiated a major incident call. During this time, customers may have observed some requests to the Web UI, Mobile UI, or REST API fail; less than 1% of all requests failed during this period. After observing no further customer impact during the incident call, the incident was assigned to a team to continue investigating and the major incident call was ended. Our teams did not post to our status page at this time because they believed the issue was transient and did not observe any ongoing customer impact.
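This reload trigger is why the bug stayed hidden until a downstream service restarted or deployed: the broken reload path is only exercised when the set of downstream endpoints changes. A minimal sketch of that pattern, with illustrative function and endpoint names rather than our actual implementation:

```python
def check_and_reload(known_endpoints, current_endpoints, reload_config):
    """Reload routing config only when the downstream endpoint set changes.

    A bug in `reload_config` stays latent until a downstream service actually
    redeploys or restarts -- only then is the reload path exercised.
    """
    if current_endpoints != known_endpoints:
        reload_config(current_endpoints)  # a crash here kills the instance
    return current_endpoints

reloads = []
eps = {"svc-a:8080"}
# Steady state: the (possibly broken) reload path is never run.
eps = check_and_reload(eps, {"svc-a:8080"}, reloads.append)
assert reloads == []
# A downstream redeploy changes the endpoint set and triggers a reload.
eps = check_and_reload(eps, {"svc-a:8081"}, reloads.append)
assert reloads == [{"svc-a:8081"}]
```

Under this pattern, a broken base image can sit in production indefinitely and appear healthy until an unrelated downstream event fires the reload.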
Another major incident call was started around 20:05 UTC after another spike of errors occurred. As teams rejoined the call, we updated our status page and continued investigating the issue. Our teams looked at recent changes to the Web application and noticed that a Web application deploy earlier that morning had modified the file system configuration in the request router service. The change was reverted, and after observing our systems in a stable state, our teams concluded that the file system change was the culprit. There were no more downstream deployments for the remainder of the day. Between 18:50 and 20:15 UTC, four notifications were delayed in the EU region, affecting two accounts. In the same time window, 10 notifications were delayed in the US region, affecting seven accounts.
On Tuesday, February 14th at approximately 16:16 UTC, a downstream service started deploying to both the EU and US regions. Our teams were paged for another spike in errors in both regions. As on the day before, customers may have observed some requests to the Web UI, Mobile UI, or REST API fail during this period. We initiated a major incident call to investigate the issue. Our teams initially concluded that the errors were transient, and since there was no ongoing customer impact, the major incident call was ended. Between 16:00 and 17:00 UTC, one notification was delayed in the EU region, affecting one account. In the same time window, 59 notifications were delayed in the US region, affecting 26 accounts. Our teams did not post to our status page at this time because they believed the issue was transient and did not observe any ongoing customer impact.
Our teams were paged again at 17:26 UTC and we initiated another major incident call. Customers may have observed some requests to the Web UI, Mobile UI, or REST API fail during this period. Additionally, between 17:55 UTC and 18:26 UTC, the PagerDuty web application was unavailable in the EU service region. The Mobile UI and REST API remained available during this time. Between 17:00 and 18:30 UTC, four notifications were delayed in the EU region, affecting two accounts. In the same time window, 29 notifications were delayed in the US region, affecting 12 accounts. Our teams realized the previous day’s incident had not been caused by the file system change. Around the same time, another service that uses the same request router pattern as our Web application began to exhibit similar error patterns. With two services showing similar issues, our engineers noticed a recent change to the base image and saw that the external software library used for service discovery had been updated to a new minor version. Our engineers examined the changelog for the external library and found breaking changes in the new minor version we had updated to. After identifying the cause, we updated our status page while our engineers rolled back the change in the base request router service image. Once the rollback completed, they redeployed the Web application and the other affected service. This restored the request routers’ ability to reload after a downstream service redeployed or restarted.
What Are We Doing About This?
Following this incident, our teams held a thorough incident review, which identified the series of events that led to a failure of this nature.
We sincerely apologize for the delayed and unexpected notifications you or your teams experienced, and for the impact this incident had on you and your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to firstname.lastname@example.org.