From Friday, April 14th at 19:52 UTC to Tuesday, April 18th at 21:55 UTC, PagerDuty experienced an incident in the EU Service Region that prevented responder requests from completing. On Tuesday, April 18th, between 20:00 UTC and 22:07 UTC, this incident also impacted the US Service Region. During this time, responder requests were delivered to recipients, but recipients could not accept or decline them via SMS or voice. On Tuesday, April 18th at 21:55 UTC, we took steps to mitigate the issue in the EU service region; at 22:07 UTC, we took the same steps in the US service region and confirmed recovery in both service regions.
To ensure PagerDuty continues to operate on well-supported software dependencies, we completed a major version upgrade of our configuration management software on April 11 at 19:00 UTC. This upgrade introduced a faulty configuration change in the load balancer service, but because the service was not reloaded at that point, the faulty configuration lay dormant and caused no disruption.
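Faults like this can be caught at deploy time by validating the rendered configuration before it is ever applied. Below is a minimal sketch of such a pre-deploy check, assuming an HAProxy-style load balancer whose binary supports a `-c` syntax-check mode; the binary and config paths are illustrative, not a description of our actual setup:

```python
import subprocess
import sys

# Illustrative paths; the actual load balancer software and config
# location in our environment may differ.
LB_BINARY = "/usr/sbin/haproxy"
CANDIDATE_CONFIG = "/etc/haproxy/haproxy.cfg.candidate"

def validate_config(binary: str, config_path: str) -> bool:
    """Run the load balancer's built-in check mode (-c) against a
    candidate config without touching the running service."""
    result = subprocess.run(
        [binary, "-c", "-f", config_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"Config validation failed:\n{result.stderr}", file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    # Exit non-zero to abort the deploy if the candidate config is invalid,
    # so a faulty config can never lie dormant waiting for the next reload.
    sys.exit(0 if validate_config(LB_BINARY, CANDIDATE_CONFIG) else 1)
```

Run as a gate in the deployment pipeline, a check like this would have rejected the faulty configuration on April 11 rather than leaving it to surface at the next reload.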
As part of separate maintenance in the EU region, on April 14 at 19:52 UTC, we reloaded the load balancer service on all nodes, which activated the faulty configuration deployed earlier and caused a service failure. From that point on, responder requests continued to be delivered, but recipients could not accept or decline them via SMS or voice.
At 22:25 UTC, our engineers triaged the problem and tested a responder request in the EU region. The notification was received, but because we did not test the complete end-to-end responder request flow, we did not detect that acceptances and declines via voice and SMS were failing. No other impact was discovered at this time.
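An end-to-end check needs to exercise the reply path, not just notification delivery. The sketch below shows what such a synthetic round-trip check could look like; every helper name here is a hypothetical placeholder for internal test tooling, not a real PagerDuty API:

```python
import time

# Hypothetical stand-ins for internal test tooling; not a real API.
# In a real check these would trigger a responder request and drive
# the SMS reply through the actual notification pipeline.
def send_responder_request(responder: str) -> str: ...
def reply_via_sms(request_id: str, body: str) -> None: ...
def get_request_status(request_id: str) -> str: ...

def check_responder_request_round_trip() -> None:
    # 1. Trigger a responder request to a dedicated synthetic responder.
    request_id = send_responder_request(responder="synthetic-test-user")

    # 2. Simulate the responder accepting via SMS -- the reply step our
    #    original verification skipped, so the broken path went unnoticed.
    reply_via_sms(request_id, body="1")  # e.g. replying "1" to accept

    # 3. Poll until the acceptance is recorded, failing fast on timeout.
    deadline = time.time() + 60
    while time.time() < deadline:
        if get_request_status(request_id) == "accepted":
            return
        time.sleep(5)
    raise AssertionError("SMS acceptance was never recorded end-to-end")
```

Had a round-trip check of this shape run after the April 14 reload, the broken accept/decline path would have been caught that evening instead of four days later.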
On April 18th at 20:15 UTC, similar maintenance occurred in the US service region, and the impact was immediately evident: the region's higher traffic volume produced an elevated error rate that was far more visible than it had been in the EU. The responsible teams started a major incident call to triage. At 21:55 UTC, we identified the flawed load balancer configuration and deployed the required fix in the EU service region, thoroughly testing the change, including acceptances and declines of responder requests via voice and SMS. At 22:07 UTC, an identical fix was deployed in the US service region.
We ran a detailed post-mortem analysis of this incident, which helped us pinpoint the key factors that led to the failure. Our engineering teams have worked diligently to address these issues and to safeguard against similar events. Corrective measures include validating load balancer configuration changes before they can take effect and expanding our verification to cover the complete end-to-end responder request flow, including acceptances and declines via SMS and voice.
We regret the impact this incident has had on you and your teams. As always, we remain committed to offering the industry's most dependable and resilient platform. Please contact firstname.lastname@example.org if you have any questions.