Incident Responders unable to accept/decline responder requests
Incident Report for PagerDuty
Postmortem

Summary

From Friday, April 14th, at 19:52 UTC to Tuesday, April 18th, at 21:55 UTC, PagerDuty experienced an incident in the EU Service Region that prevented responder requests from completing. On Tuesday, April 18th, between 20:00 UTC and 22:07 UTC, this incident also impacted the US Service Region. During this time, responder requests were being delivered to recipients, but the recipients could not accept or decline them via SMS or voice. On Tuesday, April 18th, at 21:55 UTC, we took steps to mitigate the issue in the EU Service Region. At 22:07 UTC, we took the same steps in the US Service Region and confirmed recovery in both service regions.

What Happened

To ensure PagerDuty continues to operate on well-supported software dependencies, we completed a major version upgrade of our configuration management software on April 11th at 19:00 UTC. This upgrade introduced a faulty configuration change in the load balancer service. Because the service was not reloaded at that time, no disruption occurred.

As part of another maintenance in the EU region, on April 14th at 19:52 UTC, we reloaded the load balancer service on all the nodes. The reload caused a service failure because of the faulty configuration deployed earlier. During this time, responder requests were being delivered to recipients, but the recipients could not accept or decline them via SMS or voice.

At 22:25 UTC, our engineers triaged the problem and tested a responder request in the EU. They received the notification but were unaware that acceptances and declines were not working via voice and SMS. We failed to test the complete end-to-end responder request call. No other impact was discovered at this time.
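
The gap described above, where delivery was checked but acceptance and decline were not, can be illustrated with a minimal sketch of an end-to-end check that runs every step and reports each failure separately. This is not PagerDuty's tooling: the step names, the `run_end_to_end_check` and `failed_steps` helpers, and the simulated outcomes are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool


def run_end_to_end_check(
    steps: List[Tuple[str, Callable[[], bool]]]
) -> List[StepResult]:
    """Run every step, even after a failure, and record pass/fail for each."""
    results = []
    for name, step in steps:
        try:
            ok = bool(step())
        except Exception:
            # A step that raises is treated as a failed step, not a crash.
            ok = False
        results.append(StepResult(name, ok))
    return results


def failed_steps(results: List[StepResult]) -> List[str]:
    """Return the names of the steps that did not pass."""
    return [r.name for r in results if not r.ok]


# Hypothetical stand-ins for real probes. The outcomes mirror the incident:
# the notification arrives, but responses via SMS and voice do not work.
simulated_steps = [
    ("notification delivered", lambda: True),
    ("accept via SMS", lambda: False),
    ("decline via SMS", lambda: False),
    ("accept via voice", lambda: False),
    ("decline via voice", lambda: False),
]

results = run_end_to_end_check(simulated_steps)
print(failed_steps(results))
```

A check that stopped after the first step would have reported success here. Listing the response paths as separate steps is what makes the failure visible.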

On April 18th, at 20:15 UTC, a similar maintenance occurred in the US region, and the impact was immediately evident given the higher traffic and elevated error rate in the US compared to the EU region. The responsible teams started a major incident call to triage. At 21:55 UTC, we discovered the flawed load balancer configuration and deployed the required fix in the EU service region. Teams thoroughly tested the change, including the acceptances and declines for the responder request call via voice and SMS. At 22:07 UTC, an identical fix was implemented for the US service region.

What Are We Doing About This

We ran a detailed post-mortem analysis of this incident, which helped us pinpoint the key factors that led to this failure. Our engineering teams have worked to rectify these issues and to safeguard against similar events going forward. The corrective measures include the following:

  • We improved the test coverage of our service to validate compatibility and health before rollout.
  • We are enhancing monitoring for both services affected by the incident so that we can discover these types of problems before they cause disruptions.
  • We are adjusting our upgrade cadence for the service in question to favor smaller, more frequent upgrades over a single large upgrade.
  • We are revising our monitoring framework to alert based on the error percentage of complete callback calls for responder requests.
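
The last measure, alerting on the error percentage of callback calls, can be sketched as follows. This is a minimal illustration under stated assumptions, not PagerDuty's monitoring framework: the function names, the 5% threshold, and the minimum sample size are hypothetical values chosen for the example.

```python
def error_percentage(total_calls: int, failed_calls: int) -> float:
    """Percentage of callback calls that failed in the observed window."""
    if total_calls == 0:
        return 0.0
    return 100.0 * failed_calls / total_calls


def should_alert(
    total_calls: int,
    failed_calls: int,
    threshold_pct: float = 5.0,  # hypothetical threshold
    min_calls: int = 10,         # hypothetical minimum sample size
) -> bool:
    """Alert when the failure percentage reaches the threshold.

    Windows with fewer than min_calls calls are ignored so that a single
    failed call does not page anyone.
    """
    if total_calls < min_calls:
        return False
    return error_percentage(total_calls, failed_calls) >= threshold_pct


# A low-traffic window where every callback fails still triggers an alert,
# because the check is relative to traffic rather than an absolute count.
print(should_alert(total_calls=40, failed_calls=40))
print(should_alert(total_calls=5000, failed_calls=12))
```

Using a percentage rather than an absolute error count means a region with little traffic can still trip the alert. The minimum sample size is a trade-off: set too high, it would hide failures in exactly those low-traffic windows.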

We regret the impact this incident has had on you and your teams. As always, we remain committed to offering the industry's most dependable and resilient platform. Please contact support@pagerduty.com if you have any questions.

Posted Apr 28, 2023 - 20:17 UTC

Resolved
We have resolved an incident where all PagerDuty customers in both the US and EU service regions experienced issues with responder requests. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Apr 18, 2023 - 22:41 UTC
Monitoring
We are continuing to monitor improvement in an incident with responder requests. We have deployed a fix, and we expect systems to continue to improve. We are seeing recovery in both the EU region and US region. We will provide further updates within 20 minutes.
Posted Apr 18, 2023 - 22:18 UTC
Update
We are continuing to monitor improvement in an incident with responder requests. We have deployed a fix, and we expect systems to continue to improve. We are seeing recovery in the EU region and are working on the US region. We will provide further updates within 20 minutes.
Posted Apr 18, 2023 - 22:10 UTC
Update
We are continuing to investigate an incident where all PagerDuty customers are experiencing issues with responder requests. We are continuing to make progress on mitigating the issue. We will provide further updates within 20 minutes.
Posted Apr 18, 2023 - 21:34 UTC
Update
We are continuing to investigate an incident where all PagerDuty customers are experiencing issues with responder requests. We have started making progress on mitigating the issue. We will provide further updates within 20 minutes.
Posted Apr 18, 2023 - 21:10 UTC
Update
We are continuing to investigate an incident where all PagerDuty customers are experiencing issues with responder requests. Impacted customers may see that users are unable to accept or decline responder requests. We will provide further updates within 20 minutes.
Posted Apr 18, 2023 - 20:52 UTC
Identified
We are investigating an incident where all PagerDuty customers in all service regions are experiencing issues with the incident timeline. Impacted customers may see missing responder requests in the incident timeline. We will provide further updates within 20 minutes.
Posted Apr 18, 2023 - 20:46 UTC
Investigating
We are investigating a potential issue within PagerDuty. If we confirm an impact, we will update within 15 minutes. If there is no impact this notification will be removed.
Posted Apr 18, 2023 - 20:33 UTC
This incident affected: Notification Delivery (Responder Requests (US), Responder Requests (EU)).