Monitoring Issue
Incident Report for PagerDuty
Postmortem

Summary

PagerDuty experienced an incident starting as early as Friday, January 27th, 17:18 UTC and ending on Monday, January 30th, 19:46 UTC, that resulted in customers losing UI and API access to their Response Plays. This affected both the US and EU Service Regions. Fewer than 1% of accounts were affected. No other functionality within the application was affected.

 

What Happened

A code change was deployed to our production environments on January 27th at 17:18 UTC to upgrade a subset of accounts from the Response Plays feature to its upcoming replacement, Incident Workflows. We inadvertently applied this change to a wider range of customer accounts than intended. We received a customer report that they were unable to access Response Plays and promptly began an investigation. We opened a major incident call on January 30th at 18:33 UTC.

Our engineers reverted the code change and were then able to reverse the upgrade on the affected accounts. We achieved full remediation by 19:46 UTC that day. Affected US accounts had their API access restored at 18:38 UTC and UI access restored at 19:37 UTC. Affected EU accounts had their API access restored at 18:52 UTC and UI access restored at 19:42 UTC. We initially believed that Response Plays were affected starting on January 27th at 5:27 UTC. Over the course of our investigation we discovered that the impact began later that day, at 17:18 UTC, when the code change was deployed.

 

What Are We Doing About This

Following the incident, we conducted a thorough incident review that identified a series of events contributing to this failure. Our engineering teams have worked diligently to address these findings and to protect against similar incidents going forward. The corrective actions include the following:

  • Additional documentation and guard rails around the code that will be used to upgrade the Response Plays feature.
  • Redesigning the Response Plays upgrade flow to account for edge cases.
  • Improvements to our feature rollout practices.

 

We sincerely apologize for the interruptions with Response Plays and any impact this incident had on you and your teams. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Feb 06, 2023 - 22:39 UTC

Resolved
We have resolved an incident where a small number of PagerDuty customers in both the US and EU service regions were unable to access Response Plays. There is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Jan 30, 2023 - 19:46 UTC
Update
We have partially restored access and are in the process of completing restoration for all customers.
We will continue to post progress updates here.
Posted Jan 30, 2023 - 19:34 UTC
Monitoring
We have identified and are actively resolving an incident where a small number of PagerDuty customers in both the US and EU service regions were unable to access Response Play functionality starting 5:27 UTC Friday. The incident is currently being resolved and we will update here in 15 minutes.
Posted Jan 30, 2023 - 19:17 UTC
This incident affected: REST API (REST API (US), REST API (EU)), Web Application (Web Application (US), Web Application (EU)), Mobile Application (Mobile Application (US), Mobile Application (EU)), and Change Events (Change Events (US), Change Events (EU)).