Past Incidents Temporarily Unavailable
Incident Report for PagerDuty
Postmortem

Summary

On March 22nd, 2023, between 17:45 UTC and 18:50 UTC, PagerDuty experienced an incident in which Past Incidents did not load in the Web UI and Mobile UI across the US and EU service regions. During this period, the Past Incidents API also returned 500 “Internal Server Error” responses. No events were lost or dropped during the incident.

What Happened

During a scheduled rotation of secrets for the systems that power the Past Incidents feature, the wrong secret key was updated. As a result, our Web application could not connect to its storage systems to fetch past incidents, and the past_incidents API, Web UI, and Mobile UI were temporarily unavailable. Two factors contributed to the error not being detected before the production deployment: our validation process was not effective, and these systems used similar secret key names. Around 18:25 UTC, we decided to turn on a switch that made this API return empty 200 responses instead of 500 errors until the issue was resolved. Once the team identified that the incorrect secret key had been updated and that the systems had picked up the correct keys, traffic was restored gradually until past incidents loaded successfully.
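The switch described above can be illustrated with a minimal sketch. This is not PagerDuty's implementation: the function name `past_incidents_response`, the `degraded_mode` flag, the response shape, and the use of `ConnectionError` to represent the storage failure are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def past_incidents_response(
    fetch: Callable[[], List[dict]],
    degraded_mode: bool,
) -> Tuple[int, dict]:
    """Return (status_code, body) for a hypothetical past-incidents endpoint.

    `fetch` stands in for the storage lookup; `degraded_mode` stands in
    for the operational switch. Both names are illustrative assumptions.
    """
    try:
        return 200, {"past_incidents": fetch()}
    except ConnectionError:
        if degraded_mode:
            # Switch on: answer with an empty success so clients render
            # "no past incidents" instead of an error state.
            return 200, {"past_incidents": []}
        # Switch off: surface the storage failure as a server error.
        return 500, {"error": "Internal Server Error"}
```

The trade-off in a design like this is that an empty 200 response is indistinguishable from a genuine "no past incidents" result, so it suits a short-lived mitigation rather than a permanent default.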

What Are We Doing About This

Following this incident, our teams have identified a series of proactive actions to prevent this type of failure in the future:

  • We are implementing better mechanisms to validate the contents of the secrets management service.
  • We are enhancing our testing and detection mechanisms during the rotation of secrets.
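As an illustration of what such validation could look like, the sketch below checks a planned rotation before it is applied. It is a hypothetical example, not a description of PagerDuty's tooling: the function `validate_rotation`, the `probe` callback (assumed to test a new secret against the target system), and the key names in the usage note are all invented here. It flags keys that are not on the expected list, suggests a likely intended name when a key looks similar to an expected one, and reports expected keys that are missing from the plan.

```python
import difflib
from typing import Callable, Dict, List


def validate_rotation(
    planned: Dict[str, str],
    expected_keys: List[str],
    probe: Callable[[str, str], bool],
) -> List[str]:
    """Return a list of problems; an empty list means no problem was found.

    planned:       key name -> new secret value about to be written
    expected_keys: key names this rotation is supposed to touch
    probe:         hypothetical check that a new value actually works
    """
    problems = []
    for key, value in planned.items():
        if key not in expected_keys:
            # Similar key names are easy to confuse, so suggest a near match.
            near = difflib.get_close_matches(key, expected_keys, n=1, cutoff=0.8)
            hint = f" (did you mean {near[0]!r}?)" if near else ""
            problems.append(f"unexpected key {key!r}{hint}")
        elif not probe(key, value):
            problems.append(f"new value for {key!r} failed the connection probe")
    for key in expected_keys:
        if key not in planned:
            problems.append(f"expected key {key!r} is not being rotated")
    return problems
```

For example, calling `validate_rotation({"past_incident_db_key": "..."}, ["past_incidents_db_key"], probe)` would report both the unexpected key, with a suggestion of the similar expected name, and the expected key that the plan omits.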

We sincerely apologize for the unavailability of the Past Incidents feature and for the impact this incident had on you and your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Apr 12, 2023 - 16:20 UTC

Resolved
We have resolved an incident where all PagerDuty customers in both the US and EU service regions experienced issues with accessing and viewing past incidents within the mobile and web applications. We pushed a solution to a subset of internal subdomains and were able to successfully load past incidents in the web and mobile applications in the US Service Region. We rolled out the solution to the web and mobile applications, and are seeing successful past incident requests in both Service Regions. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Mar 22, 2023 - 19:10 UTC
Update
We are continuing to investigate an incident where all PagerDuty customers are experiencing issues with accessing and viewing past incidents within the mobile and web applications. We are investigating a few possible solutions. We will provide further updates within 20 minutes.
Posted Mar 22, 2023 - 18:43 UTC
Update
We are continuing to investigate an incident where all PagerDuty customers are experiencing issues with accessing and viewing past incidents within the mobile and web applications. We are investigating the issue. We will provide further updates within 20 minutes.
Posted Mar 22, 2023 - 18:22 UTC
Identified
We are investigating an incident where all PagerDuty customers in all regions are experiencing issues with viewing past incidents within the mobile and web applications. Impacted customers may see a loading screen on the web and an empty banner in the mobile app. We will provide further updates within 20 minutes.
Posted Mar 22, 2023 - 18:01 UTC
Investigating
We are investigating a potential issue within PagerDuty. If we confirm an impact, we will update within 15 minutes. If there is no impact, this notification will be removed.
Posted Mar 22, 2023 - 17:55 UTC
This incident affected: Web Application (Web Application (US), Web Application (EU)) and Mobile Application (Mobile Application (US), Mobile Application (EU)).