On March 22nd, 2023, between 17:45 UTC and 18:50 UTC, PagerDuty experienced an incident in which Past Incidents did not load in the Web UI and Mobile UI across the US and EU service regions. During this period, the Past Incidents API also returned 500 “Internal Server Error” responses. No events were lost or dropped during this time.
During a scheduled rotation of secrets for the systems that power the Past Incidents feature, the wrong secret key was updated, which left our Web application unable to connect to its storage systems to fetch past incidents. This resulted in temporary unavailability of the past_incidents API, the Web UI, and the Mobile UI. Two contributing factors kept the error from being detected before the production deployment: our validation process was not effective, and these systems used similar secret key names.
Around 18:25 UTC, we decided to turn on a switch that made this API return empty 200 responses instead of 500s until the issue was resolved. Once the team identified that the incorrect secret key had been updated and confirmed that the systems had picked up the correct keys, traffic was restored gradually until past incidents loaded successfully.
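As an illustration only, the mitigation switch described above could look roughly like the following sketch. It is not PagerDuty's implementation: the names `PastIncidentsHandler`, `StorageUnavailable`, `degrade_to_empty`, and the response shape are all hypothetical, chosen to show the idea of returning an empty 200 response instead of a 500 while the storage backend is unreachable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class StorageUnavailable(Exception):
    """Hypothetical error raised when the backing store cannot be reached."""


@dataclass
class PastIncidentsHandler:
    # Hypothetical function that loads past incidents from storage.
    fetch: Callable[[str], List[dict]]
    # Hypothetical operator-controlled switch; off by default.
    degrade_to_empty: bool = False

    def handle(self, incident_id: str) -> Tuple[int, dict]:
        """Return an (HTTP status, JSON body) pair for one request."""
        try:
            return 200, {"past_incidents": self.fetch(incident_id)}
        except StorageUnavailable:
            if self.degrade_to_empty:
                # Degraded mode: clients get a valid but empty result.
                return 200, {"past_incidents": []}
            # Default behavior: surface the failure to the caller.
            return 500, {"error": "Internal Server Error"}
```

The trade-off in a switch like this is that clients stop seeing errors but cannot distinguish "no past incidents" from "past incidents temporarily unavailable", so it suits a short-lived mitigation rather than a default.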
Following this incident, our teams have identified a series of proactive actions to prevent this type of failure in the future:
We sincerely apologize for the unavailability of the Past Incidents feature. We understand how vital our platform is for our customers. We apologize for the impact this incident had on you and your teams. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.