On Thursday, September 23, from 15:21 UTC to 15:47 UTC, some PagerDuty users, primarily in the US service region, experienced intermittent issues or service downtime when accessing our services via Web, Mobile or REST APIs. Event ingestion and notification delivery continued to function normally during this incident.
PagerDuty deployed a minor feature enhancement to production that increased how often a long-duration request was made to a backend service in our Events pipeline. Before the production deployment, the change was tested through our standard development and feature testing cycles and in our internal staging environments.
However, under the heavier traffic load, especially in the US service region, the change multiplied the number of these long-duration requests to an unanticipated level, overwhelming the backend service. As a result, our internal health checks restarted the service multiple times, making the Web, Mobile and REST APIs unresponsive.
Our engineers identified the change that caused the incident and initiated a rollback at 15:34 UTC. The rollback completed and the system had fully recovered by 15:47 UTC.
Following the incident, we conducted a thorough postmortem. Based on its findings, we are addressing multiple factors that contributed to this issue; this remediation work is planned and currently in progress.
We understand how critical our platform is for our customers. We apologize for the impact this incident had on you and your teams. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.