Web UI, REST API, mobile application issues
Incident Report for PagerDuty
Postmortem

Summary

On Thursday, September 23, 2021, from 15:21 UTC to 15:47 UTC, some PagerDuty users, primarily in the US service region, experienced intermittent issues or service downtime when accessing our services through the Web UI, mobile application, or REST API. Event ingestion and notification delivery continued to function normally throughout the incident.

What Happened

PagerDuty deployed a minor feature enhancement to production that increased how often a long-duration request was made to a backend service in our Events pipeline. Before it was deployed to production, the change was tested through our standard development and feature testing cycles and in our internal staging environments.

However, under the increased traffic load, especially in the US service region, the change multiplied the number of these long-duration requests to an unanticipated level, overwhelming the backend service. As a result, our internal health checks restarted the service multiple times, which made the Web, Mobile and REST APIs unresponsive.

Our engineers identified the change that caused the incident and initiated a rollback at 15:34 UTC. The rollback completed and the system recovered to a fully working state by 15:47 UTC.

What We Are Doing About This

Following the incident, we conducted a thorough postmortem. Based on its findings, we are addressing multiple factors that contributed to this issue. The following actions are planned and in progress:

  • Improve development testing: Add testing that detects when the number of requests generated by our web pages changes significantly and unexpectedly.
  • Graceful degradation: Modify the Events backend service to handle degradation or unavailability of the service gracefully.
  • Fleet segregation: Isolate the Web API, Mobile API and Public API so that issues affecting one area of functionality do not impact the others.
  • Update deployment templates: Update the deployment templates to shorten the time the rollback procedure takes.
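
As an illustration of the graceful degradation item above, one common pattern is a circuit breaker: after repeated failures, callers stop sending requests to a struggling dependency for a cool-down period and return a degraded response instead. The sketch below is a generic example and does not describe PagerDuty's actual implementation; the class name, thresholds, and fallback are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical helper: skips calls to a failing dependency during a cool-down."""

    def __init__(
        self,
        max_failures: int = 3,          # assumed threshold, not from the report
        reset_after: float = 30.0,      # assumed cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """Return True while calls should be skipped."""
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: allow one trial call. A single further
            # failure reopens the breaker immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run func, or return fallback() if the breaker is open or func fails."""
        if self.is_open():
            return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the breaker fully
        return result
```

A caller would wrap each long-duration request, for example `breaker.call(fetch_from_backend, lambda: cached_or_empty_result)`, where both function names are placeholders. While the breaker is open, the backend receives no traffic from that caller, which gives it room to recover instead of being restarted repeatedly under load.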

We understand how important and critical our platform is for our customers. We apologize for the impact this incident had on you and your teams. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Oct 01, 2021 - 21:39 UTC

Resolved
We have recovered. Web UI, REST API, and mobile application are now working as expected.
Posted Sep 23, 2021 - 15:53 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 23, 2021 - 15:49 UTC
Identified
This issue has been identified and a fix is being implemented.
Posted Sep 23, 2021 - 15:44 UTC
Investigating
We are currently experiencing issues with the REST API, Web UI, and mobile application. Event ingestion and notifications are not affected and are working as expected. We are currently investigating.
Posted Sep 23, 2021 - 15:26 UTC
This incident affected: REST API (US), Mobile Application (US), and Web Application (US).