On Saturday, April 1st, between 20:46 and 21:28 UTC, the PagerDuty REST API operated in a degraded state in our US service region. During this time, customers interacting with the REST API experienced elevated 5xx error rates and increased response times; 91% of requests to the REST API were still successful during this period. All other components, including the Events API, Notification Delivery, the Web Application, and the Mobile Application, functioned normally and were not impacted. Other service regions were not affected by this incident.
We had recently completed an upgrade that moved our REST API services onto our new container runtime infrastructure. This upgrade gives us new capabilities to provide a more reliable and resilient service for you, our customers. Throughout the testing, verification, and rollout process, the service operated normally and without issue.
Today, our REST API services consist of a primary service that handles requests and a sidecar proxy that handles service discovery and other networking-related concerns. When a request is processed by our REST API service, it is first handled by the primary service and then proxied through the sidecar before arriving at its final destination.
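To make that request path concrete, here is a minimal sketch of the sidecar pattern described above. It is illustrative only and not our actual code: the local proxy port, the use of the Host header for routing, and the service names are assumptions.

```python
# Minimal sketch of the sidecar pattern described above (illustrative only).
# The local proxy port and Host-header routing are assumptions, not our config.
import urllib.request

SIDECAR_PROXY = "http://127.0.0.1:15001"  # hypothetical local sidecar listener


def call_downstream(service_name: str, path: str) -> bytes:
    """Send an outbound request via the local sidecar proxy.

    The primary service does not resolve the downstream address itself;
    it hands the request to the sidecar, which performs service discovery
    and routing before forwarding it to the final destination.
    """
    req = urllib.request.Request(
        f"{SIDECAR_PROXY}{path}",
        headers={"Host": service_name},  # the sidecar routes on this header
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read()
```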
On Saturday, April 1st at 20:46 UTC, a subset of our REST API instances encountered a “noisy neighbor” condition, leaving them unable to obtain the compute resources needed to continue processing requests. On the affected instances, the primary service exited and was restarted by the container runtime. Upon restarting, the primary service entered a state where it could receive traffic but could not communicate with the sidecar proxy to complete request handling. These instances were effectively unhealthy, but our health checks did not correctly model this failure mode.
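As an illustration of how this can go undetected, the sketch below shows a shallow health check of the kind that misses this failure mode: it confirms the primary service is accepting connections but never exercises the path through the sidecar. The host and port values are assumptions, not our actual configuration.

```python
# Illustrative only: a shallow check like this still passes when the
# primary-to-sidecar path is broken, so the instance keeps receiving traffic.
import socket


def is_healthy(host: str = "127.0.0.1", app_port: int = 8080) -> bool:
    """Return True if the primary service accepts a TCP connection."""
    try:
        with socket.create_connection((host, app_port), timeout=1):
            return True
    except OSError:
        return False
```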
By 20:48 UTC, approximately 12% of requests to the REST API in the US service region were resulting in a 5xx response code, and the team responsible for the PagerDuty REST API was paged. During the initial response we saw signs that the REST API might be overloaded. We scaled up our REST API capacity by 50%, and by 21:05 UTC only 8% of requests were resulting in a 5xx response code.
Aware that our REST API had recently been moved to the new container runtime, we simultaneously began the process of moving back to our legacy container runtime. Since we believed that the REST API was only in a state of partial failure, we performed this rollback in a way that minimized the risk of exacerbating the problem. In particular, we took care to avoid a state where the new runtime instances had been removed before the legacy runtime instances had completed deployment.
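The ordering here matters: capacity is added on the legacy runtime and verified healthy before anything is removed from the new runtime. The sketch below illustrates that sequencing with hypothetical helper functions standing in for deployment tooling; it is not our actual automation.

```python
# Illustrative sequencing only; these helpers are hypothetical placeholders
# for deployment tooling, not our actual rollback automation.
import time


def scale_legacy_runtime(count: int) -> None:
    """Placeholder: request `count` instances on the legacy container runtime."""


def legacy_capacity_healthy(count: int) -> bool:
    """Placeholder: report whether `count` legacy instances pass health checks."""
    return True


def drain_new_runtime() -> None:
    """Placeholder: remove the instances running on the new container runtime."""


def rollback(target_instances: int) -> None:
    # Add legacy capacity first so the fleet is never left short of healthy
    # instances while the rollback is in flight.
    scale_legacy_runtime(target_instances)
    while not legacy_capacity_healthy(target_instances):
        time.sleep(10)
    # Only once legacy capacity is serving traffic are the new runtime
    # instances, including the unhealthy ones, removed.
    drain_new_runtime()
```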
By 21:21 UTC, the last of the unhealthy instances had been removed as part of the rollback process, and error rates for the REST API in the US service region returned to normal levels by 21:22 UTC. No other service regions or systems were affected during this time.
Following this incident, our teams held a thorough incident review to ensure we have additional controls in place for this failure mode. In particular, we will be changing our health check strategy to correctly detect this unhealthy service state, and we have already resolved the root cause of the noisy neighbor condition.
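As a rough sketch of the direction this change takes, a deeper readiness check can verify both that the primary service responds and that the intra-pod path to the sidecar is usable. The endpoint and ports below are assumptions, not our actual configuration.

```python
# Hedged sketch of a deeper readiness check; the /healthz endpoint and the
# port numbers are assumptions, not our actual configuration.
import socket
import urllib.request


def is_ready(app_port: int = 8080, sidecar_port: int = 15001) -> bool:
    """Pass only if the primary service responds AND the sidecar is reachable."""
    try:
        # 1. The primary service answers its own health endpoint.
        with urllib.request.urlopen(
            f"http://127.0.0.1:{app_port}/healthz", timeout=1
        ) as resp:
            if resp.status != 200:
                return False
        # 2. The intra-pod path to the sidecar proxy is usable.
        with socket.create_connection(("127.0.0.1", sidecar_port), timeout=1):
            return True
    except OSError:
        return False
```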
Our teams continue to investigate both why these REST API instances ended up unable to establish intra-pod network connections after the noisy neighbor condition and whether additional infrastructure changes are required to ensure proper resource allocation.
We sincerely apologize for the degraded REST API service in our US region and for the impact this incident had on you and your teams. We understand how vital our platform is to our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.