On April 13th, 2023, between 14:52 and 15:10 UTC, the PagerDuty Web Application operated in a degraded state in our US service region. During this time, customers using the PagerDuty Web Application had a sluggish experience and saw intermittent error pages. All other components, including the Events API, Notification Delivery, REST API, and Mobile Application, functioned normally and were not impacted. This incident did not impact other service regions.
We have recently been improving the customer experience in our Web Application’s Incidents UI by utilizing a technology called Websockets. Websockets maintain a connection between the browser and the backend server, which allows PagerDuty to push new updates to the browser as they happen. These improvements have been rolling out gradually to increasing numbers of customers.
On Thursday, April 13th, at 14:47 UTC, a code change was deployed to the servers that maintain the websocket connections. The deployment process rolls out the new code gradually over about 10 minutes, gracefully terminating an old server and starting a new one, waiting a short time, and repeating until all the servers run the new code.
During the deployment process, when a server is terminated, its websockets are also disconnected. This prompts all the frontends to try to reconnect and, upon reconnecting, to make several data requests. The frontends attempted their reconnect with a randomized retry window, which worked well during the initial rollout of the feature. However, at full rollout this window was not wide enough to spread the additional request load effectively across our available capacity.
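To illustrate what a randomized retry window looks like in practice, here is a minimal sketch of one common approach, exponential backoff with full jitter. It is not PagerDuty's actual client code: the function name `reconnect_delay` and the `base_seconds` and `cap_seconds` values are assumptions chosen for illustration.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base_seconds: float = 1.0,   # hypothetical starting window
    cap_seconds: float = 60.0,   # hypothetical upper bound on the window
    rng: Optional[random.Random] = None,
) -> float:
    """Return how long a client should wait before reconnect attempt `attempt`.

    The window doubles with each failed attempt (capped at `cap_seconds`),
    and the delay is drawn uniformly from [0, window] so that clients
    disconnected at the same moment do not all reconnect at the same moment.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    rng = rng or random.Random()
    window = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, window)


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(reconnect_delay(attempt), 2))
```

The width of the window is the key parameter: a window that is comfortable for a small number of connected clients can become too narrow as more clients share the same backend capacity.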
We were underprovisioned for the new load pattern introduced by the websockets feature. As the websockets reconnected and made their associated requests, response times increased and requests were retried, so customers waited progressively longer for parts of the page to be populated with data. As the request pool became overloaded, we started dropping some requests, which caused customers to see an error page.
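The relationship between the retry window and the peak request load can be explored with a small simulation. The sketch below is purely illustrative: the function name and all of the numbers (client count, requests per reconnect, window widths) are hypothetical and are not figures from this incident.

```python
import random
from collections import Counter


def peak_requests_per_second(
    clients: int,
    requests_per_reconnect: int,
    window_seconds: float,
    seed: int = 0,
) -> int:
    """Simulate `clients` reconnecting at uniformly random times within
    `window_seconds`, each issuing `requests_per_reconnect` requests on
    reconnect, and return the busiest one-second bucket's request count."""
    rng = random.Random(seed)
    # Assign each client's reconnect to a one-second bucket.
    buckets = Counter(
        int(rng.random() * window_seconds) for _ in range(clients)
    )
    return max(buckets.values()) * requests_per_reconnect


if __name__ == "__main__":
    # Hypothetical values: 10,000 clients, 4 data requests per reconnect.
    for window in (5, 30, 120):
        print(window, peak_requests_per_second(10_000, 4, window))
```

With a fixed window, the peak grows roughly in proportion to the number of connected clients, which is consistent with a window that worked during an early rollout stage falling short at full rollout.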
The engineer rolling out the new code change executed an Emergency Rollback out of an abundance of caution. This had the unfortunate consequence of immediately repeating the deployment process, further contributing to the websocket reconnection load issue.
By 15:10 UTC, customer experience returned to normal as all websockets had reconnected, and we returned to our provisioned steady-state load.
Following this incident, our teams held a thorough incident investigation and review to determine the cause of the failure and to ensure we can protect against it in the future. We have increased our server capacity to handle the additional load from the websockets feature. We have also updated the websockets code to reconnect more gracefully during deployments.
We sincerely apologize for the degraded Web Application experience in our US region and for the impact this incident had on you and your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to firstname.lastname@example.org.