500 Errors
Incident Report for PagerDuty
Postmortem

Summary

On April 13th, 2023, between 14:52 and 15:10 UTC, the PagerDuty Web Application operated in a degraded state in our US service region. During this time, customers using the PagerDuty Web Application had a sluggish experience and saw intermittent error pages. All other components, including the Events API, Notification Delivery, REST API, and Mobile Application, functioned normally and were not impacted. This incident did not impact other service regions.

What Happened

We have recently been improving the customer experience in our Web Application’s Incidents UI by utilizing a technology called Websockets. Websockets maintain a connection between the browser and the backend server, which allows PagerDuty to push new updates to the browser as they happen. These improvements have been rolling out gradually to increasing numbers of customers.

On Thursday, April 13th, at 14:47 UTC, a code change was deployed to the servers that maintain the websocket connections. The deployment process rolls out the new code gradually over about 10 minutes, gracefully terminating an old server and starting a new one, waiting a short time, and repeating until all the servers run the new code.

During the deployment process, when a server is terminated, its websockets are also disconnected. This prompts all the frontends to try to reconnect and upon reconnecting, to make several data requests. The frontends attempted their reconnect with a randomized retry window, which worked well during the initial rollout of the feature. However, this was not sufficient at full rollout to effectively spread the additional request load across our available capacity.
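
As a rough illustration of why a retry window that copes at partial rollout can fall short at full rollout, the sketch below simulates clients reconnecting at a uniformly random second within a fixed window. The function name, client counts, window length, and requests per reconnect are all hypothetical and are not PagerDuty's actual figures or code.

```python
import random
from collections import Counter


def peak_requests_per_second(clients, window_seconds, requests_per_reconnect, seed=0):
    """Simulate clients reconnecting at a random second inside a retry window.

    Returns the largest number of data requests that land in any one-second
    bucket. Every parameter here is an illustrative assumption.
    """
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(clients):
        # Each client picks a whole-second delay in [0, window_seconds).
        second = rng.randrange(window_seconds)
        # On reconnect, the client issues several data requests at once.
        buckets[second] += requests_per_reconnect
    return max(buckets.values())


# Same 10-second window, ten times as many connected clients.
partial_rollout_peak = peak_requests_per_second(1_000, 10, 3)
full_rollout_peak = peak_requests_per_second(10_000, 10, 3)
```

With a fixed window, the peak grows roughly in proportion to the number of connected clients, so the window (or the capacity behind it) has to grow as the feature reaches more customers.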

We were underprovisioned for the new load pattern introduced by the websockets feature. As the websockets reconnected and made their associated requests, response times increased while requests were retried, and customers waited longer and longer for parts of the page to be populated with data. As the request pool became overloaded, we started dropping some requests, which caused customers to see an error page.
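
The overload behavior described above can be pictured with a toy model of a fixed-size request pool that rejects work once every slot is busy. This is a minimal sketch under assumed names (`BoundedRequestPool`, `try_accept`, `complete`); it does not represent PagerDuty's actual request handling.

```python
class BoundedRequestPool:
    """Toy model of a fixed-size request pool that rejects work when full."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of concurrent requests
        self.in_flight = 0        # requests currently being served
        self.dropped = 0          # requests rejected because the pool was full

    def try_accept(self):
        """Admit a request if a slot is free; otherwise count it as dropped."""
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return True
        self.dropped += 1
        return False

    def complete(self):
        """Release the slot held by a finished request."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

In this model, a dropped request corresponds to the error page a customer would see, and slots free up only as in-flight requests complete, which is why slower responses make the overload worse.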

The engineer rolling out the new code change executed an Emergency Rollback out of an abundance of caution. This had the unfortunate consequence of immediately repeating the deployment process, further contributing to the websocket reconnection load issue.

By 15:10 UTC, customer experience returned to normal as all websockets had reconnected, and we returned to our provisioned steady-state load.

What Are We Doing About This

Following this incident, our teams held a thorough incident investigation and review to determine the cause of the failure and how to protect against it in the future. We have increased our server capacity to handle the additional load from the websockets feature. We have also updated the websockets code to reconnect more gracefully during deployments.
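
This report does not detail the reconnect changes. One common technique for more graceful reconnection is exponential backoff with "full jitter", sketched below as an assumption rather than a description of the actual fix. The function name, base delay, and cap are hypothetical.

```python
import random


def reconnect_delay(attempt, base_seconds=1.0, cap_seconds=60.0, rng=random):
    """Return a randomized delay before reconnect attempt `attempt` (0-based).

    Uses "full jitter": a uniform draw between 0 and the capped exponential
    backoff, so clients that disconnect together do not retry together.
    """
    # The backoff ceiling doubles with each attempt, up to cap_seconds.
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Example: delays a single client might wait across its first five attempts.
delays = [reconnect_delay(attempt) for attempt in range(5)]
```

Compared with a fixed retry window, the ceiling widens on each failed attempt, which spreads retries further apart exactly when the backend is struggling.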

We sincerely apologize for the degraded Web Application experience in our US region and for the impact this incident had on you and your teams. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Apr 20, 2023 - 20:56 UTC

Resolved
We have resolved an incident where all PagerDuty customers in both the US and EU service regions experienced issues with brief 500 errors on web. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Apr 13, 2023 - 15:18 UTC
Monitoring
We are investigating an issue where all PagerDuty customers would have seen brief 500 errors. If we confirm an impact, we will update within 15 minutes.
Posted Apr 13, 2023 - 15:14 UTC
This incident affected: Web Application (Web Application (US), Web Application (EU)).