500 Errors in WebUI and Delayed Stakeholder Notifications and Responder Requests
Incident Report for PagerDuty
Postmortem

Summary

On Monday, August 19, 2019, at 17:58 UTC, PagerDuty experienced a 50-minute performance degradation affecting a subset of our product features. For the duration of the incident, users were unable to receive Responder Requests, Incident Status Updates, and Webhooks. Additionally, users received 500 errors from the Users REST API endpoint and in the Web app when viewing User Profiles.

During this time, 70% of Responder Requests, all Status Updates, and all Webhooks were delayed until the incident was resolved. The remaining 30% of Responder Requests could not be delivered.

Notifications and the rest of the PagerDuty platform and products were unaffected during this incident.

What Happened

A configuration change inadvertently affected access to a port used to communicate with our messaging service. As a result, the affected services were unable to reach the messaging service to handle these requests. Additionally, the affected services and Webhooks use the same fleet of background tasks to deliver their requests. With background tasks tied up unsuccessfully trying to serve request traffic, the number of background tasks available to serve Webhooks was depleted, leading to a delay in webhook delivery.
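
The depletion described above can be illustrated with a small simulation. This is only a sketch under assumed numbers, not a model of PagerDuty's actual background task system: the pool size, the 30-second timeout, the 1-second webhook delivery time, and all task names are hypothetical. It compares how long a webhook waits when it shares a pool with requests that hang, versus when it has a pool of its own.

```python
import heapq


def simulate(worker_count, tasks):
    """Run tasks in FIFO order on a fixed-size worker pool.

    tasks: list of (name, duration_seconds) in queue order.
    Returns a dict mapping each task name to its finish time in seconds.
    """
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * worker_count
    heapq.heapify(free_at)
    finished = {}
    for name, duration in tasks:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        finished[name] = end
        heapq.heappush(free_at, end)
    return finished


# Hypothetical numbers: calls to the unreachable messaging service hang
# until a 30 s timeout; a healthy webhook delivery takes 1 s.
BLOCKED = [(f"responder-request-{i}", 30.0) for i in range(4)]
WEBHOOKS = [(f"webhook-{i}", 1.0) for i in range(2)]

shared = simulate(2, BLOCKED + WEBHOOKS)  # one pool for everything
isolated = simulate(2, WEBHOOKS)          # webhooks on their own pool

print(shared["webhook-0"])    # 61.0
print(isolated["webhook-0"])  # 1.0
```

In the shared pool, both workers are occupied by hanging requests, so the first webhook cannot start until 60 seconds have passed. With a dedicated pool it completes after 1 second, regardless of what the other services are doing.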

The original change was intended to target other ports on our messaging service, not the one whose loss led to the degradation of service. Due to an unintended naming collision, this port was inadvertently overwritten. Because we did not anticipate this accidental overwrite, the root cause was not discovered until later in the incident.
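
One general way to guard against this class of problem is to refuse to merge named port definitions when a name is already in use with a different value. The following is a minimal sketch, assuming port definitions are kept as a name-to-port mapping. The function, the port names, and the port numbers are all hypothetical and are not taken from PagerDuty's configuration.

```python
def merge_ports(existing, change):
    """Merge named port definitions, refusing to overwrite silently.

    Raises ValueError if `change` redefines a name that already exists
    in `existing` with a different port number.
    """
    collisions = sorted(
        name
        for name in change
        if name in existing and existing[name] != change[name]
    )
    if collisions:
        raise ValueError(f"port name collision: {', '.join(collisions)}")
    return {**existing, **change}


# Hypothetical names and numbers, for illustration only.
existing = {"messaging-client": 7001, "messaging-admin": 7002}
change = {"messaging-metrics": 7003, "messaging-client": 9000}

try:
    merge_ports(existing, change)
except ValueError as err:
    print(err)  # port name collision: messaging-client
```

A plain dictionary merge would have accepted the change and quietly replaced the existing entry. Failing loudly at merge time surfaces the collision before the change is applied.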

After the configuration change was reverted, service to the affected portions of PagerDuty was fully restored, and pending Responder Requests, Status Updates, and Webhooks were successfully delivered.

What Are We Doing About This?

We have initiated a change to our port naming scheme, particularly for the port that was affected, to ensure accidental overwrites do not happen again.

Additionally, we are updating our Webhooks service to no longer use these background tasks.

Finally, we are working to ensure that changes to our messaging service are easier to test in a staging environment to prevent regressions in our production environment.
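
As one illustration of what such a staging test could include, the sketch below checks that a list of TCP ports still accepts connections after a configuration change. It is an assumption about one possible check, not a description of PagerDuty's tooling. The function names are hypothetical, and the host and ports would come from the staging environment.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def unreachable_ports(host, ports, timeout=2.0):
    """Return the subset of ports on host that do not accept connections."""
    return [p for p in ports if not port_is_reachable(host, p, timeout)]


# Example use after applying a change in staging (hypothetical host and ports):
#   failed = unreachable_ports("messaging.staging.example", [7001, 7002])
#   if failed:
#       raise SystemExit(f"ports unreachable after change: {failed}")
```

A check like this only confirms that the ports accept connections. It does not verify that the messaging service behaves correctly behind them.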

We are very sorry for the impact this incident had on our customers. We know how much our customers rely on PagerDuty to do their best work, and we did not meet that commitment. We will continue to use what we learned from this incident to improve our service. If you have any further questions, please reach out to support@pagerduty.com.

Posted Aug 23, 2019 - 17:30 UTC

Resolved
We have fully recovered.
Posted Aug 19, 2019 - 16:47 UTC
Monitoring
We have deployed a fix and are currently monitoring for signs of recovery.
Posted Aug 19, 2019 - 16:41 UTC
Investigating
We are currently experiencing 500 errors when viewing user profiles. Stakeholder notifications and responder requests are also delayed. We have identified the issue and are currently working to mitigate impact.
Posted Aug 19, 2019 - 16:39 UTC
This incident affected: Web Application.