Delayed Event Processing
Incident Report for PagerDuty
Postmortem

Summary

Starting at 14:40 UTC on September 11, 2019 and ending at 17:17 UTC the same day, we experienced an internal networking communication disruption. An internal system used for service discovery came under high load and responded slowly. As a result, multiple services involved in how we group alerts were unable to communicate with each other. This occurred in the midst of a planned infrastructure migration from old, manually provisioned and managed AWS EC2 hosts to a newer containerized orchestration environment.

During the incident, a percentage of our notifications were significantly delayed because event processing itself was delayed. We did not drop any events during this time. PagerDuty's web application, mobile apps, and REST APIs remained fully available throughout the incident.

What Happened

Due to a few key differences in traffic profiles between our test and production environments, these new hosts were unable to process traffic quickly enough in certain situations. They consequently failed health checks, were marked as unhealthy, and were no longer eligible to accept traffic. This depletion of healthy capacity shifted more traffic onto the already struggling nodes, pushing them over as well, and new hosts could not be spun up quickly enough without causing additional impact or exhausting resources.
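
The failure mode here is a cascading overload: each host that fails its health check and is drained pushes its share of traffic onto the remaining hosts, making them more likely to fail as well. The toy simulation below is a minimal sketch with invented capacities and traffic numbers (it is not our actual configuration or code), shown only to illustrate why a pool that is already near its aggregate capacity can drain completely once nodes start being removed.

    # Toy model of the cascade: overloaded nodes fail health checks, are drained,
    # and their traffic is re-balanced onto the survivors. All numbers are invented.
    def simulate(total_rps, node_capacity_rps, healthy_nodes):
        while healthy_nodes > 0:
            per_node_load = total_rps / healthy_nodes
            if per_node_load <= node_capacity_rps:
                return healthy_nodes   # pool stabilizes with this many nodes
            healthy_nodes -= 1         # an overloaded node is marked unhealthy
        return 0                       # cascade: no healthy nodes remain

    # 10 nodes at 100 rps each absorb 900 rps; at 1,100 rps every removal only
    # deepens the overload and the pool drains to zero.
    print(simulate(900, 100, 10))    # -> 10
    print(simulate(1100, 100, 10))   # -> 0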

A rollback was initiated, but the older fleet had been scaled down over time to free up resources. The rollback therefore took longer than expected, as it required re-provisioning enough machines to handle the full load. Once the old environment had been scaled back up, traffic was redirected and event processing resumed as normal. By 16:45 UTC, we were seeing signs of recovery.
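
One lesson reflected in the timeline above is that a fallback environment is only a fast rollback path if it retains enough capacity to take the full load. A hypothetical guard like the sketch below (the names and headroom factor are invented for illustration, not taken from PagerDuty's tooling) captures the idea: scale the old fleet back up first, and only redirect traffic once it can absorb expected peak load with some margin.

    # Hypothetical pre-cutover check, illustrative only: do not redirect traffic
    # to a fallback fleet until it can handle expected peak load with headroom.
    def safe_to_cut_over(fleet_capacity_rps, expected_peak_rps, headroom=1.25):
        return fleet_capacity_rps >= expected_peak_rps * headroom

    print(safe_to_cut_over(800, 1000))    # False: keep re-provisioning
    print(safe_to_cut_over(1300, 1000))   # True: safe to redirect traffic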

What Are We Doing About This?

The migration is still planned, but the plan has been revised to add more logging and monitoring and to better support rollback in failure scenarios. We have also engaged key members of partner teams to increase overall visibility and reduce unexpected impact.

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Sep 18, 2019 - 20:56 UTC

Resolved
We are in full recovery. Our Engineers have successfully put remediation steps in place to ensure this issue will not recur.
Posted Sep 11, 2019 - 17:17 UTC
Update
Event processing is fully restored. All previously queued events have been successfully processed. Our Engineering team is in the process of implementing mitigation actions to ensure this issue will not recur.
Posted Sep 11, 2019 - 17:05 UTC
Monitoring
A fix has been successfully deployed, and we are seeing signs of recovery. We are currently monitoring the results.
Posted Sep 11, 2019 - 16:27 UTC
Identified
We have identified the cause of the issue. Our Engineering team is currently deploying a fix.
Posted Sep 11, 2019 - 16:14 UTC
Investigating
We are currently experiencing an issue with event processing through our Events API and with email integrations. Our engineering team is actively investigating solutions.
Posted Sep 11, 2019 - 16:04 UTC
This incident affected: Events API.