Delays Processing Incoming Events
Incident Report for PagerDuty
Postmortem

Summary

From June 21st, 2020 at 18:30 UTC to 20:24 UTC, PagerDuty experienced a major incident, published to our status page as delays in processing incoming events.

During this time, a significant percentage of events sent via the Events API were being heavily throttled. Events that were accepted were not immediately processed, resulting in delays in incident and notification creation. This affected all customers who sent events to us during this window.

What Happened

One of our containers hosting our internal load balancers encountered a memory allocation problem at 18:33 UTC due to a bug in our third-party load balancer software. Our event ingestion system saw an increase in the overall number of failed requests from the load balancers, causing our event processing to be automatically throttled. While we were still accepting some events, it was not enough to keep up with all user activity.
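As an illustrative sketch only (not our actual implementation), automatic throttling of this kind typically sheds load in proportion to the observed failure rate from the downstream system; the window size and threshold below are hypothetical assumptions.

    import random
    from collections import deque

    class FailureRateThrottle:
        """Hypothetical sketch: shed incoming events when the upstream
        failure rate climbs. Window size and threshold are illustrative."""

        def __init__(self, window=1000, max_failure_rate=0.10):
            self.outcomes = deque(maxlen=window)   # True = success, False = failure
            self.max_failure_rate = max_failure_rate

        def record(self, success: bool) -> None:
            self.outcomes.append(success)

        def failure_rate(self) -> float:
            if not self.outcomes:
                return 0.0
            return self.outcomes.count(False) / len(self.outcomes)

        def should_accept(self) -> bool:
            """Accept fewer events the further the failure rate is past the threshold."""
            rate = self.failure_rate()
            if rate <= self.max_failure_rate:
                return True
            accept_probability = max(
                0.0, 1.0 - (rate - self.max_failure_rate) / (1.0 - self.max_failure_rate)
            )
            return random.random() < accept_probability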

Within minutes, PagerDuty engineers were notified of the issue; however, internal communication issues delayed the publishing of incident updates.

At 19:51 UTC, the problem was isolated to the load balancers, and our engineers forced a rolling restart on those containers. Shortly thereafter, events began processing at a normal rate, and the backlog of events began to shrink.

At 20:17 UTC, the backlog was cleared, and we began replaying events from the dead letter queue.
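For readers unfamiliar with the pattern, replaying a dead letter queue means re-submitting events that could not be processed during the outage through the normal ingestion path. The following is a minimal, hypothetical sketch of such a replay loop; the function names and retry policy are assumptions, not our tooling.

    import time

    def replay_dead_letter_queue(dlq, process_event, max_retries=3, pause_seconds=0.1):
        """Hypothetical sketch of a dead letter queue replay loop.

        `dlq` is any iterable of previously failed events; `process_event`
        is the normal ingestion path."""
        still_failing = []
        for event in dlq:
            for attempt in range(1, max_retries + 1):
                try:
                    process_event(event)
                    break
                except Exception:
                    if attempt == max_retries:
                        still_failing.append(event)      # keep for manual inspection
                    else:
                        time.sleep(pause_seconds * attempt)  # simple linear backoff
        return still_failing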

At 20:24 UTC, PagerDuty determined that all systems had been restored.

What Are We Doing About This

We're working on improvements to our internal communications to provide status updates more quickly in the event of this type of error, as well as more comprehensive visibility into our intra-service communication status.

We have defined a more accurate health check for the load balancers so that this type of memory allocation issue is detected more quickly. We have also implemented a more aggressive automatic restart mechanism for when this condition is detected. Lastly, we are making plans to scale this load balancer cluster more elastically.
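As a rough sketch of the health-check-plus-restart idea (not our actual check, thresholds, or container names, all of which are hypothetical here), a memory-based liveness probe could look like this:

    import subprocess
    import psutil  # third-party library, assumed available for this sketch

    # Hypothetical budget and container name, for illustration only.
    MEMORY_LIMIT_BYTES = 2 * 1024 ** 3   # 2 GiB
    CONTAINER_NAME = "internal-load-balancer"

    def memory_healthy(pid: int) -> bool:
        """Return False if the load balancer process exceeds its memory budget."""
        rss = psutil.Process(pid).memory_info().rss
        return rss < MEMORY_LIMIT_BYTES

    def check_and_restart(pid: int) -> None:
        """If the health check fails, force a restart of the affected container."""
        if not memory_healthy(pid):
            subprocess.run(["docker", "restart", CONTAINER_NAME], check=True)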

Finally, we would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Jun 29, 2020 - 18:22 UTC

Resolved
Event ingestion has recovered, and we are now processing events at our normal rate.
Posted Jun 21, 2020 - 20:23 UTC
Monitoring
We have identified the issue and have implemented a fix. We are seeing improvements in event ingestion and are currently monitoring progress.
Posted Jun 21, 2020 - 20:00 UTC
Investigating
We are currently experiencing issues with event ingestion, causing delays processing incoming events. Events are still processing albeit at a delayed rate. Investigation is ongoing.
Posted Jun 21, 2020 - 19:47 UTC
This incident affected: Events API.