From 18:30 UTC to 20:24 UTC on June 21st, 2020, PagerDuty experienced a major incident, which was published as delays in processing incoming events.
During this time, a significant percentage of events sent via the Events API were being heavily throttled. Events that were accepted were not immediately processed, resulting in delays in incident and notification creation. This affected all customers who sent events to us during this window.
One of our containers hosting our internal load balancers encountered a memory allocation problem at 18:33 UTC due to a bug in our third-party load balancer software. Our event ingestion system saw an increase in the overall number of failed requests from the load balancers, causing our event processing to be automatically throttled. While we were still accepting some events, it was not enough to keep up with all user activity.
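To illustrate how a rising failure rate can translate into automatic throttling, here is a minimal sketch. It is not PagerDuty's implementation: the class name `FailureRateThrottle`, the sliding window, and the `min_accept_fraction` floor are all assumptions made for this example.

```python
from collections import deque


class FailureRateThrottle:
    """Hypothetical throttle: admits a smaller share of traffic as failures rise."""

    def __init__(self, window_size=100, min_accept_fraction=0.1):
        # Sliding window of recent request outcomes (True = success).
        self.outcomes = deque(maxlen=window_size)
        self.min_accept_fraction = min_accept_fraction

    def record(self, success):
        self.outcomes.append(bool(success))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def accept_fraction(self):
        # Admit less traffic as the failure rate rises, but never drop to
        # zero, so some events are still accepted.
        return max(self.min_accept_fraction, 1.0 - self.failure_rate())


# Example: half of the last ten requests through the load balancers failed.
throttle = FailureRateThrottle(window_size=10)
for ok in [True] * 5 + [False] * 5:
    throttle.record(ok)
print(throttle.failure_rate(), throttle.accept_fraction())
```

In a design like this, a fault in a single upstream component can reduce accepted traffic across the board, which matches the behavior described above: some events are still accepted, but not enough to keep up.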
Within minutes, PagerDuty engineers were notified of the issue; however, internal communication issues led to delays in the publishing of incident updates.
At 19:51 UTC, the problem was isolated to the load balancers, and our engineers forced a rolling restart on those containers. Shortly thereafter, events began processing at a normal rate, and the backlog of events began to shrink.
At 20:17 UTC, the backlog was cleared, and we began replaying events from the dead letter queue.
At 20:24 UTC, PagerDuty determined that all systems had been restored.
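For readers unfamiliar with dead letter queues, the replay step in the timeline above can be sketched roughly as follows. This is a generic illustration, not our production code: `replay_dead_letters`, the `process` callable, and the `max_attempts` limit are hypothetical names chosen for the example.

```python
from collections import deque


def replay_dead_letters(dead_letters, process, max_attempts=3):
    """Re-run events from a dead letter queue; return those that still fail.

    `dead_letters` is a deque of events and `process` is a callable that
    raises on failure. Both are hypothetical stand-ins.
    """
    still_failing = []
    while dead_letters:
        event = dead_letters.popleft()  # oldest first, preserving arrival order
        for _attempt in range(max_attempts):
            try:
                process(event)
                break  # processed successfully, move on to the next event
            except Exception:
                continue  # try again, up to max_attempts times
        else:
            # Every attempt failed; keep the event for manual follow-up.
            still_failing.append(event)
    return still_failing


# Example with one event that can never be processed.
processed = []


def process(event):
    if event == "bad":
        raise ValueError("cannot process")
    processed.append(event)


dlq = deque(["e1", "bad", "e2"])
leftover = replay_dead_letters(dlq, process)
print(processed, leftover)
```

Replaying oldest-first keeps events in roughly the order they arrived, and returning the events that still fail avoids silently dropping them.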
We're working on improvements to our internal communications to provide status updates more quickly in the event of this type of error, as well as more comprehensive visibility into our intra-service communication status.
We have defined a more accurate health check for the load balancers, in order to more quickly determine that this type of memory allocation issue has occurred. We have also implemented a more aggressive automatic restart mechanism when this condition is detected. Lastly, we are making plans to allow us to scale this load balancer cluster more elastically.
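As a rough illustration of pairing a health check with an automatic restart, consider the sketch below. It is an assumption-laden example rather than a description of our tooling: `RestartSupervisor`, the `probe` and `restart` callables, and the `failure_threshold` parameter are hypothetical. A more accurate probe might exercise a real request path instead of only confirming that the process is alive.

```python
class RestartSupervisor:
    """Hypothetical supervisor: restarts a target after repeated failed probes."""

    def __init__(self, probe, restart, failure_threshold=2):
        self.probe = probe      # callable returning True when the target is healthy
        self.restart = restart  # callable that restarts the target
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def tick(self):
        """Run one health check; return True if a restart was triggered."""
        try:
            healthy = bool(self.probe())
        except Exception:
            healthy = False  # a probe that errors counts as unhealthy
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.restart()
            self.consecutive_failures = 0
            return True
        return False


# Example: two consecutive failed probes trigger one restart.
results = iter([True, False, False, True])
restarts = []
supervisor = RestartSupervisor(
    probe=lambda: next(results),
    restart=lambda: restarts.append("lb-1"),
)
triggered = [supervisor.tick() for _ in range(4)]
print(triggered, restarts)
```

Lowering `failure_threshold` makes the restart more aggressive, at the cost of reacting to transient probe failures.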
Finally, we would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.