On November 14, from 21:55 UTC to 22:42 UTC, PagerDuty experienced a major incident that degraded event ingestion and event processing. Global events ingestion and email ingestion were unaffected.
During this period, events arriving at our Events API were accepted and buffered for delayed processing, up to a limit. Once a routing key exceeded its allocated event buffer, further service events for that key were rejected by the Events API due to throttling. Events that were accepted were not processed immediately, resulting in delays of as much as 48 minutes in incident and notification creation. This affected all customers who sent events to a service integration via our Events API during this window.
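The accept-then-throttle behavior described above can be sketched in a few lines. This is an illustration only, not our actual implementation: the class `PerKeyEventBuffer`, its methods, and the limit of 2 are all hypothetical names and values chosen for the example.

```python
from collections import defaultdict, deque


class PerKeyEventBuffer:
    """Hypothetical bounded buffer with one queue per routing key."""

    def __init__(self, limit_per_key):
        self.limit_per_key = limit_per_key
        self.queues = defaultdict(deque)

    def accept(self, routing_key, event):
        """Return True if the event was buffered, False if it was throttled."""
        queue = self.queues[routing_key]
        if len(queue) >= self.limit_per_key:
            return False  # this key's buffer is full: reject the event
        queue.append(event)
        return True

    def process_one(self, routing_key):
        """Remove and return the oldest buffered event for a key, or None."""
        queue = self.queues[routing_key]
        return queue.popleft() if queue else None


# Assumed limit of 2 events per key, purely for demonstration.
buffer = PerKeyEventBuffer(limit_per_key=2)
print([buffer.accept("key-a", n) for n in range(3)])  # [True, True, False]
print(buffer.accept("key-b", 0))  # True: other keys are unaffected
```

In this sketch a saturated key is rejected without affecting other keys, and it starts accepting again as soon as processing drains its queue.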
The root cause was a previously unknown bug in a recent version of a third-party HTTP client library that PagerDuty uses. Routine periodic timeouts triggered this bug in the stage that handles events after they are accepted, where the effects slowly accrued until the allotted HTTP connections were exhausted. Once they were exhausted, the HTTP connection pools could no longer supply the connections required for events to progress, and event processing capacity was degraded. As the accepted event buffer filled up, the Events API began rejecting events due to rate limiting for routing keys that had saturated their allocated event buffer.
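To illustrate how routine timeouts can gradually exhaust a connection pool, here is a minimal sketch of one common way such a leak arises: a connection that is not returned on the timeout path. It is not the actual library bug, whose details are not covered in this summary. `ConnectionPool`, `send_leaky`, `send_safe`, and `run` are hypothetical names, and the pool size of 3 is arbitrary.

```python
class ConnectionPool:
    """Hypothetical fixed-size pool that only counts connections in use."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1

    def release(self):
        self.in_use -= 1


def send_leaky(pool, times_out):
    """Leaky pattern: the connection is never released when a timeout occurs."""
    pool.acquire()
    if times_out:
        raise TimeoutError("request timed out")  # connection leaks here
    pool.release()


def send_safe(pool, times_out):
    """Safe pattern: `finally` returns the connection on every path."""
    pool.acquire()
    try:
        if times_out:
            raise TimeoutError("request timed out")
    finally:
        pool.release()


def run(send, pool, outcomes):
    """Send one request per outcome; return the connections still held."""
    for times_out in outcomes:
        try:
            send(pool, times_out)
        except TimeoutError:
            pass  # a routine timeout, handled by the caller
    return pool.in_use


outcomes = [False, True, False, True, True]  # three timeouts out of five
print(run(send_leaky, ConnectionPool(3), outcomes))  # 3: pool is now exhausted
print(run(send_safe, ConnectionPool(3), outcomes))   # 0: nothing leaked
```

Each timeout in the leaky variant permanently removes one connection from circulation, so the pool fails only after enough timeouts have accumulated, which is consistent with a slow degradation rather than an immediate failure.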
To resolve immediate customer impact, the affected service instances were restarted at 22:32 UTC, restoring event processing to full capacity. Events already accepted and stored in the buffer were processed, and the Events API fully recovered. The HTTP client library was downgraded to the previous known working version within one hour of the major incident's resolution, while we monitored the health of the Events API.
We identified several factors that contributed to this incident. We are committed to addressing them, both to prevent such incidents in the future and to preempt any impact of a similar incident on the service we provide to our customers. We are taking the following actions:
Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions on any of this, please reach out to firstname.lastname@example.org.