Delay in processing of incoming events
Incident Report for PagerDuty
Postmortem

Summary

On November 14, from 21:55 UTC to 22:42 UTC, PagerDuty experienced a major incident that degraded event ingestion and event processing. Global event ingestion and email ingestion were not affected.

During this period, events arriving at our Events API were accepted and queued for processing, up to a limit: once the event buffer allocated to a routing key was exhausted, further events for that routing key were rejected by the Events API due to throttling. Events that were accepted were not processed immediately, resulting in delays of up to 48 minutes in incident and notification creation. This affected all customers who sent events to a service integration via our Events API during this window.
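
For integrations that send events to the Events API, this throttling surfaces to the sender as an HTTP 429 Too Many Requests response. As a minimal illustration only (Python with the requests library; the routing key, retry count, and backoff values below are placeholders, not PagerDuty guidance), a sender can back off and retry a throttled event rather than drop it:

    import time
    import requests

    EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

    def send_event(routing_key, summary, source, max_attempts=5):
        # Build a minimal Events API v2 trigger payload.
        payload = {
            "routing_key": routing_key,      # integration key of the target service
            "event_action": "trigger",
            "payload": {"summary": summary, "source": source, "severity": "error"},
        }
        delay = 1
        for _ in range(max_attempts):
            resp = requests.post(EVENTS_API_URL, json=payload, timeout=10)
            if resp.status_code == 202:      # accepted for (possibly buffered) processing
                return resp.json()
            if resp.status_code == 429:      # throttled: back off and retry
                time.sleep(delay)
                delay = min(delay * 2, 30)
                continue
            resp.raise_for_status()          # other errors are not retried in this sketch
        raise RuntimeError("event still throttled after %d attempts" % max_attempts)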

What Happened

The root cause was a bug in a recent version of a third-party HTTP client library used (but not owned) by PagerDuty. Normal, periodic request timeouts triggered this bug in the step that handles events after they have been accepted; the affected HTTP connections slowly accrued until the allotted connections were exhausted. Once exhausted, the HTTP connection pools could no longer supply the connections required for events to progress, and event processing capacity was degraded. As the buffer of accepted events filled up, the Events API started rejecting events, via rate limiting, for routing keys that saturated their allocated event buffer.
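
PagerDuty has not published the library or the exact defect, so the following is only a toy sketch (Python, not PagerDuty's code) of the general failure pattern: a bounded connection pool, a handler that fails to return a connection when a request times out, and eventual exhaustion once enough otherwise-normal timeouts have occurred.

    import queue

    class ConnectionPool:
        # Toy bounded pool: a fixed number of connection slots handed out and returned.
        def __init__(self, size):
            self._slots = queue.Queue()
            for i in range(size):
                self._slots.put("conn-%d" % i)

        def acquire(self, timeout=0.1):
            return self._slots.get(timeout=timeout)   # raises queue.Empty when exhausted

        def release(self, conn):
            self._slots.put(conn)

    def handle_event(pool, times_out):
        # Buggy handler: on a timeout it returns early without releasing the connection.
        conn = pool.acquire()
        try:
            if times_out:
                raise TimeoutError("upstream call timed out")
            # ... forward the event over `conn` ...
        except TimeoutError:
            return False      # BUG: the slot is leaked; a correct client releases in `finally`
        pool.release(conn)
        return True

    pool = ConnectionPool(size=4)
    for n in range(12):
        try:
            handle_event(pool, times_out=(n % 3 == 0))   # occasional, otherwise-normal timeouts
        except queue.Empty:
            print("event %d delayed: connection pool exhausted" % n)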

To resolve the immediate customer impact, the affected service instances were restarted at 22:32 UTC, restoring event processing to full capacity. Events already accepted and stored in the buffer were then processed, and the Events API fully recovered. Within an hour of resolving the major incident, the HTTP client library was downgraded to the previous known-good version while we monitored the health of the Events API.

What We Are Doing About It

Several factors were identified as contributing to this incident. We are committed to addressing them, both to prevent such incidents in the future and to limit the impact on our customers should a similar incident occur. The actions we are taking are:

  1. Better observability: we have added new alerting for the service's HTTP connection pool exhaustion and are enhancing our metrics on connection pool utilization. Visibility into the state of the connection pools will give us early warning before service degradation can occur.
  2. Automatic resolution: a new service health check is being implemented to identify degraded instances that have not done useful work for a period of time. When an instance fails the health check, our container orchestrator proactively restarts it, and the team is notified asynchronously so the underlying issue can be investigated (a sketch of such a check follows this list).
  3. Prevention: we are reviewing and migrating off dependency versions that contain known bugs. Going forward, we are also looking into automatically detecting blacklisted dependency versions, similar to the vulnerability scanning we already perform (see the second sketch after this list).
  4. Early warning alert: a new alert is being added that notifies the team when processing rates drop at the specific event-handling step that experienced this issue. It complements existing alerts that separately monitor event intake (before this step) and event processing speed (after it). While the existing alerting did successfully page on-call engineers, this more targeted and aggressive alerting lets us catch issues earlier, before they have a chance to become an incident.
  5. Response tooling improvements: we are making usability improvements to our on-call and incident response tooling to strengthen our incident response process.
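
As a rough sketch of the health check described in item 2 (Python; the endpoint name, port, and stall threshold are placeholders, and the restart itself is performed by the container orchestrator, not the service), an instance can report itself unhealthy once it has gone too long without completing useful work:

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    STALL_THRESHOLD_SECONDS = 300          # placeholder "no useful work" window
    last_progress_at = time.monotonic()

    def record_progress():
        # Called by the event-processing loop each time an event is fully handled.
        global last_progress_at
        last_progress_at = time.monotonic()

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_response(404)
                self.end_headers()
                return
            stalled = time.monotonic() - last_progress_at > STALL_THRESHOLD_SECONDS
            # A 503 tells the orchestrator to restart this instance; a real check would
            # also distinguish "no work arriving" from "work arriving but stuck".
            self.send_response(503 if stalled else 200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()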
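
And as an illustration of the automated dependency check mentioned in item 3 (the file name, file format, and package names are placeholders; our build tooling will differ), a CI step could fail the build when a lockfile resolves to a version known to contain a serious bug, in the same spirit as existing vulnerability scanning:

    BLOCKED_VERSIONS = {
        # (package, version) pairs known to contain serious bugs; illustrative entries only
        ("some-http-client", "2.4.0"),
    }

    def find_blocked(lockfile_path="requirements.lock"):
        violations = []
        with open(lockfile_path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "==" not in line:
                    continue
                name, version = line.split("==", 1)
                if (name.lower(), version) in BLOCKED_VERSIONS:
                    violations.append(line)
        return violations

    if __name__ == "__main__":
        bad = find_blocked()
        if bad:
            raise SystemExit("blocked dependency versions found: " + ", ".join(bad))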

Finally, we’d like to apologize for the impact this incident had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Dec 01, 2020 - 23:50 UTC

Resolved
We have seen a full recovery in the processing of incoming events and will continue to monitor the situation.
Posted Nov 14, 2020 - 22:57 UTC
Monitoring
We are seeing signs of significant recovery in the processing of incoming events, and we are continuing to monitor.
Posted Nov 14, 2020 - 22:48 UTC
Investigating
We are currently experiencing issues with event ingestion, which have caused delays in processing incoming events. Delayed events are still being processed, however, and with diminishing latency. Investigation is ongoing.
Posted Nov 14, 2020 - 22:41 UTC
This incident affected: Events API.