Delayed event processing
Incident Report for PagerDuty
Postmortem

Summary

On February 27th beginning at 4:24 AM UTC, the Events API’s data processing pipeline slowed down due to an issue in one of its component microservices. Our engineers worked to identify the cause, and at 6:20 AM UTC final measures were taken to prevent it from recurring. The event processing pipeline fully recovered at 6:27 AM UTC.

During the incident, as much as 8.0% of notifications were significantly delayed because incident triggering was delayed, and approximately 0.5% of the events ingested during this period were dropped as a side effect of the necessary remedial actions. This incident affected only service-level integrations that use the Events API; other integrations, such as email-based and Global Event Routing integrations, were unaffected.

What Happened?

The microservice in our event processing pipeline that is responsible for evaluating service-level event rules experienced high CPU load. This was caused by a bug that manifested when processing events with a certain unexpected format. Our engineers worked to keep the pipeline flowing while searching for the problematic events; during the initial investigation, a fraction of the traffic was shunted to an offline queue for later reprocessing, as illustrated in the sketch below.
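
To illustrate the shunting pattern only, here is a minimal, self-contained sketch. Every name in it (evaluate_event_rules, offline_queue, the event shape) is hypothetical and does not describe our internal systems; it simply shows suspect events being parked offline so the rest of the pipeline keeps flowing.

```python
import queue

# Hypothetical in-memory stand-in for the offline queue of deferred events.
offline_queue: "queue.Queue[dict]" = queue.Queue()
processed_results: list[dict] = []

def evaluate_event_rules(event: dict) -> dict:
    # Stand-in for the service-level rule evaluation step. Assume events whose
    # payload is not a JSON object trip the kind of bug described above.
    if not isinstance(event.get("payload"), dict):
        raise ValueError("unexpected event format")
    return {"routing_key": event["routing_key"], "matched_rules": []}

def process(event: dict) -> None:
    try:
        processed_results.append(evaluate_event_rules(event))
    except ValueError:
        # Shunt rather than block: park the suspect event offline so rule
        # evaluation keeps flowing for everything else.
        offline_queue.put(event)

events = [
    {"routing_key": "svc-1", "payload": {"severity": "error"}},
    {"routing_key": "svc-2", "payload": "not-a-json-object"},
]
for evt in events:
    process(evt)

print(f"processed={len(processed_results)} deferred={offline_queue.qsize()}")
```

In a production pipeline the offline queue would be a durable store rather than an in-memory queue, so that deferred events survive restarts and can be replayed once the underlying bug is understood.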

After pipeline ingestion was believed to have been isolated from the problematic events, our engineers attempted to reprocess the deferred events. However, additional problematic events in the offline queue caused a resurgence of the original issue. Because our internal tooling at the time could not filter out the problematic events, the decision was made to drop the remaining deferred events to prevent further impact to the service.

What Are We Doing About This?

As part of our ongoing commitment to improving the operational resilience of our Events API, we are decommissioning the system affected by this issue and replacing it with our newer, more general-purpose event automation rule service. The newer service has been extended to perform the same processing operations as the older rule evaluation service, in addition to handling global event rules and routing. Customers should not experience any functional difference, and our processing pipeline as a whole should no longer be susceptible to the previous issue.

Additionally, we are improving our operational tooling to allow more fine-grained control over the reprocessing of events from the offline queue, and to let us more quickly identify the scope of customer impact should it again become necessary to drop any events in that queue. A sketch of this kind of filtered replay follows.
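
As a rough illustration of what such fine-grained replay tooling can look like, the sketch below replays deferred events through a filter predicate and summarizes what it skips. The replay_deferred function, is_safe predicate, and routing_key field are assumptions for this example, not a description of our internal tooling.

```python
from typing import Callable, Iterable

def replay_deferred(deferred_events: Iterable[dict],
                    is_safe: Callable[[dict], bool],
                    reinject: Callable[[dict], None]) -> dict:
    """Re-inject only events that pass the filter; summarize what was skipped."""
    replayed = 0
    skipped_routing_keys: set[str] = set()
    for event in deferred_events:
        if is_safe(event):
            reinject(event)
            replayed += 1
        else:
            # Track which integrations (routing keys) would be affected if the
            # skipped events were ultimately dropped.
            skipped_routing_keys.add(event.get("routing_key", "unknown"))
    return {"replayed": replayed,
            "skipped_routing_keys": sorted(skipped_routing_keys)}

# Example usage with an in-memory stand-in for the offline queue.
deferred = [
    {"routing_key": "svc-1", "payload": {"severity": "error"}},
    {"routing_key": "svc-2", "payload": "not-a-json-object"},
]
report = replay_deferred(
    deferred,
    is_safe=lambda e: isinstance(e.get("payload"), dict),
    reinject=lambda e: None,  # would hand the event back to the live pipeline
)
print(report)
```

A summary of the skipped routing keys is what would let us report the scope of customer impact quickly if the remaining deferred events ultimately had to be dropped.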

We understand how critical our platform is for our customers, and we stand by our commitment to providing the most reliable and resilient platform in the industry. We regret the impact that this incident may have had on you and your organization.

Posted Mar 13, 2019 - 00:19 UTC

Resolved
We have identified the cause of the issue, implemented a fix, and our systems have recovered.
Posted Feb 27, 2019 - 06:20 UTC
Update
We are continuing to investigate and take action to mitigate delays in event processing, and have observed improvement but not full recovery.
Posted Feb 27, 2019 - 05:31 UTC
Investigating
We are investigating an issue causing delays in inbound event processing.
Posted Feb 27, 2019 - 04:50 UTC
This incident affected: Events API and Notification Delivery.