Issue with Global Routing Event Processing
Incident Report for PagerDuty
Postmortem

Summary

On July 11th, 2019, beginning at 16:14 UTC, PagerDuty’s Events API experienced a performance slowdown that delayed processing of events using service-level integrations. The slowdown was fully resolved by 16:38 UTC, when event processing for service-level integrations returned to normal.

Meanwhile, starting at 16:22 UTC, events using Global Event Rules and Team Rulesets (early access) experienced a second slowdown, lasting until 17:45 UTC.

Overall, 2.9% of notifications were delayed as a result of both event processing slowdowns.

Email integrations, the mobile app, our REST API and web app were not affected.

What Happened?

The delay on events using service-level integrations began when an external datastore used by one of the microservices in the events pipeline slowed down. The number of queued write requests then ballooned until the microservice ran out of memory and crashed.

Additionally, in a separate microservice, a bug in a dormant code path was triggered, which caused the slowdown for events using Global Event Rules and Team Rulesets.

Although these incidents occurred at around the same time, they were unrelated.

What Are We Doing About This?

We are adding measures so that when external datastores slow down, our microservices are not impacted as severely.
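
As an illustration only, one common way to limit the impact of a slow datastore is to cap the number of queued writes so that memory use stays bounded. The sketch below is hypothetical: this report does not describe the specific measures being added, and the names `BoundedWriteBuffer`, `submit`, `drain`, and `max_pending` are assumptions made for the example.

```python
import queue


class BoundedWriteBuffer:
    """Hypothetical buffer that caps queued writes instead of growing without limit."""

    def __init__(self, max_pending: int):
        # A fixed maxsize keeps memory use bounded even if the datastore stalls.
        self._pending = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, write_request) -> bool:
        """Queue a write; return False (shed load) when the buffer is full."""
        try:
            self._pending.put_nowait(write_request)
            return True
        except queue.Full:
            # The caller decides what to do next: retry later, spill to disk, etc.
            self.rejected += 1
            return False

    def drain(self, datastore_write, limit: int) -> int:
        """Send up to `limit` queued writes to the datastore; return the count sent."""
        sent = 0
        while sent < limit:
            try:
                request = self._pending.get_nowait()
            except queue.Empty:
                break
            datastore_write(request)
            sent += 1
        return sent

    def pending(self) -> int:
        """Return the number of writes currently waiting in the buffer."""
        return self._pending.qsize()
```

With a cap in place, a slow datastore causes rejected or deferred writes that the caller can handle explicitly, rather than an ever-growing queue that ends in an out-of-memory crash.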

Additionally, we are changing our tooling so that we are notified about anomalies in resource usage. This will allow us to take proactive measures in the future.
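
For readers curious what anomaly detection on resource usage can look like, the following is a minimal sketch that flags a sample when it deviates sharply from a rolling baseline. It is an assumption made for illustration, not a description of the tooling mentioned above; `ResourceUsageMonitor`, `window`, and `threshold` are hypothetical names.

```python
from collections import deque
from statistics import mean, pstdev


class ResourceUsageMonitor:
    """Hypothetical monitor that flags samples far from the recent rolling baseline."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        # Only the most recent `window` samples form the baseline.
        self._samples = deque(maxlen=window)
        self._threshold = threshold

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it looks anomalous against recent history."""
        anomalous = False
        # Wait for a few samples before judging, so the baseline is meaningful.
        if len(self._samples) >= 5:
            baseline = mean(self._samples)
            spread = pstdev(self._samples)
            # With zero spread (all samples identical) nothing is flagged in this sketch.
            if spread > 0 and abs(value - baseline) > self._threshold * spread:
                anomalous = True
        self._samples.append(value)
        return anomalous
```

A caller might feed this one memory-usage sample per scrape interval and raise an alert whenever `observe` returns True, which gives a chance to act before a process exhausts its memory.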

We recognize that our customers rely heavily on our Events API, and we apologize for this slowdown and the resulting notification delays. If you have questions about this issue, please contact our support team at support@pagerduty.com.

Posted Jul 18, 2019 - 15:14 UTC

Resolved
Event processing for Global Routing keys and emails has fully recovered.
Posted Jul 11, 2019 - 18:08 UTC
Update
Events for Global Routing keys and emails from the past hour are now being processed, so customers may see old events being created and updated in their account.
Posted Jul 11, 2019 - 17:50 UTC
Monitoring
We have deployed a solution and are currently monitoring event processing for signs of recovery.
Posted Jul 11, 2019 - 17:25 UTC
Identified
We have identified the issue and are working on deploying a solution.
Posted Jul 11, 2019 - 17:08 UTC
Investigating
We are currently experiencing an issue with event processing through our Events APIs using Global Routing keys and emails. Our engineering team is actively investigating.
Posted Jul 11, 2019 - 16:59 UTC
This incident affected: Events API.