On July 11th, beginning at 16:14 UTC, PagerDuty’s Events API experienced a performance slowdown that delayed event processing for service-level integrations. The slowdown was fully resolved by 16:38 UTC, and event processing for service-level integrations returned to normal.
Meanwhile, starting at 16:22 UTC, events using Global Event Rules and Team Rulesets (early access) experienced a second slowdown, lasting until 17:45 UTC.
Overall, 2.9% of notifications were delayed as a result of both event processing slowdowns.
Email integrations, the mobile app, our REST API, and our web app were not affected.
For the delay on events using service-level integrations, an external datastore used by one of the microservices in the events pipeline slowed down. This caused the number of queued write-requests to balloon until the microservice ran out of memory and crashed.
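This failure mode is a classic unbounded-queue problem: when the downstream datastore slows, pending writes accumulate faster than they drain. A minimal sketch (not PagerDuty’s actual code; the `BoundedWriter` class and its parameters are hypothetical) shows how capping the queue and shedding excess load keeps memory bounded when a datastore slows down:

```python
import queue

class BoundedWriter:
    """Illustrative bounded write queue with load shedding."""

    def __init__(self, max_pending=1000):
        # maxsize caps memory; an unbounded queue here is what lets a
        # slow datastore balloon memory usage until the process crashes.
        self._queue = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, write_request):
        """Enqueue a write; shed load instead of buffering without limit."""
        try:
            self._queue.put_nowait(write_request)
            return True
        except queue.Full:
            # Backpressure: surface the failure to the caller (retry,
            # dead-letter, or alert) rather than exhausting memory.
            self.rejected += 1
            return False

    def drain(self, handler):
        """Worker loop: pass queued writes to the (possibly slow) datastore."""
        while True:
            try:
                req = self._queue.get_nowait()
            except queue.Empty:
                break
            handler(req)

writer = BoundedWriter(max_pending=3)
results = [writer.submit(i) for i in range(5)]
# The first 3 writes are accepted; the last 2 are shed once the queue fills.
```

The key design choice is making the overload condition explicit (a rejected write the caller can handle) rather than implicit (memory growth followed by a crash).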
Additionally, in a separate microservice, a bug in a dormant code path was triggered, which caused the slowdown for events using Global Event Rules and Team Rulesets.
Although these incidents occurred at around the same time, they were unrelated.
We are adding measures so that when external datastores slow down, our microservices are not impacted as severely.
Additionally, we are changing our tooling so that we are notified about anomalies in resource usage. This will allow us to take proactive measures in the future.
We recognize that our customers rely heavily on our Events API, and we apologize for this slowdown and the resulting notification delays. If you have questions about this issue, please contact our support team at firstname.lastname@example.org.