Event Processing Delays
Incident Report for PagerDuty


On Thursday, May 4th, 2023, between 15:23 UTC and 16:16 UTC, event processing in our EU service region was delayed for roughly 1.3% of our customers. During this window, customers may have experienced slower-than-usual response times when using our web or mobile applications. This incident did not impact other service regions.

What Happened

On May 4th at 15:23 UTC, as part of ongoing system upgrades, we were performing rolling restarts of the servers in one of our distributed synchronization clusters. During one restart sequence, an instance was rebooted out of order, which took the entire cluster out of service. Internal requests that manage event processing timed out until the cluster came back online. By 15:34 UTC, the cluster was back online and requests were being processed successfully. As a side effect, however, starting at 15:45 UTC a single client in our application held onto a lock that should have been discarded. This delayed event processing for approximately 30 minutes, until a rolling restart of the application freed the lock. By 16:14 UTC, we saw recovery, and the remaining events were processed within 2 minutes.
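
For readers unfamiliar with this failure mode, the following is an illustrative sketch only and does not describe the affected system or our application code. It assumes a hypothetical in-memory `LeaseLock` in which ownership lapses after a time-to-live unless it is renewed, which is one common way to keep a single stale holder from blocking other clients indefinitely. The class name, the `ttl_seconds` parameter, and the client identifiers are all invented for the example.

```python
import time
from typing import Callable, Optional


class LeaseLock:
    """Hypothetical in-memory lock whose ownership lapses after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._owner: Optional[str] = None
        self._expires_at = 0.0

    def acquire(self, client_id: str) -> bool:
        """Take or renew the lock; return False if another client holds a live lease."""
        now = self._clock()
        lease_lapsed = now >= self._expires_at
        if self._owner is None or lease_lapsed or self._owner == client_id:
            # A lapsed lease is treated as released, so a stale holder
            # cannot block other clients forever.
            self._owner = client_id
            self._expires_at = now + self._ttl
            return True
        return False

    def release(self, client_id: str) -> None:
        """Release the lock; a no-op unless `client_id` is the current owner."""
        if self._owner == client_id:
            self._owner = None
```

In this sketch, a client that stops renewing its lease loses the lock once the time-to-live passes, so recovery does not depend on restarting the holder. The trade-off is that the time-to-live must be longer than the longest legitimate critical section.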

What Are We Doing About This

We plan to approach this problem from a few angles. The first is the system itself: later versions of the affected system handle this scenario more gracefully, so we intend to upgrade it to its latest version. We will also put additional observability in place so that we are alerted when a lock is stuck and can remediate faster. Finally, we are looking into improving the overall upgrade process for this system in order to reduce the opportunities for error.
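
As a rough illustration of what a stuck-lock alert might check, the sketch below flags any lock that has been held longer than a threshold. It is a minimal example under stated assumptions and not our monitoring implementation: the `LockInfo` record, the `find_stuck_locks` function, and the threshold value are hypothetical, and a real check would read lock metadata from the synchronization system rather than from a list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LockInfo:
    """Hypothetical snapshot of one held lock."""
    name: str
    holder: str
    acquired_at: float  # seconds, on the same clock as `now`


def find_stuck_locks(locks: Iterable[LockInfo], now: float,
                     max_hold_seconds: float) -> List[LockInfo]:
    """Return locks held longer than `max_hold_seconds`, longest-held first."""
    stuck = [lock for lock in locks if now - lock.acquired_at > max_hold_seconds]
    return sorted(stuck, key=lambda lock: lock.acquired_at)


if __name__ == "__main__":
    # Example data is invented for illustration.
    snapshot = [
        LockInfo("event-processing", "client-1", acquired_at=0.0),
        LockInfo("other-work", "client-2", acquired_at=950.0),
    ]
    for lock in find_stuck_locks(snapshot, now=1000.0, max_hold_seconds=300.0):
        print(f"ALERT: lock {lock.name!r} held by {lock.holder} for too long")
```

A check like this would typically run on a schedule and page an on-call responder when the returned list is non-empty.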

We sincerely apologize for the delayed events and degraded experience in the EU service region. We’ll work toward preventing similar incidents in the future. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted May 11, 2023 - 21:13 UTC

We have resolved an incident where a small number of PagerDuty customers in the EU service region experienced delays in event processing. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted May 04, 2023 - 16:27 UTC
We are monitoring recovery from an incident involving event processing delays in the EU service region. We have deployed a fix, and we expect systems to continue to improve. We will provide an update within 15 minutes.
Posted May 04, 2023 - 16:23 UTC
We are investigating a potential issue within PagerDuty. If we confirm an impact, we will update within 15 minutes. If there is no impact, this notification will be removed.
Posted May 04, 2023 - 16:05 UTC
This incident affected: Events API (Events API (EU)).