On Thursday, May 4th, 2023, between 15:23 UTC and 16:16 UTC, event processing in our EU service region was delayed for roughly 1.3% of our customers. During this window, customers may have experienced slower-than-usual response times when using our Web or Mobile applications. This incident did not impact other service regions.
On May 4th at 15:23 UTC, as part of ongoing system upgrades, we were performing rolling restarts of servers in one of our distributed synchronization clusters. During one such restart sequence, an instance was rebooted out of order, which took the entire cluster out of service. As a result, the internal requests that manage event processing timed out until the cluster came back online. By 15:34 UTC, the system was back online and requests were being processed successfully. As a side effect, however, starting at 15:45 UTC a single client in our application held onto a lock that should have been discarded. This delayed event processing for approximately 30 minutes, until a rolling restart of the application freed the lock. By 16:14 UTC, we observed recovery, and the remaining events were processed within 2 minutes.
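To illustrate the kind of safeguard that keeps a restart sequence in order, here is a minimal sketch of a rolling restart that handles one node at a time and halts as soon as a restarted node fails to report healthy. It is not our actual tooling: the `restart` and `is_healthy` callables, the node names, and the `max_checks` parameter are all hypothetical placeholders.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_checks: int = 3,
) -> List[str]:
    """Restart nodes one at a time, strictly in the order given.

    `restart` and `is_healthy` are hypothetical hooks supplied by the caller.
    The procedure raises as soon as a restarted node does not report healthy,
    so it never moves on to the next node while the previous one is down.
    Returns the nodes that were restarted successfully.
    """
    done: List[str] = []
    for node in nodes:
        restart(node)
        # Poll the health check a bounded number of times; a real procedure
        # would also wait between polls.
        if not any(is_healthy(node) for _ in range(max_checks)):
            raise RuntimeError(f"{node} did not recover; halting after {done}")
        done.append(node)
    return done
```

Passing the restart and health-check steps in as callables keeps the ordering logic separate from any particular cluster technology, which is why no specific system is assumed here.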
We plan to approach this problem from a few angles. The first is the system itself: later versions of the affected system handle this scenario more gracefully, so we intend to upgrade it to the latest version. Second, we will put additional observability in place so that we are alerted when a lock is stuck and can remediate faster. Finally, we are looking into improving the overall upgrade process for this system to reduce points of error.
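As a rough illustration of what stuck-lock alerting could look like, the sketch below flags any lock that has been held longer than a threshold. It assumes a hypothetical inventory of held locks with acquisition timestamps; the `LockInfo` type, the `find_stuck_locks` function, and the threshold value are illustrative and do not describe our monitoring stack.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LockInfo:
    holder: str         # identifier of the client holding the lock
    acquired_at: float  # acquisition time in seconds, on the same clock as `now`


def find_stuck_locks(
    locks: Dict[str, LockInfo],
    max_hold_seconds: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return the names of locks held longer than `max_hold_seconds`, sorted."""
    if now is None:
        now = time.time()
    return sorted(
        name
        for name, info in locks.items()
        if now - info.acquired_at > max_hold_seconds
    )
```

A periodic job could call `find_stuck_locks` and page the on-call engineer whenever the result is non-empty. The right threshold depends on how long a healthy client normally holds the lock.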
We sincerely apologize for the delayed events and degraded experience in the EU service region. We’ll work toward preventing similar incidents in the future. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.