Delayed Notifications
Incident Report for PagerDuty

Summary

On May 17, 2017 between 17:00 and 18:00 UTC, we experienced unanticipated side effects of a system-wide load test. We halted the testing as soon as the effects were detected, but notifications were delayed while we continued to process the load we had already generated.

What Happened?

One of the ways PagerDuty maintains and develops resilient infrastructure is by deliberately exposing our systems to controlled and predictable stress tests in order to preemptively identify and fix potential problems. For the most part, there is little to no customer impact when we conduct these tests.

However, during this incident, unforeseen issues with a new load testing platform created higher-than-anticipated load on our system as a whole. The problem was identified almost immediately and the test was halted. A backlog of queued tasks in our infrastructure dependencies then led to delays in incident notifications to customers over the next hour.

What Are We Doing About This?

Our resolution time was affected by factors intrinsic to our system architecture. Consequently, PagerDuty engineers have been working on changes to the supporting infrastructure of our product, and we have already seen measured improvement in our capacity to handle similar scenarios.

We would like to again apologize for any inconvenience this issue caused. If you have any questions, do not hesitate to contact us at support@pagerduty.com.

Posted Aug 11, 2017 - 18:31 UTC

Resolved
This incident has been resolved and notifications are processing normally. We are fully recovered at this time.
Posted May 17, 2017 - 22:55 UTC
Monitoring
We have resolved this issue, and notifications are processing normally. We are monitoring for any further delays; all other services remain functional at this time.
Posted May 17, 2017 - 22:46 UTC
Identified
We are still investigating the issue causing delayed notifications. All other services remain functional at this time.
Posted May 17, 2017 - 22:24 UTC
Investigating
We are currently experiencing an issue causing delays in notification delivery for some accounts. Our web and mobile apps, as well as our APIs, are fully functional at this time.
Posted May 17, 2017 - 21:47 UTC