On May 17, 2017 between 17:00 and 18:00 UTC, we experienced unanticipated side-effects of a system-wide load test. We halted the testing as soon as the effects were detected, but notifications were delayed as we continued to process the load we had already generated.
One of the ways PagerDuty maintains and develops resilient infrastructure is by deliberately exposing our systems to controlled and predictable stress tests in order to preemptively identify and fix potential problems. For the most part, there is little to no customer impact when we conduct these tests.
However, during this incident, unforeseen issues with a new load testing platform created higher than anticipated load on our system as a whole. The problem was identified almost immediately and the test was halted. A backup of queued tasks in our infrastructure dependencies then led to delays in incident notifications to customers over the next hour.
Our resolution time was affected by factors intrinsic to our system architecture. Consequently, PagerDuty engineers have been and are currently working on changes to the supporting infrastructure of our product and have already seen measured improvement in our capacity to handle similar scenarios.
We would like to again apologize for any inconvenience this issue caused. If you have any questions, do not hesitate to contact us at firstname.lastname@example.org.