Delayed Notification Delivery
Incident Report for PagerDuty
Postmortem

Summary

On July 18, 2019, from 15:25 UTC to 16:43 UTC, PagerDuty delivered 24.4% of notifications later than our SLA allows. During this time, no events or notifications were dropped, and the longest delay was 9 minutes beyond SLA.
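
For readers who want to see how figures like the two above are typically derived, here is a minimal sketch. The function name, the sample latencies, and the 120-second threshold are all hypothetical; the SLA value and the underlying data for this incident are not stated in this report.

```python
def sla_breach_summary(latencies_s, sla_s):
    """Return (percent of notifications delivered after the SLA,
    longest overage in seconds beyond the SLA)."""
    if not latencies_s:
        return 0.0, 0.0
    # Overage = how far past the SLA each late notification was delivered.
    overages = [t - sla_s for t in latencies_s if t > sla_s]
    pct_late = round(100.0 * len(overages) / len(latencies_s), 1)
    worst = max(overages, default=0.0)
    return pct_late, worst


# Hypothetical delivery latencies (seconds) and a hypothetical 120 s SLA.
summary = sla_breach_summary([30, 45, 700, 90, 660], sla_s=120)
```

A notification counts as late only if it exceeds the threshold, so the percentage and the worst-case overage are independent measurements of the same incident window.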

What Happened?

At 15:25 UTC, two nodes of the distributed storage system that backs notification delivery stopped responding to requests. Excessive garbage collection on those nodes had completely halted the storage application. As a result, the system that relies on this datastore had problems communicating with the entire cluster, which delayed notification processing.
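
To illustrate how a halted node turns into delay for a client, here is a small simulation. It is a sketch under stated assumptions, not PagerDuty's actual client: the `Node` class, the `read_with_failover` function, the node names, and the timeout value are all hypothetical, and latencies are simulated numbers rather than real network calls.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    # Simulated response time in ms; float("inf") models a node halted by GC.
    latency_ms: float


def read_with_failover(replicas, timeout_ms):
    """Try replicas in order, abandoning each one after timeout_ms.

    Returns (name of the node that answered, or None; total ms spent waiting).
    """
    waited_ms = 0.0
    for node in replicas:
        if node.latency_ms <= timeout_ms:
            return node.name, waited_ms + node.latency_ms
        waited_ms += timeout_ms  # gave up on this node, move to the next
    return None, waited_ms


# Two halted nodes are tried before a healthy one (hypothetical ordering).
replicas = [
    Node("node-a", float("inf")),
    Node("node-b", float("inf")),
    Node("node-c", 5.0),
]
result = read_with_failover(replicas, timeout_ms=100.0)
```

In this simulation the request still succeeds, but only after paying for two timeouts before reaching the healthy node. Without any timeout, the same request would wait until a halted node resumed.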

While notifications were delayed, we discovered fraudulent use of PagerDuty’s test notification functionality. To limit its impact on legitimate notifications, we temporarily disabled test notifications for all customers.

While we were rolling out the changes to block the fraudulent IP addresses and disable the test notification functionality, the affected nodes of the storage system recovered, and our notification subsystem subsequently caught up fully.
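
The two mitigations described here, blocking specific IP addresses and switching off a feature for everyone, can be sketched as a single gate. This is an illustrative sketch only: the function name, the flag, and the blocked range (a reserved documentation range from RFC 5737) are assumptions and do not reflect the actual addresses or implementation involved.

```python
import ipaddress

# Placeholder blocklist: 203.0.113.0/24 is reserved for documentation.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def allow_test_notification(source_ip, enabled, blocked=BLOCKED_NETWORKS):
    """Return True only if the feature is on and the caller is not blocked."""
    if not enabled:
        # Global kill switch: applies to every customer regardless of IP.
        return False
    addr = ipaddress.ip_address(source_ip)
    return not any(addr in net for net in blocked)


# Hypothetical callers: one from the blocked range, one from elsewhere.
blocked_caller = allow_test_notification("203.0.113.9", enabled=True)
normal_caller = allow_test_notification("198.51.100.7", enabled=True)
feature_off = allow_test_notification("198.51.100.7", enabled=False)
```

Checking the global flag first means the feature can be disabled immediately, while the blocklist keeps blocked sources out once the feature is turned back on.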

What Are We Doing About This?

We are actively improving the Java Virtual Machine configuration of the affected system so that the storage application is not impacted as severely by garbage collection. In addition, we have started migrating the storage system to a better topology, which will reduce the impact of individual slow nodes on the entire cluster.

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Jul 25, 2019 - 16:11 UTC

Resolved
We have now fully recovered from issues with notification delivery.
Posted Jul 18, 2019 - 16:37 UTC
Monitoring
Engineering has identified the issue and deployed a fix to remediate delayed notification delivery. We are currently monitoring for signs of recovery.
Posted Jul 18, 2019 - 16:31 UTC
Investigating
We are currently experiencing an issue with delayed notification delivery. Engineering is actively investigating.
Posted Jul 18, 2019 - 16:11 UTC
This incident affected: Notification Delivery.