On July 18, 2019, from 15:25 UTC to 16:43 UTC, PagerDuty was unable to deliver 24.4% of notifications within our SLA. During this time, no events or notifications were dropped, and the longest delay was 9 minutes beyond the SLA.
At 15:25 UTC, two nodes of the distributed storage system responsible for notification delivery stopped responding to requests because excessive garbage collection completely halted the storage system application. As a result, the system that relies on this datastore had problems communicating with the entire cluster, which delayed notification processing.
During the notification delay, we discovered fraudulent use of PagerDuty’s functionality for sending test notifications. To mitigate its impact on legitimate notifications, we temporarily disabled the ability to send test notifications for all customers.
While we were rolling out the changes to block the fraudulent IP addresses and disable the test notification functionality, the affected nodes of the storage system recovered, and our notification subsystem subsequently caught up fully.
We are actively working on improving the Java Virtual Machine configuration of the affected system so that the storage application is not impacted as severely by garbage collection. In addition, we have started work on migrating the storage system to a better topology, which will reduce the impact that individual slow nodes have on the entire cluster.
We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.