Delayed Event Processing
Incident Report for PagerDuty
Postmortem

Issue processing incoming events

Summary

On July 15, 2017, at 17:05 UTC, PagerDuty suffered a service degradation affecting our event pipeline. During a two-hour window, we experienced substantial delays in creating incidents and delivering notifications. We apologize to any customers who were affected by the outage.

What Happened?

At 17:05 UTC, our event storage Cassandra cluster became unstable due to follow-on effects of earlier work to increase the capacity of the system. Our on-call engineers were immediately notified of the issue and worked to restore stability to the Cassandra cluster and the notification pipeline. At 19:23 UTC, the system was once again stable and the backlog of notifications was fully processed.

What Are We Doing About This?

We are immediately undertaking a significant engineering effort to reduce the complexity of future scaling efforts for event storage. We apologize if this degradation impacted your team, and we recognize that our customers rely on us to handle their notifications promptly and reliably. If you have questions or concerns, please contact us at support@pagerduty.com.

Posted Oct 26, 2017 - 20:44 UTC

Resolved
Our systems have fully recovered, having finished processing the backlog of events. All systems are functioning properly at this time.
Posted Jul 15, 2017 - 19:16 UTC
Monitoring
We have resolved the underlying issue and are burning through the backlog of events. No events were dropped and notifications are going out. All other systems are functional.
Posted Jul 15, 2017 - 18:50 UTC
Update
We are still investigating this issue. Events are still being queued at this time.
Posted Jul 15, 2017 - 18:15 UTC
Update
We are still investigating this issue. Events are still being queued at this time.
Posted Jul 15, 2017 - 17:47 UTC
Investigating
We are currently experiencing delays in processing events. Incidents are being queued.
Posted Jul 15, 2017 - 17:31 UTC