On July 15, 2017, at 17:05 UTC, PagerDuty suffered a service degradation affecting our event pipeline. During a two-hour window, we experienced substantial delays in creating incidents and delivering notifications. We apologize to any customers who were affected by the outage.
At 17:05 UTC, our event storage Cassandra cluster became unstable due to follow-on effects of earlier work to increase the capacity of the system. Our on-call engineers were immediately notified of the issue and worked to restore stability to the Cassandra cluster and the notification pipeline. At 19:23 UTC, the system was once again stable and the backlog of notifications had been fully processed.
We are immediately undertaking a significant engineering effort to reduce the complexity of future scaling efforts for event storage. We apologize if this degradation affected your team, and we recognize that our customers rely on us to handle their notifications promptly and reliably. If you have questions or concerns, please contact us at email@example.com.