On October 25, 2018, at 18:39 UTC, we experienced an issue affecting Notification Delivery and Web UI rendering. Web UI actions were still processed, but customers saw stale data for the duration of the incident. No events or notifications were dropped during this time, and the longest notification delay was 11 minutes.
On October 25, 2018, at 18:39 UTC, PagerDuty engineers made a change to our core MySQL database cluster that caused replication to stop. With replication stopped, downstream services such as notifications could no longer process work. We mitigated the issue at 19:07 UTC the same day by skipping certain replication events, which resumed replication and allowed the downstream services to start working again.
PagerDuty uses multiple MySQL clusters to support many of our services. To scale out database capacity, we use MySQL replication to distribute reads across multiple nodes. As part of the change to our database nodes, we ran certain DDL statements that caused replication to stop because of a bug in that cluster's implementation. We resolved this by skipping over the duplicated DDL statements affected by the bug.
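As a rough illustration of what skipping replication events can look like, the sketch below builds the statement sequence commonly used on a stopped MySQL replica. It is an assumption, not a record of the commands run during this incident: the helper name `skip_event_statements` is hypothetical, and `sql_slave_skip_counter` only applies to replication based on binary log positions rather than GTIDs.

```python
def skip_event_statements(events_to_skip=1):
    """Build the SQL a DBA might run on a stopped replica to skip events.

    Hypothetical helper for illustration only. It returns the statements
    as strings and does not connect to or modify any database.
    """
    if isinstance(events_to_skip, bool) or not isinstance(events_to_skip, int):
        raise ValueError("events_to_skip must be an integer")
    if events_to_skip < 1:
        raise ValueError("events_to_skip must be at least 1")
    return [
        # Make sure both replication threads are stopped before changing state.
        "STOP SLAVE",
        # Skip the next N events from the source when replication restarts.
        f"SET GLOBAL sql_slave_skip_counter = {events_to_skip}",
        # Resume replication from the first event after the skipped ones.
        "START SLAVE",
        # Confirm that both threads are running again.
        "SHOW SLAVE STATUS",
    ]


if __name__ == "__main__":
    for statement in skip_event_statements(1):
        print(statement + ";")
```

Skipping events leaves the replica without whatever those events would have applied, so this approach is only safe when the skipped statements are known to be redundant, as with the duplicated DDL described above.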
We have already decommissioned the database cluster that had the bug in favor of an implementation that does not have it. We have also set up alerting for replication crashes. We had alerting in place for replication lag, but this was technically not a lag event because replication had stopped completely. Additionally, we learned that some of our downstream services are overly sensitive to replication lag and stop events. Although these events are rare, we will investigate where we can reduce the coupling between a service's health and the state of replication.
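The difference between a lag alert and a stop alert can be sketched as follows. This is a minimal example under stated assumptions, not our actual monitoring code: the function name `classify_replica` and the 30-second threshold are hypothetical. The field names come from MySQL's `SHOW SLAVE STATUS` output, where `Seconds_Behind_Master` is `NULL` (here `None`) when replication is not running, so a check that only compares lag against a threshold has no number to compare.

```python
def classify_replica(status, max_lag_seconds=30):
    """Classify a replica as 'stopped', 'lagging', or 'healthy'.

    `status` is a dict holding one row of SHOW SLAVE STATUS output.
    The name and the default threshold are illustrative assumptions.
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")

    # Check for a stop first: with a thread down, lag is reported as NULL,
    # so a threshold comparison alone would never fire.
    if not (io_running and sql_running) or lag is None:
        return "stopped"
    if lag > max_lag_seconds:
        return "lagging"
    return "healthy"


if __name__ == "__main__":
    stopped = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "No",
        "Seconds_Behind_Master": None,
    }
    print(classify_replica(stopped))
```

Treating a stopped replica as its own state, instead of as very large lag, lets it page separately from ordinary lag alerts.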