Delays in Notification Delivery
Incident Report for PagerDuty
Postmortem

Summary

On October 25, 2018, at 18:39 UTC, we experienced an issue with Notification Delivery as well as Web UI rendering. Web UI actions were still being processed, but stale data was displayed to customers for the duration of the incident. No events or notifications were dropped during this time, and the longest notification delay was 11 minutes.

MySQL Replication at PagerDuty

PagerDuty leverages multiple MySQL clusters to support many of our services. To scale out our database capacity, we use MySQL Replication to multiple nodes to distribute reads.

On October 25, 2018, at 18:39 UTC, PagerDuty engineers made a change to our core MySQL database cluster. As part of that change to our database nodes, we ran certain DDL statements that, due to a bug, caused replication to stop. With replication stopped, downstream services such as notifications could no longer process. The issue was mitigated at 19:07 UTC the same day by skipping the duplicated DDL statements affected by the bug, which allowed replication to resume and the downstream services to start working again.
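For readers unfamiliar with what "skipping replication events" can look like, here is a minimal sketch. It assumes classic binary-log-position replication (not GTID-based) on a MySQL version that still uses the `SLAVE` statement names. The report does not say which mechanism was actually used, so the helper name `skip_replication_events_sql` and the approach itself are illustrative assumptions only. The function just builds the statements an operator would run on the stopped replica; it does not connect to a database.

```python
from typing import List


def skip_replication_events_sql(count: int = 1) -> List[str]:
    """Build the statements that skip `count` replication events on a replica.

    Hypothetical helper: assumes non-GTID replication, where MySQL's
    sql_slave_skip_counter system variable is supported.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [
        # The replica threads must be stopped before the counter can be set.
        "STOP SLAVE;",
        # Tell the SQL thread to skip the next `count` events from the source.
        f"SET GLOBAL sql_slave_skip_counter = {count};",
        # Resume replication after the skipped events.
        "START SLAVE;",
    ]


# Usage: print the statements for skipping a single offending event.
for statement in skip_replication_events_sql(1):
    print(statement)
```

Skipping events leaves the replica without whatever those events contained, so this is only appropriate when the skipped statements are known to be redundant, as with the duplicated DDL described above.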

What we are doing about this

We have already decommissioned the database cluster that had the bug in favor of a bug-free implementation. We have also set up alerting for replication crashes. While we had alerting in place for replication lag, this technically was not a lag event, as replication had stopped completely. Additionally, we have learned that some of our downstream services are overly sensitive to replication lag and stop events. While these events happen rarely, we will be investigating where we can reduce this coupling so that services keep operating healthily when replication is lagging or stopped.
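The difference between a lag event and a stop event can be sketched as follows. This is an illustration, not PagerDuty's actual alerting. It assumes the replica status is read from MySQL's `SHOW SLAVE STATUS`, whose `Seconds_Behind_Master` field is `NULL` when the replica SQL thread is not running. The names `ReplicaState` and `classify_replica` and the 30-second threshold are hypothetical.

```python
from enum import Enum
from typing import Mapping, Optional


class ReplicaState(Enum):
    HEALTHY = "healthy"
    LAGGING = "lagging"
    STOPPED = "stopped"


def classify_replica(
    status: Mapping[str, Optional[object]],
    max_lag_seconds: int = 30,  # hypothetical alert threshold
) -> ReplicaState:
    """Classify one row of SHOW SLAVE STATUS output (passed in as a mapping)."""
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")

    # A stopped replica reports no lag value at all, so a check that only
    # compares lag against a threshold never fires. Test for "stopped" first.
    if not (io_running and sql_running) or lag is None:
        return ReplicaState.STOPPED
    if int(lag) > max_lag_seconds:
        return ReplicaState.LAGGING
    return ReplicaState.HEALTHY


# Usage: a replica whose SQL thread has halted on an error.
stopped_row = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "No",
    "Seconds_Behind_Master": None,
}
print(classify_replica(stopped_row).value)
```

Treating "stopped" as its own state, checked before the lag comparison, is what lets an alert distinguish a halted replica from one that is merely behind.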

Posted Nov 06, 2018 - 22:40 UTC

Resolved
We have now fully recovered from issues with Notification Delivery and with performance issues within the Web Application.
Posted Oct 25, 2018 - 19:07 UTC
Update
We are continuing to monitor for any further issues.
Posted Oct 25, 2018 - 18:59 UTC
Monitoring
We are currently monitoring Notification Delivery and are nearing full recovery with the solution that was implemented.
Posted Oct 25, 2018 - 18:53 UTC
Investigating
We are experiencing delays in Notification Delivery for all Accounts. Our engineering teams have identified the root cause and have implemented a solution, which has begun to take effect.
Posted Oct 25, 2018 - 18:50 UTC
This incident affected: Notification Delivery and Web Application.