On Saturday, February 11, from 04:51 UTC to 06:40 UTC, PagerDuty experienced an incident that impacted customers in the US service region. During this time, impacted customers saw out-of-date or missing incident details.
For approximately two hours beginning at 04:51 UTC on February 11, seven replicas in the US service region's production database cluster experienced sporadic replication delays of up to 8 minutes. The replicas were all assigned to the pool responsible for providing incident details.
On February 10 at 16:30 UTC, we started a database migration in this cluster that we expected to take about 120 hours to complete. This is not unusual; large migrations happen about once a month. Hours later, and unrelated to the migration, we decommissioned one database server in the cluster due to a hardware issue.
Our database migration tool, gh-ost, normally pauses itself when it detects lagging replicas. However, after the server was decommissioned earlier in the day, the replication lag detection process used by the migration tool began failing silently.
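For context, gh-ost decides whether to pause by measuring lag against a configured set of control replicas (via flags such as --throttle-control-replicas and --max-lag-millis). The sketch below is hypothetical, written in Python rather than being our tooling or gh-ost's implementation, and uses made-up hostnames and thresholds; it only illustrates how this kind of lag check can fail silently once one of the monitored replicas no longer exists, so the migration never slows down.

```python
# Hypothetical sketch (not PagerDuty's or gh-ost's actual code): how a
# replica-lag throttle check can fail silently. Hostnames and thresholds are made up.

CONTROL_REPLICAS = ["replica-01", "replica-02", "replica-07"]  # replica-07 was decommissioned
MAX_LAG_SECONDS = 1.0


def fetch_lag_seconds(host: str) -> float:
    """Pretend to query a replica for its replication lag.

    A real implementation would run something like SHOW REPLICA STATUS and read
    the reported lag; here we only simulate one host being unreachable.
    """
    if host == "replica-07":
        raise ConnectionError(f"{host} is unreachable")
    return 0.2  # healthy replicas report sub-second lag


def should_throttle() -> bool:
    try:
        lags = [fetch_lag_seconds(host) for host in CONTROL_REPLICAS]
        return max(lags) > MAX_LAG_SECONDS
    except OSError:
        # The silent failure: one unreachable control replica (the decommissioned
        # server) aborts the whole check, the error is swallowed, and the
        # migration never sees lag, so it never pauses itself.
        return False


if __name__ == "__main__":
    print("Throttle migration?", should_throttle())  # False even when real lag exists
```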
The next morning, at 04:51 UTC on February 11, a subset of replicas began exhibiting sporadic replication delays: they would lag for a few minutes, then recover as the migration process intermittently throttled itself. This continued for about 20 minutes before replication lag monitors began alerting. Once alerted, we began diagnosing the problem, which took longer than expected because the lag was limited to a small subset of replicas. We then discovered a correlation between the migration's writes and the replication lag: when migration writes were being applied on this subset of replicas, their replication lag grew. Since PagerDuty reads your incident details from these replicas, whenever a replica lagged, you would see out-of-date or missing incident details.
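To make the correlation concrete: replication lag on a MySQL replica can be read from the Seconds_Behind_Source field of SHOW REPLICA STATUS (Seconds_Behind_Master on versions older than 8.0.22). The sketch below shows the kind of per-replica check a lag monitor performs; it is illustrative only, not our actual monitoring code, and assumes the mysql-connector-python package plus hypothetical hostnames, credentials, and thresholds.

```python
# Illustrative only: per-replica lag check with hypothetical hosts and credentials.
# Assumes the mysql-connector-python package (pip install mysql-connector-python).
import mysql.connector

REPLICAS = ["replica-01.example.internal", "replica-02.example.internal"]
ALERT_THRESHOLD_SECONDS = 120  # hypothetical alerting threshold


def replica_lag_seconds(host: str):
    conn = mysql.connector.connect(host=host, user="monitor", password="secret")
    try:
        cur = conn.cursor(dictionary=True)
        # Use SHOW SLAVE STATUS / Seconds_Behind_Master on MySQL older than 8.0.22.
        cur.execute("SHOW REPLICA STATUS")
        row = cur.fetchone()
        return row["Seconds_Behind_Source"] if row else None
    finally:
        conn.close()


for host in REPLICAS:
    lag = replica_lag_seconds(host)
    if lag is not None and lag > ALERT_THRESHOLD_SECONDS:
        print(f"{host} lagging by {lag}s; reads may return stale incident details")
```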
After determining that the replication lag was being caused by the migration, we safely stopped the migration. Within ten minutes, the replicas processed the backlog of writes, at which point all incident details were again current and viewable by customers.
We have already made the following changes to better handle this situation:
We sincerely apologize for the delayed incident details you or your teams experienced and for the impact this incident had on you. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.