Issue Identified with Displaying Incident Data
Incident Report for PagerDuty
Postmortem

Summary

On Saturday, February 11, 2023, from 04:51 UTC to 06:40 UTC, PagerDuty experienced an incident that impacted customers in the US service region. During this time, impacted customers saw out-of-date or missing incident details.

What Happened

For approximately two hours beginning at 04:51 UTC on February 11, seven replicas in the US service region's production database cluster experienced sporadic replication delays of up to 8 minutes. The replicas were all assigned to the pool responsible for providing incident details.

On February 10 at 16:30 UTC, we started a database migration in this cluster, which we expected to take about 120 hours to complete. Migrations of this size are not unusual; they happen about once a month. Hours later, for reasons unrelated to the migration, we decommissioned one database server in the cluster due to a hardware issue.

Our database migration tool, gh-ost, normally pauses itself when it detects lagging replicas. However, because of the infrastructure change made earlier in the day (the server decommissioning), the replication lag detection process used by the migration tool began failing silently.
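To illustrate the failure mode, here is a minimal sketch of a lag check that fails closed: if any replica's lag cannot be read, the migration is told to pause rather than to continue. This is not gh-ost's implementation or PagerDuty's tooling. The function name, the read_lag_seconds callback, and the 1.5 second threshold are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def should_throttle(
    replicas: Iterable[str],
    read_lag_seconds: Callable[[str], Optional[float]],  # hypothetical lag reader
    max_lag_seconds: float = 1.5,  # hypothetical threshold
) -> Tuple[bool, str]:
    """Decide whether a migration should pause, treating unknown lag as unsafe."""
    hosts = list(replicas)
    if not hosts:
        # An empty replica list means nothing is being checked, so pause.
        return True, "no replicas configured for lag checks"
    for host in hosts:
        try:
            lag = read_lag_seconds(host)
        except Exception as exc:
            # Fail closed: an unreachable replica is reported, not ignored.
            return True, f"{host}: lag check failed ({exc})"
        if lag is None:
            return True, f"{host}: lag unknown"
        if lag > max_lag_seconds:
            return True, f"{host}: lag {lag:.1f}s exceeds {max_lag_seconds:.1f}s"
    return False, "all replicas within lag budget"
```

The design choice is that an error while measuring lag is handled the same way as excessive lag, so a stale replica list pauses the migration loudly instead of disabling the safety check.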

The next morning, at 04:51 UTC on February 11, a subset of replicas began exhibiting sporadic replication delays: they would lag for a few minutes, and then the lag would resolve because the migration process still occasionally throttled itself. This continued for about 20 minutes until our replication lag monitors began alerting. Once alerted, we began diagnosing the problem, which took longer than expected because the lag was limited to a small subset of replicas. We then discovered a correlation between writes from the migration and replication lag: when the migration writes were being processed on that subset of replicas, the replication lag grew. Because PagerDuty reads your incident details from these replicas, whenever a replica lagged, it would present you with out-of-date or missing incident details.
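Lag that appears for a few minutes and then clears can slip past an alert that requires sustained lag. As a rough illustration only, the sketch below alerts when enough samples in a recent window breach a threshold, even if the breaches are not consecutive. The class name, the 60 second threshold, and the window sizes are hypothetical and do not describe PagerDuty's monitors.

```python
from collections import deque


class SporadicLagDetector:
    """Alert when lag breaches a threshold several times within a sliding window."""

    def __init__(self, threshold_seconds: float = 60.0, window: int = 10,
                 min_breaches: int = 3) -> None:
        self.threshold_seconds = threshold_seconds
        self.min_breaches = min_breaches
        # Keep only the most recent `window` breach flags.
        self._breaches = deque(maxlen=window)

    def observe(self, lag_seconds: float) -> bool:
        """Record one lag sample and return True if an alert should fire."""
        self._breaches.append(lag_seconds > self.threshold_seconds)
        return sum(self._breaches) >= self.min_breaches
```

With the defaults above, three separate lag spikes within ten samples trigger an alert even when the lag fully recovers between spikes.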

After determining that the migration was causing the replication lag, we safely stopped the migration. Within ten minutes, the replicas processed the backlog of writes, at which point all incident details were current and viewable by customers.

What Are We Doing About This

We have already made the following changes to better handle this situation:

  • Detecting and updating the replica list gh-ost uses when an infrastructure change is made
  • Alerting sooner when replication begins lagging
  • Enhancing our migration notification process by adding additional graphs that more clearly display the current state of migrations
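As a sketch of the first change, the snippet below compares the replica list a migration was started with against the current server inventory and reports what is stale or missing. It assumes the inventory can be obtained as a list of host names; the function and field names are hypothetical. gh-ost accepts a comma-delimited replica list through its --throttle-control-replicas option, but how PagerDuty actually feeds the updated list to gh-ost is not stated in this report.

```python
from typing import Dict, Iterable, List


def reconcile_replicas(configured: Iterable[str],
                       inventory: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the migration tool's replica list with the live inventory."""
    configured_set = set(configured)
    inventory_set = set(inventory)
    return {
        # Hosts the tool still checks but that no longer exist.
        "stale": sorted(configured_set - inventory_set),
        # Live hosts the tool is not checking for lag.
        "missing": sorted(inventory_set - configured_set),
        # The list the tool should be using after the change.
        "replicas": sorted(inventory_set),
    }


def as_comma_delimited(replicas: Iterable[str]) -> str:
    """Format a replica list as a comma-delimited string."""
    return ",".join(replicas)
```

A non-empty "stale" or "missing" result after an infrastructure change is the signal that the running migration's replica list needs to be refreshed.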

We sincerely apologize for the delayed incident details you or your teams experienced and for the impact this incident had on you. We understand how vital our platform is for our customers. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Mar 08, 2023 - 18:20 UTC

Resolved
We have identified and resolved an incident where PagerDuty customers in the US service regions were experiencing issues with the delayed display of incident data. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Feb 11, 2023 - 06:42 UTC
Update
We are still investigating an incident where PagerDuty customers in the US service regions are experiencing issues with the delayed display of incident data. We will provide further updates within 20 minutes.
Posted Feb 11, 2023 - 06:31 UTC
Update
We continue to investigate an incident where PagerDuty customers in the US service regions are experiencing issues with the delayed display of incident data. We will provide further updates within 20 minutes.
Posted Feb 11, 2023 - 06:11 UTC
Identified
We are investigating an incident where PagerDuty customers in the US service regions are experiencing issues with the delayed display of incident data. We will provide further updates within 20 minutes.
Posted Feb 11, 2023 - 05:50 UTC
This incident affected: Incident Timeline and Alert Logs (Incident Timeline and Alert Logs (US)).