Incidents List
Incident Report for PagerDuty
Postmortem

Summary

On Oct 27, 2020, at 6:26 PM UTC, PagerDuty experienced a major incident due to database replication lag. The lag delayed up-to-date information from appearing in the UI and caused on-call handoff notifications (OCHONs) to be sent in the middle of some shifts rather than at the beginning or end. The underlying cause of the lag was load placed on the primary database server by a one-time job to realign schedules for the then-upcoming shift off daylight saving time.

What Happened

A script to fix schedules for DST-related changes created a greater-than-expected load on our infrastructure, which resulted in delayed replication. The delayed replication further exacerbated the issue by creating a gap in how we measured the start and end times of shifts altered by the script. To restore functionality, we canceled the script and brought in more hosts to reduce replication lag.

What We Are Doing About This

We are addressing multiple contributing factors for this issue. The steps planned or already in progress are:

  • Improved Tooling: We are improving the tooling around running backfill scripts to prevent them from adversely affecting our production database.
  • Improved Shift Calculation: We have changed shift time calculations to rely on an independent time source rather than the current time, to prevent the drift mentioned above.
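
As an illustration of the kind of safeguard the tooling item describes, the following is a minimal sketch of a backfill runner that pauses whenever replication lag exceeds a threshold. It is not PagerDuty's actual tooling: the names `run_backfill`, `apply_batch`, and `get_lag_seconds`, as well as the default batch size and lag threshold, are hypothetical and chosen only for the example.

```python
import time
from typing import Callable, Iterable, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterable[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_backfill(
    items: Sequence[T],
    apply_batch: Callable[[Sequence[T]], None],   # hypothetical: writes one batch
    get_lag_seconds: Callable[[], float],         # hypothetical: reports replica lag
    batch_size: int = 100,                        # assumed default, tune per workload
    max_lag_seconds: float = 5.0,                 # assumed threshold
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply `items` in batches, waiting whenever replica lag is too high.

    Returns the number of times the job paused to let replicas catch up.
    """
    pauses = 0
    for batch in chunked(items, batch_size):
        # Check lag before every batch so a lagging replica holds back new writes.
        while get_lag_seconds() > max_lag_seconds:
            pauses += 1
            sleep(poll_interval)
        apply_batch(batch)
    return pauses
```

In this sketch the lag check and the write are injected as callables, so the same loop could be exercised against a test double before being pointed at a real database. Working in small batches bounds how much load a single step can add between lag checks.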

We’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Nov 23, 2020 - 18:09 UTC

Resolved
A fix has been implemented and we are fully recovered from this incident. We will continue to monitor the situation.
Posted Oct 27, 2020 - 18:41 UTC
Update
We have identified issues in returning recently created incidents from the REST API. We have implemented a fix and are monitoring for signs of recovery.
Posted Oct 27, 2020 - 18:28 UTC
Identified
We have taken countermeasures to speed up processing for the incidents list and are monitoring the impact of our actions.
Posted Oct 27, 2020 - 18:14 UTC
Investigating
We are currently experiencing issues with the list of incidents in the UI. Incidents are still being created and processed. We are investigating.
Posted Oct 27, 2020 - 17:38 UTC
This incident affected: REST API, Web Application, and Mobile Application.