Issue with Global Routing Rules
Incident Report for PagerDuty
Postmortem

Summary

On October 17, 2019 at 3:37 PM UTC, PagerDuty experienced an incident that caused a small subset of Global Event Rulesets to revert to older versions.

As a result, affected accounts would have had events misrouted and/or suppressed. No events were dropped during the incident, and all rulesets were restored by 8:30 PM UTC.

What Happened

A data migration was run against all Global Event Rulesets. Due to a bug in our migration process, some routing keys were unexpectedly pointed to previous versions of their corresponding rulesets. These old versions should not have been present in this datastore, but were unintentionally left behind by a previous backup, and ended up being set as the current version for a small subset of rulesets.

The routing keys were identified and set to point to the correct version of their rulesets via a subsequent migration. By the end of the incident at 8:30 PM UTC, the correct version of every ruleset had been restored.
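The identify-and-repoint step described above can be illustrated with a minimal Python sketch. This is not PagerDuty's actual migration code: the names RulesetVersion, find_stale_pointers, and repoint are hypothetical, and the sketch assumes that each ruleset version carries a number where a higher value means a newer version.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RulesetVersion:
    ruleset_id: str
    version: int  # assumption: a higher number means a newer version


def find_stale_pointers(
    pointers: Dict[str, RulesetVersion],
    versions: Iterable[RulesetVersion],
) -> Dict[str, RulesetVersion]:
    """Map each mis-pointed routing key to the newest version of its ruleset."""
    newest: Dict[str, RulesetVersion] = {}
    # Scan stored versions plus the pointed-to versions, so a pointer whose
    # target is missing from `versions` still has an entry to compare against.
    for v in list(versions) + list(pointers.values()):
        best = newest.get(v.ruleset_id)
        if best is None or v.version > best.version:
            newest[v.ruleset_id] = v
    return {
        key: newest[current.ruleset_id]
        for key, current in pointers.items()
        if current != newest[current.ruleset_id]
    }


def repoint(
    pointers: Dict[str, RulesetVersion],
    stale: Dict[str, RulesetVersion],
) -> Dict[str, RulesetVersion]:
    """Return a corrected copy of the pointers; the input is left untouched."""
    return {**pointers, **stale}


# Hypothetical data: key-a was left pointing at an old version of rs-1.
versions = [
    RulesetVersion("rs-1", 1),
    RulesetVersion("rs-1", 2),
    RulesetVersion("rs-2", 5),
]
pointers = {
    "key-a": RulesetVersion("rs-1", 1),
    "key-b": RulesetVersion("rs-2", 5),
}
stale = find_stale_pointers(pointers, versions)
fixed = repoint(pointers, stale)
```

Returning a corrected copy rather than editing the pointers in place keeps the pre-repair state available for comparison while the fix is being verified.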

What Are We Doing About This

We are fixing the underlying bug in our migration process and purging the leftover copies of old ruleset versions. In addition, we are investigating ways to add monitoring for this type of scenario and to put a rollback mechanism for migrations in place. This will allow us to detect and resolve this issue more quickly if it occurs again.
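As a rough illustration of what a rollback mechanism combined with a post-migration check could look like, the Python sketch below snapshots the data before a migration, runs a health check afterwards, and restores the snapshot if the check fails. It is a sketch under stated assumptions, not a description of our tooling: the store is modeled as a hypothetical in-memory mapping from routing key to current ruleset version number, and migrate_with_rollback and no_version_regressed are illustrative names.

```python
import copy
from typing import Callable, Dict

# Assumption: routing key -> version number of the ruleset it points to.
Store = Dict[str, int]


def no_version_regressed(before: Store, after: Store) -> bool:
    """Health check: every routing key still exists and did not move to an older version."""
    return all(key in after and after[key] >= old for key, old in before.items())


def migrate_with_rollback(
    store: Store,
    migrate: Callable[[Store], None],
    is_healthy: Callable[[Store, Store], bool],
) -> bool:
    """Apply `migrate` in place; restore the snapshot if it fails the check.

    Returns True if the migration was kept, False if it was rolled back.
    """
    snapshot = copy.deepcopy(store)
    try:
        migrate(store)
        if is_healthy(snapshot, store):
            return True
    except Exception:
        # A crash mid-migration is treated like a failed check. A real
        # system would also log the error before rolling back.
        pass
    store.clear()
    store.update(snapshot)
    return False


def buggy_migration(store: Store) -> None:
    # Hypothetical bug: points key-a back at an older version.
    store["key-a"] = 1


store = {"key-a": 3, "key-b": 7}
kept = migrate_with_rollback(store, buggy_migration, no_version_regressed)
```

In this example the buggy migration is rolled back, so the store ends up unchanged and kept is False. The same health check could also run periodically as a monitor, independent of any migration.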

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Oct 23, 2019 - 21:47 UTC

Resolved
To fix the issue, we have restored Global Event Rulesets for affected accounts to their original configuration state as of 11:37 AM EDT (3:37 PM UTC). We have also finished monitoring the results to confirm full resolution.
Posted Oct 17, 2019 - 20:42 UTC
Update
We are in the process of identifying all impacted accounts and restoring their Global Event Rulesets to a healthy state.
Posted Oct 17, 2019 - 19:39 UTC
Identified
We've identified the issue, and our Engineering team is currently working on a fix.
Posted Oct 17, 2019 - 18:14 UTC
Investigating
We're currently experiencing an issue with Global Routing Rules for customer accounts where global events are not being routed correctly. Our Engineering team is currently investigating.
Posted Oct 17, 2019 - 17:54 UTC
This incident affected: Notification Delivery.