On October 17, 2019, at 3:37 PM UTC, PagerDuty experienced an incident that caused a small subset of Global Event Rulesets to revert to older versions.
As a result, affected customers may have had events misrouted or suppressed. No events were dropped during the incident, and all rulesets were restored by 8:30 PM UTC.
A data migration was run against all Global Event Rulesets. Due to a bug in our migration process, some routing keys were unexpectedly pointed to previous versions of their corresponding rulesets. These older versions should not have been present in the datastore; they had been unintentionally left behind by a previous backup, and they were set as the current version for a small subset of rulesets.
The routing keys were identified and set to point to the correct version of their rulesets via a subsequent migration. By the end of the incident at 8:30 PM UTC, the correct version of every ruleset had been restored.
We are fixing the underlying bug in our migration process and purging the cloned rulesets. In addition, we are investigating ways to add monitoring for this type of scenario and to put a rollback mechanism for migrations in place. This will allow us to detect and resolve this kind of issue more quickly if it occurs again.
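As a rough illustration of the kind of check such monitoring might perform, the sketch below flags routing keys whose current ruleset version is older than the latest known version. It is a minimal sketch under stated assumptions: the `RoutingKey` record, its fields, and the `latest_versions` mapping are hypothetical names chosen for this example and do not reflect PagerDuty's actual data model or tooling.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RoutingKey:
    """Hypothetical record linking a routing key to a ruleset version."""
    key: str
    ruleset_id: str
    current_version: int


def find_stale_routing_keys(
    routing_keys: Iterable[RoutingKey],
    latest_versions: Dict[str, int],
) -> List[RoutingKey]:
    """Return routing keys that point to an older version than the latest known one.

    `latest_versions` maps a ruleset ID to its newest version number
    (an assumed source of truth for this illustration).
    """
    stale = []
    for rk in routing_keys:
        latest = latest_versions.get(rk.ruleset_id)
        # Rulesets with no known latest version are skipped rather than flagged.
        if latest is not None and rk.current_version < latest:
            stale.append(rk)
    return stale
```

Running a check like this immediately after a migration, and alerting when the result is non-empty, is one way a reverted ruleset could be noticed early.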
We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.