On Monday, September 20th, from 21:44 UTC until 22:00 UTC, PagerDuty experienced an incident that delayed the delivery of status update notifications and prevented customers from viewing or modifying incident subscribers. During this time, status update notifications were delayed by up to 15 minutes. Additionally, status update notification subscriptions could not be viewed or modified via PagerDuty’s web and mobile applications. All other types of notifications (on-call handoff notifications, assignment notifications, and responder requests) were unaffected.
The incident began when we started removing an unused database cluster. That cluster was linked to another cluster that was still in use by the system, and the removal caused active instances in the in-use cluster to be decommissioned. Restarting the database instances that serve the application restored functionality to the service. No data loss occurred.
Process improvements: We are reviewing our operations documentation to ensure proper configuration when performing operational tasks. We are also instituting stricter review policies for such operations.
Tooling: Teams across PagerDuty are collaborating to build automation that will prevent the execution of such operations under certain conditions.
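As a rough illustration of what such a safeguard might look like, the sketch below refuses to remove a cluster that is in use or linked to an in-use cluster. It is not PagerDuty's actual tooling: the `Cluster` record, the `linked_to` field, and the function names are all hypothetical, and a real check would query the infrastructure provider rather than an in-memory inventory.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical inventory record for a database cluster."""
    name: str
    in_use: bool
    linked_to: list[str] = field(default_factory=list)  # names of linked clusters


def removal_blockers(target: str, inventory: dict[str, Cluster]) -> list[str]:
    """Return the reasons a removal should be refused; an empty list means none were found."""
    cluster = inventory[target]
    blockers = []
    if cluster.in_use:
        blockers.append(f"{target} is still in use")
    for name, other in inventory.items():
        if name == target:
            continue
        # Treat a link declared on either side as a link between the two clusters.
        linked = name in cluster.linked_to or target in other.linked_to
        if linked and other.in_use:
            blockers.append(f"{target} is linked to in-use cluster {name}")
    return blockers


def remove_cluster(target: str, inventory: dict[str, Cluster]) -> None:
    """Remove the cluster from the inventory only if no blockers are found."""
    blockers = removal_blockers(target, inventory)
    if blockers:
        raise RuntimeError("refusing to remove: " + "; ".join(blockers))
    del inventory[target]
```

In this sketch, an unused cluster that is still linked to an active one is rejected before any destructive step runs, which is the condition described in this incident.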
We’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.