Issue with Alert Suppression Rules
Incident Report for PagerDuty
Postmortem

Summary

On Tuesday, April 11th between 15:20 and 19:05 UTC, a subset of customers on the Free, Professional, and Business plans lost access to alert suppression, dynamic field enrichment extraction, change events, and the Visibility Console. During this time, affected customers were unable to use these features via the web application, the mobile applications, or the REST API. Event processing and notification delivery were not impacted.

What Happened

As part of the launch of PagerDuty AIOps, we rolled out a change to our pricing and packaging, where features that had been previously available to Free, Professional, and Business plans would now be exclusively bundled into the new AIOps package. Although access to these features (alert suppression, dynamic field enrichment extraction, change events, and the Visibility Console) would be removed from the plans, customers who were active users of the impacted features would retain legacy access.

On Tuesday, April 11th at 15:20 UTC, the package changes were pushed to production. By 15:51 UTC, we received the first customer reports of missing features. At 16:13 UTC, the team responsible for rolling out the package changes began investigating, starting with a report that a customer had lost access to alert suppression. This account was not in our legacy access list, and the initial investigation indicated that only a very small subset of customer accounts using alert suppression were impacted. At 17:32 UTC, we ran a script to restore access to the accounts we had identified as impacted. We continued to receive reports of issues, including problems with our Datadog widget and users potentially being logged out of the mobile app. The respective teams investigated these issues separately, not yet knowing that they were related to the loss of legacy access. At 18:12 UTC, we received another customer report of lost feature access, this time from a customer on our legacy access list.

At 18:39 UTC, we spun up a major incident call to coordinate our investigation, and responders saw that the issue was more widespread than initially determined. We quickly found that the logic in the legacy access script that granted access to the affected features was missing a key step needed to fully propagate the change across all of our systems. After implementing a fix, we re-ran the script, and by 19:05 UTC, impacted customers had regained access to all features.
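A grant that reaches some systems but not others, as described above, is the kind of gap a post-run consistency check can surface. The following is a minimal sketch of such a check, not PagerDuty's actual tooling. The feature keys, system names, account IDs, and the function `find_propagation_gaps` are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical shapes, assumed for illustration:
#   legacy_grants: feature key -> account IDs that should retain legacy access
#   entitlements:  system name -> feature key -> account IDs with access there
LegacyGrants = Dict[str, Set[str]]
Entitlements = Dict[str, Dict[str, Set[str]]]


def find_propagation_gaps(
    legacy_grants: LegacyGrants,
    entitlements: Entitlements,
) -> List[Tuple[str, str, str]]:
    """Return (account, feature, system) triples where a legacy grant
    has not reached a given system."""
    gaps: List[Tuple[str, str, str]] = []
    for system in sorted(entitlements):
        by_feature = entitlements[system]
        for feature in sorted(legacy_grants):
            expected = legacy_grants[feature]
            # A feature absent from a system counts as enabled for nobody.
            enabled = by_feature.get(feature, set())
            for account in sorted(expected - enabled):
                gaps.append((account, feature, system))
    return gaps


if __name__ == "__main__":
    # Made-up example data: acct-2 was granted access in "web"
    # but the grant never reached "rest_api".
    legacy_grants = {"alert_suppression": {"acct-1", "acct-2"}}
    entitlements = {
        "web": {"alert_suppression": {"acct-1", "acct-2"}},
        "rest_api": {"alert_suppression": {"acct-1"}},
    }
    for gap in find_propagation_gaps(legacy_grants, entitlements):
        print(gap)
```

Running a check like this after the grant script, and treating any non-empty result as a failed rollout, would flag a partially propagated grant before customers report it.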

What Are We Doing About This

After this incident, our teams held a thorough incident review to ensure we have additional controls in place for this type of packaging change. In particular, we will be reviewing the parts of our release process that cover identifying impacted and legacy-eligible accounts and rolling out packaging changes, and we will add enhanced monitoring around rollouts. We have already fixed the root cause of the failure in the legacy-enablement script, as well as the inadvertent logout of users from the mobile app.

Our teams will conduct a thorough review of potentially impacted accounts to ensure that no further customers are missing access to features.

We sincerely apologize to the customers who lost access to PagerDuty functionality for the duration of the incident. We understand how vital our platform is for our customers, and we apologize for the impact this incident had on you and your teams. If you have any questions, please reach out to support@pagerduty.com.

Posted Apr 18, 2023 - 22:31 UTC

Resolved
We have resolved an incident related to Alert Suppression Rules for all PagerDuty customers in both the US and EU service regions. There is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Apr 11, 2023 - 19:11 UTC
Update
We have identified the cause of the incident where some Alert Suppression Rules may not be working as expected. We will provide further updates within 20 minutes.
Posted Apr 11, 2023 - 19:05 UTC
Identified
We are currently investigating an issue where some customers' Alert Suppression Rules may not work as expected.
Posted Apr 11, 2023 - 18:52 UTC
This incident affected: Notification Delivery (Notification Delivery (US), Notification Delivery (EU)).