On Tuesday, April 11th between 15:20 and 19:05 UTC, a subset of customers on the Free, Professional, and Business plans lost access to alert suppression, dynamic field enrichment extraction, change events, and the Visibility Console. During this time, affected customers were unable to access or use these features via the web and mobile applications or the REST API. Event processing and notification delivery were not impacted.
As part of the launch of PagerDuty AIOps, we rolled out a change to our pricing and packaging in which features previously available on the Free, Professional, and Business plans would be bundled exclusively into the new AIOps package. Although these features (alert suppression, dynamic field enrichment extraction, change events, and the Visibility Console) would be removed from those plans, customers who were actively using them would retain legacy access.
On Tuesday, April 11th at 15:20 UTC, the package changes were pushed to production. By 15:51 UTC, we had received the first customer reports of missing features. At 16:13 UTC, the team responsible for rolling out the package changes began investigating, starting with a report that a customer had lost access to alert suppression. This account was not on our legacy access list, and the initial investigation indicated that only a very small subset of customer accounts using alert suppression were impacted. At 17:32 UTC, we ran a script to restore access to the impacted accounts. We continued to receive reports of issues, including problems with our Datadog widget and users potentially being logged out of the mobile app. The respective teams investigated these issues separately, not yet knowing that they were related to the loss of legacy access. At 18:12 UTC, we received another customer report of lost feature access, this time from a customer on our legacy access list.
At 18:39 UTC, we spun up a major incident call to coordinate our investigation, and responders noticed that the issue was more widespread than initially determined. We quickly found that the logic in the legacy access script that granted access to the affected features was missing a key step needed to fully propagate the change across all of our systems. After implementing a fix, we re-ran the script, and by 19:05 UTC, impacted customers had regained access to all features.
What Are We Doing About This
After this incident, our teams held a thorough incident review to ensure we have additional controls in place for this type of packaging change. In particular, we will be reviewing the parts of our release process that cover identifying impacted and legacy-eligible accounts and rolling out packaging changes, including enhanced monitoring around rollouts. We have already resolved the root cause of the legacy-enablement script failure as well as the inadvertent logout of users from the mobile app.
Our teams will conduct a thorough review of potentially impacted accounts to ensure that no further customers are missing access to features.
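For readers who want a concrete picture of what such an account review can look like, here is a minimal sketch of a post-rollout audit. It is an illustration only and does not describe our internal tooling: the function name `find_missing_access`, the feature keys, the system names, and the data shapes are all hypothetical. The idea is simply to compare the features each legacy-eligible account should retain against what each system reports as granted, so that a change that did not fully propagate is caught early.

```python
# Hypothetical sketch: none of these names reflect real PagerDuty systems or APIs.
from typing import Dict, List, Set


def find_missing_access(
    legacy_accounts: Dict[str, Set[str]],
    entitlements_by_system: Dict[str, Dict[str, Set[str]]],
) -> Dict[str, Dict[str, List[str]]]:
    """Return, per account and per system, the expected features that are not granted.

    legacy_accounts maps an account id to the features it should retain.
    entitlements_by_system maps a system name to {account id: granted features}.
    """
    gaps: Dict[str, Dict[str, List[str]]] = {}
    for account_id, expected in legacy_accounts.items():
        for system, granted_by_account in entitlements_by_system.items():
            # An account absent from a system is treated as having no grants there.
            granted = granted_by_account.get(account_id, set())
            missing = expected - granted
            if missing:
                gaps.setdefault(account_id, {})[system] = sorted(missing)
    return gaps


if __name__ == "__main__":
    # Assumed sample data: the grant reached "web" but only partly reached "rest_api".
    legacy = {
        "acct-1": {"alert_suppression", "change_events"},
        "acct-2": {"visibility_console"},
    }
    systems = {
        "web": {
            "acct-1": {"alert_suppression", "change_events"},
            "acct-2": {"visibility_console"},
        },
        "rest_api": {"acct-1": {"alert_suppression"}},
    }
    for account_id, by_system in find_missing_access(legacy, systems).items():
        print(account_id, by_system)
```

An empty result means every listed account has its expected features in every system checked. Any non-empty entry names the account, the system, and the features that still need to be granted there.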
We sincerely apologize to the customers who lost access to PagerDuty functionality for the duration of the incident. We understand how vital our platform is for our customers, and we apologize for the impact this incident had on you and your teams. If you have any questions, please reach out to email@example.com.