Between 12:16 and 12:38 UTC on Dec 22, 2021, PagerDuty experienced an incident during which approximately 10% of actions taken in the PagerDuty iOS and Android mobile apps failed.
Only PagerDuty’s mobile apps were affected by the incident. The notification pipeline, Events and REST APIs, and the Web UI continued to work as normal.
The incident occurred during a period of high load related to a power outage at a major cloud provider, which significantly increased the volume of mobile responder actions going through PagerDuty. The increased volume exceeded the capacity of the part of our backend fleet that is dedicated to serving the mobile apps. Since the mobile apps only use that dedicated part of the fleet, other parts of PagerDuty were unaffected.
While we monitor for mobile app errors, the monitor had a bug: it counted only “500” error responses, not all 500-class errors (such as 502 and 504). The capacity issue manifested as a brief spike of 500 error responses from the mobile backend, followed by sustained 504 errors, so to the on-call it appeared as though the issue had resolved itself immediately after the initial spike.
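To illustrate the difference between the two counting rules, here is a minimal sketch in Python. It is not our actual monitor: the function names and the sample status codes are hypothetical and chosen only to mirror the pattern described above (a brief spike of 500s followed by sustained 504s).

```python
def count_exact_500(status_codes):
    """Count only responses whose status is exactly 500 (the narrow rule)."""
    return sum(1 for code in status_codes if code == 500)


def count_server_errors(status_codes):
    """Count every 500-class response, i.e. any status from 500 to 599."""
    return sum(1 for code in status_codes if 500 <= code <= 599)


# Hypothetical sample: two 500s up front, then 504s mixed with successes.
sample = [200, 500, 500, 200, 504, 504, 504, 200, 504]

exact = count_exact_500(sample)        # sees only the initial spike
all_5xx = count_server_errors(sample)  # also sees the sustained 504s
print(exact, all_5xx)
```

With the narrow rule, the error count stops growing once the 500s give way to 504s, which is why the issue can look resolved while requests are still failing.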
The request volume dropped below the capacity limit as the load peak from the cloud provider incident leveled out, at which point PagerDuty service was restored. Regardless, once we were aware of the capacity issue, we scaled up our backend fleet to prevent a recurrence.
We strongly believe in learning from failure when it does happen. Our engineering team’s focus is on prevention and mitigation. To improve our reliability in the future, we have or will:
For any questions, comments, or concerns, please reach out to firstname.lastname@example.org.