Intermittent failure of mobile app actions
Incident Report for PagerDuty
Postmortem

Summary

Between 12:16 and 12:38 UTC on Dec 22, 2021, PagerDuty experienced an incident during which approximately 10% of actions taken in the PagerDuty iOS and Android mobile apps failed. 

Only PagerDuty’s mobile apps were affected by the incident. The notification pipeline, Events and REST APIs, and the Web UI continued to work as normal.

What Happened

The incident occurred during a period of high load related to a power outage at a major cloud provider, which significantly increased the volume of mobile responder actions going through PagerDuty. The increased volume exceeded capacity in the part of our backend fleet which is dedicated to serving the mobile apps. Since the mobile apps only use that dedicated part of the fleet, other parts of PagerDuty were unaffected.

While we monitor for mobile app errors, the monitor had a bug: it counted only “500” error responses rather than all 500-class errors (such as 502 and 504). The capacity issue manifested as a brief spike of 500 error responses from the mobile backend, followed by sustained 504 errors, so to the on-call engineer it appeared as though the issue had resolved itself immediately after the initial spike.
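To illustrate the class of bug described above, here is a minimal sketch in Python. It is not PagerDuty's actual monitor; the function names and the sample data are hypothetical and chosen only to show how an exact match on status 500 can miss sustained 504 responses.

```python
def error_rate_exact_500(status_codes):
    """Buggy variant: treats only status 500 as an error."""
    total = len(status_codes)
    if total == 0:
        return 0.0
    errors = sum(1 for code in status_codes if code == 500)
    return errors / total


def error_rate_5xx(status_codes):
    """Fixed variant: treats every 500-class status (500-599) as an error."""
    total = len(status_codes)
    if total == 0:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / total


# Hypothetical sample window: 90 successful responses, 10 gateway timeouts.
sample = [200] * 90 + [504] * 10

buggy_rate = error_rate_exact_500(sample)  # sees no errors at all
fixed_rate = error_rate_5xx(sample)        # sees the 504 responses
```

With this sample, the exact-match variant reports an error rate of zero while the range-based variant reports the 504 responses, which matches the pattern in which the alert appeared to clear after the initial spike.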

The request volume dropped below the capacity limit as the load peak from the cloud provider incident leveled out, at which point PagerDuty service was restored. Regardless, once we were aware of the capacity issue, we scaled up our backend fleet to prevent a recurrence. 

What We’re Doing

We strongly believe in learning from failure when it does happen. Our engineering team’s focus is on prevention and mitigation. To improve our reliability in the future, we have done or will do the following:

  • Fix our monitoring to account for all 500-class errors, and to avoid false positives which may have led to the on-call engineer dismissing the alert as normal
  • Ensure we are properly monitoring capacity on our mobile backend fleet
  • Implement auto-scaling of the mobile backend fleet, as part of our upcoming migration from Nomad to Kubernetes
  • Proactively monitor the health of PagerDuty systems during high-load Internet outages that affect many PagerDuty customers
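
As a rough illustration of the capacity-monitoring and scaling items above, the following Python sketch computes fleet utilization and the instance count needed to stay under a target utilization. It is an assumption-laden example, not a description of PagerDuty's tooling: the function names, the per-instance throughput figure, and the 80% threshold are all hypothetical.

```python
import math


def capacity_utilization(current_rps, instance_count, rps_per_instance):
    """Fraction of total fleet capacity currently in use."""
    capacity = instance_count * rps_per_instance
    if capacity <= 0:
        raise ValueError("fleet capacity must be positive")
    return current_rps / capacity


def should_alert(utilization, threshold=0.8):
    """Alert before the fleet saturates (threshold is a hypothetical value)."""
    return utilization >= threshold


def instances_needed(current_rps, rps_per_instance, target_utilization=0.8):
    """Smallest instance count that keeps utilization at or below the target."""
    return math.ceil(current_rps / (rps_per_instance * target_utilization))


# Hypothetical numbers: 10 instances handling 100 requests/second each.
utilization = capacity_utilization(current_rps=900, instance_count=10, rps_per_instance=100)
```

Alerting on utilization rather than on errors alone gives the on-call engineer a signal before requests start failing, which is the gap the action items above aim to close.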

For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Jan 06, 2022 - 22:22 UTC

Resolved
Between 12:16 and 12:38 UTC on Dec 22, PagerDuty experienced an incident during which 10% of actions on the PagerDuty mobile apps would fail, either immediately or by timing out. The incident was associated with a peak in load resulting from many PagerDuty customers simultaneously experiencing incidents.

At the same time, we also temporarily disabled the Past Incidents feature to prevent delays to core incident response functionality.

The rest of the PagerDuty web interface, the PagerDuty REST API, event ingestion, and notifications were not affected by this incident.

Following the incident, we have scaled up our mobile app backend, and re-enabled Past Incidents. As part of the incident investigation we have already identified opportunities to better detect and respond to load related to major internet outages via our monitoring and procedures.
Posted Dec 22, 2021 - 12:30 UTC