PagerDuty Functionality Issues
Incident Report for PagerDuty
Postmortem

Summary:

On Saturday, August 21st, between 10:40 UTC and 13:02 UTC, we experienced a 2 hour 22 minute SEV-1 in the US Service Region. During this incident, 100% of REST API and PagerDuty website requests were rejected for 40 minutes, 0.2% of notifications were dropped, 17.1% of notifications were delivered out of SLA, and a significant number of events were throttled. Our EU Service Region was not affected.

What Happened:

The primary database node in our web cluster experienced a disk failure starting at 10:40 UTC, which caused the file system to enter a read-only state. With the disk unwritable, the database process on the host crashed. PagerDuty has alternative mechanisms to alert incident responders to an issue even when PagerDuty itself is down. However, due to PagerDuty’s unavailability and a breakdown in our alternative alerting processes, our incident response took longer than anticipated to bring our systems back to a stable state, which delayed resolution.

The database was recovered at 11:20 UTC, but the SEV-1 continued until 13:02 UTC (1 hour and 42 minutes) while our systems processed the backlog of events that had arrived during the outage. During this period, 17.1% of notifications were delivered out of SLA, but the REST API and PagerDuty website were working normally. Events continued to be throttled until 15:40 UTC, at which point we had recovered fully and the incident was resolved.

What we are doing about this:

We are communicating directly with customers impacted by this incident.

We are updating our alerts and monitoring to enable quicker notifications regarding these types of hardware failures, which will greatly reduce their impact.

Simultaneously, we are reviewing and further improving our fallback scenarios for the extremely rare case in which PagerDuty is not available.

We understand how important and critical our platform is for our customers. We apologize for the impact this incident had on you and your teams. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Aug 30, 2021 - 20:28 UTC

Resolved
We have processed all stale events and are considering this incident fully resolved.
Posted Aug 21, 2021 - 15:28 UTC
Update
We are making good progress with processing stale events affected by the incident and will share another update once finished. All new events have been getting processed at full speed after our systems recovered.
Posted Aug 21, 2021 - 15:14 UTC
Update
Our systems are now in full recovery, but we are still processing a number of stale events and continuing to monitor.
Posted Aug 21, 2021 - 12:34 UTC
Monitoring
We are currently in the recovery stage, and our systems are still catching up on unprocessed events. We are also monitoring the effectiveness of the fix.
Posted Aug 21, 2021 - 12:03 UTC
Investigating
The Engineering Team has picked up an issue with the PagerDuty Platform, which is back up now. We are actively working to resolve the underlying issue. We plan to have everything fixed as soon as possible and will update you accordingly.
Posted Aug 21, 2021 - 11:55 UTC
This incident affected: REST API (REST API (US)), Web Application (Web Application (US)), and Events API (Events API (US)).