On Saturday, August 21st, between 10:40 UTC and 13:02 UTC, we experienced a 2 hour 22 minute SEV-1 in the US Service Region. During this incident, 100% of REST API and PagerDuty website requests were rejected for 40 minutes, 0.2% of notifications were dropped, 17.1% of notifications were delivered out of SLA, and a significant number of events were throttled. Our EU Service Region was not affected.
The primary database node in our web cluster experienced a disk failure starting at 10:40 UTC, which caused the file system to enter a read-only state. Because the disk could no longer be written to, the database process on the host crashed. PagerDuty has alternative mechanisms to alert incident responders to an issue even when PagerDuty itself is down. However, because PagerDuty was unavailable and our alternative alerting processes broke down, the incident response took longer than anticipated to bring our systems back to a stable state.
The database was recovered at 11:20 UTC, but the SEV-1 continued until 13:02 UTC (a further 1 hour and 42 minutes) while our systems processed the backlog of events that had arrived during the outage. During this period, 17.1% of notifications were delivered out of SLA, but the REST API and PagerDuty website were working normally. Event throttling continued after the SEV-1 ended, until 15:40 UTC, at which point we had recovered fully and the incident was resolved.
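The durations quoted above follow from the four timestamps in this summary. The short Python sketch below is illustrative only: the phase names are labels chosen for this example rather than internal terminology, and it assumes the 40-minute request-rejection window runs from the disk failure to the database recovery.

```python
from datetime import timedelta

# Times of day (UTC) taken from the summary above, stored as offsets from midnight.
disk_failure = timedelta(hours=10, minutes=40)
database_recovered = timedelta(hours=11, minutes=20)
sev1_ended = timedelta(hours=13, minutes=2)
throttling_ended = timedelta(hours=15, minutes=40)

# Phase names are illustrative labels, not official terms.
timeline = {
    # Assumed to match the window in which API and website requests were rejected.
    "requests_rejected": database_recovered - disk_failure,
    # Backlog processing while the SEV-1 was still open.
    "backlog_processing": sev1_ended - database_recovered,
    # Total SEV-1 duration.
    "sev1_total": sev1_ended - disk_failure,
    # Residual event throttling after the SEV-1 ended.
    "throttling_after_sev1": throttling_ended - sev1_ended,
}

for phase, duration in timeline.items():
    print(f"{phase}: {duration}")
```

The first two phases add up to the SEV-1 total, which is a quick consistency check on the reported figures.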
What we are doing about this:
We are communicating directly with customers impacted by this incident.
We are updating our alerts and monitoring so that we are notified of these types of hardware failures more quickly, which will greatly reduce their impact.
At the same time, we are reviewing and further improving our fallback procedures for the extremely rare case in which PagerDuty is not available.
We understand how important and critical our platform is for our customers. We apologize for the impact this incident had on you and your teams. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to firstname.lastname@example.org.