Issues with events ingestion, UI and REST API
Incident Report for PagerDuty
Postmortem

Summary

On December 15th, 2021, from 00:22 UTC to 03:14 UTC, PagerDuty's web UI, mobile UI, and events API experienced an incident that impacted our ability to ingest events and deliver notifications in a timely manner. This issue impacted both our US and EU service regions simultaneously.

By 01:30 UTC December 15th, most functionality in the US service region was restored, and in the EU service region, most functionality was restored by approximately 02:00 UTC. Full restoration of all functionality across both regions was in place by 03:14 UTC.

What Happened?

On December 15th, 2021 at 00:17 UTC, we deployed a DNS configuration change in PagerDuty’s infrastructure that impacted our container orchestration cluster. The change contained a defect that we did not detect in our testing environments. The defect immediately left all services running in the container orchestration cluster unable to resolve DNS.

Internal monitoring caught the issues within one minute and the Engineering team mobilized a major incident. The probable cause was identified as a dnsmasq misconfiguration. The incident response team deployed a fix to a subset of our production environment and, once it was verified there, rolled the fix out to the other production environments sequentially, verifying each one before moving to the next.
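The sequential rollout described above can be sketched roughly as follows. This is an illustration only, not PagerDuty's actual tooling: the function name `staged_rollout`, the environment names, and the `apply_fix` and `verify` callbacks are all hypothetical.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    environments: Iterable[str],
    apply_fix: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Apply a fix to one environment at a time, verifying each in turn.

    Returns the environments that were fixed and verified. Stops at the
    first environment whose verification fails, leaving later ones untouched.
    """
    completed: List[str] = []
    for env in environments:
        apply_fix(env)
        if not verify(env):
            # Halt so an ineffective fix does not spread any further.
            break
        completed.append(env)
    return completed


if __name__ == "__main__":
    # Hypothetical environment names; the callbacks stand in for real
    # deployment and health-check tooling.
    done = staged_rollout(
        ["subset", "us", "eu"],
        apply_fix=lambda env: print(f"deploying fix to {env}"),
        verify=lambda env: True,
    )
    print(done)
```

Ordering the list so that the smallest environment comes first limits the impact of a fix that turns out not to work.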

What Are We Doing About This?

We strongly believe in learning from failure when it does happen. Our engineering team’s focus is on prevention and mitigation. We will be expanding our automated infrastructure testing processes to include enhanced testing of configuration changes in the pre-deploy phase and improving our canary capabilities to reduce risk and validate new changes. For any questions, comments, or concerns, please reach out to support@pagerduty.com.
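As a minimal sketch of what a pre-deploy or canary DNS check could look like, assuming the check runs on a host that already has the new configuration applied: the function names, the hostnames, and the success threshold below are hypothetical and do not describe PagerDuty's actual test suite.

```python
import socket
from typing import Callable, Dict, Iterable


def default_resolver(hostname: str) -> bool:
    """Return True if the system resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def check_dns(
    hostnames: Iterable[str],
    resolver: Callable[[str], bool] = default_resolver,
) -> Dict[str, bool]:
    """Resolve each hostname and record whether the lookup succeeded."""
    return {name: resolver(name) for name in hostnames}


def canary_passes(results: Dict[str, bool], min_success_ratio: float = 1.0) -> bool:
    """Decide whether the canary is healthy enough to continue the rollout."""
    if not results:
        # No evidence at all is treated as a failure, not a pass.
        return False
    successes = sum(results.values())
    return successes / len(results) >= min_success_ratio


if __name__ == "__main__":
    # Hypothetical names: one internal-style name and one external name.
    results = check_dns(["localhost", "example.com"])
    print(results, canary_passes(results))
```

Making the resolver injectable keeps the check testable without network access, and covering both internal and external names helps catch a configuration that breaks only one of the two resolution paths.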

Posted Dec 23, 2021 - 23:02 UTC

Resolved
We have confirmed full recovery of our systems - this incident is now resolved.
Posted Dec 15, 2021 - 03:15 UTC
Update
We have confirmed the recovery of our systems in both the EU and US service regions, and will continue to monitor their recovery.
Posted Dec 15, 2021 - 03:04 UTC
Monitoring
We have confirmed the recovery of all of our services in the US service region, and are also observing strong signs of recovery in the EU region, where the fix has now been fully deployed. We will continue sharing updates as we monitor the state of our systems.
Posted Dec 15, 2021 - 02:40 UTC
Update
We are observing strong signs of recovery in the US service region and some initial signs in the EU service region. We will continue deploying the fix for the EU region and monitoring the progress of the recovery.
Posted Dec 15, 2021 - 02:22 UTC
Update
The deployment of the fix has completed in the US service region and we are monitoring the state of our services to confirm correct recovery. We are in the process of rolling out the fix in the EU service region.
Posted Dec 15, 2021 - 01:51 UTC
Update
We are seeing signs of recovery in the US service region but have yet to observe full resolution of the issue. We are continuing the rollout of the fix.
Posted Dec 15, 2021 - 01:29 UTC
Identified
We have identified the current behaviour as related to a DNS issue and are rolling out a fix.
Posted Dec 15, 2021 - 01:04 UTC
Investigating
We are observing issues with events ingestion, application UI and web API. We are investigating the potential root cause.
Posted Dec 15, 2021 - 00:39 UTC
This incident affected: Events API (Events API (US), Events API (EU)), REST API (REST API (US), REST API (EU)), Web Application (Web Application (US), Web Application (EU)), and Mobile Application (Mobile Application (US), Mobile Application (EU)).