On December 15th, 2021 from 00:22 UTC to December 15th, 03:14 UTC, PagerDuty's web UI, mobile UI, and events API experienced an incident that impacted our ability to ingest events and deliver notifications in a timely manner. This issue impacted both our US and EU service regions simultaneously.
By 01:30 UTC December 15th, most functionality in the US service region was restored, and in the EU service region, most functionality was restored by approximately 02:00 UTC. Full restoration of all functionality across both regions was in place by 03:14 UTC.
On December 15th, 2021 at 00:17 UTC, we deployed a DNS configuration change in PagerDuty’s infrastructure that impacted our container orchestration cluster. The change contained a defect, that we did not detect in our testing environments, which immediately caused all services running in the container orchestration cluster to be unable to resolve DNS.
Internal monitoring caught the issues within one minute and the Engineering team mobilized a major incident. A probable cause was identified as dnsmasq misconfiguration. The incident response team deployed a fix to a subset of our production environment and, once verified, the fix was rolled out to the other production environments sequentially with verification for each.
We strongly believe in learning from failure when it does happen. Our engineering team’s focus is on prevention and mitigation. We will be expanding our automated infrastructure testing processes to include enhanced testing of configuration changes in the pre-deploy phase and improving our canary capabilities to reduce risk and validate new changes. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org.