Accessibility Issues
Incident Report for PagerDuty

(Cross-posted from the postmortem published by Tim Armandpour, SVP, Product Development, on our blog):

Following up on our blog post from Monday, we wanted to share the actions we will be taking based on our initial root cause analysis.

Primary and Secondary Root Causes

As we looked through our timeline of events during the outage, we discovered that there were two issues:

  1. Our failover approach to DNS problems.
  2. The quality of monitoring used to assess the end-to-end customer experience.

As we have talked about in the past, we prefer to design our systems in a multi-master architecture, as opposed to a failover architecture, to achieve continuous availability. This approach, while requiring significant systems design investment, has several benefits: predictable capacity in degraded scenarios, forcing increased automation, and making incremental changes easier and safer. However, we did not have a multi-master architecture in place for our DNS systems. Instead, we required a manual failover to a secondary provider during the outage.
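The "predictable capacity" benefit is easy to quantify. Here is a minimal sketch using hypothetical numbers, not our actual provisioning: with N active masters, losing one redistributes its traffic across the N-1 survivors, so each node needs a known, fixed amount of headroom rather than an unproven failover path.

```python
# Illustrative only -- not PagerDuty's actual capacity model. With N
# active masters, one failure shifts that node's share onto the N-1
# survivors, so each node's load rises by a factor of N / (N - 1).

def degraded_load_factor(n_masters: int, failures: int = 1) -> float:
    """Per-node load multiplier after `failures` masters drop out."""
    survivors = n_masters - failures
    if survivors <= 0:
        raise ValueError("no surviving masters")
    return n_masters / survivors

# With 3 masters, losing one raises per-node load by 1.5x -- a number
# you can provision for and test continuously, unlike a cold standby
# whose real capacity is only discovered during an incident.
print(degraded_load_factor(3))  # 1.5
```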

Measuring the end-to-end customer experience is always a challenge in the midst of DNS problems. After all, if a customer cannot talk to your systems, how can you tell what their experience is? We rely heavily on monitoring and alerting on every part of PagerDuty's services, and we have teams of engineers dedicated to making sure that each part of the customer experience is what our customers expect. During this outage, however, we were unable to properly diagnose customer-facing problems because customers could not reach our systems. This led to increased resolution time for our customers.

Follow-up Actions

In the coming weeks, we will be making several enhancements to our infrastructure, processes, and automation. These enhancements will help decrease the chance of a system-wide outage from the root causes identified above.

Taking a Multi-Master Approach towards DNS

Our top priority, already underway, is redesigning and implementing a new DNS architecture that leverages multiple DNS providers in a multi-master approach. We are updating our internal tooling and automation to ensure that both our external, customer-facing DNS records and our internal servers leverage multiple DNS providers.
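At the zone level, multi-master delegation amounts to listing authoritative servers from independent providers in the zone's NS set, so resolvers can use either provider with no manual failover step. The domain and provider hostnames below are placeholders, not our actual configuration:

```
; Hypothetical zone fragment -- names are illustrative only.
; Authoritative service is split across two independent providers;
; resolvers will retry across the full NS set on their own.
example-pd.com.  86400  IN  NS  ns1.provider-a.net.
example-pd.com.  86400  IN  NS  ns2.provider-a.net.
example-pd.com.  86400  IN  NS  ns1.provider-b.org.
example-pd.com.  86400  IN  NS  ns2.provider-b.org.
```

The operational cost of this design is keeping the record data synchronized at both providers, which is why the tooling and automation work above goes hand in hand with the architecture change.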

Auditing all DNS TTLs

We have multiple endpoints that our customers use to interact with PagerDuty: our website, our APIs, and our mobile applications. To ensure a consistent experience across all of these, we will be auditing DNS TTLs for our zones, including NS and SOA records for each zone.
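An audit of this kind can be largely automated once records are exported from the providers. The sketch below is illustrative only: the record names and target TTLs are hypothetical values, not our actual policy.

```python
# Minimal TTL-audit sketch. Assumes records were already exported from
# the DNS providers as (name, record_type, ttl) tuples. The targets
# below are hypothetical, not PagerDuty's real values.

TARGET_TTLS = {"NS": 86400, "SOA": 86400, "A": 300, "CNAME": 300}

def audit_ttls(records):
    """Return (name, rtype, actual_ttl, target_ttl) for each mismatch."""
    findings = []
    for name, rtype, ttl in records:
        target = TARGET_TTLS.get(rtype)
        if target is not None and ttl != target:
            findings.append((name, rtype, ttl, target))
    return findings

zone = [
    ("example-pd.com.", "NS", 172800),   # longer than target: slows any provider change
    ("app.example-pd.com.", "A", 300),   # matches target
]
print(audit_ttls(zone))
```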

Runbook for DNS Cache Flushing

Many public DNS providers offer the ability to proactively flush caches when records have changed. For example, Google provides this functionality via a web interface. We will be determining which DNS providers our customers use most, and documenting the steps for each provider to proactively flush caches so that up-to-date records propagate faster when possible.
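A runbook like this can be kept in machine-readable form so responders can render per-provider steps during an incident. The sketch below is a hypothetical structure, not our actual runbook; the two flush tools listed are publicly documented web forms, but URLs should be verified before relying on them.

```python
# Sketch of a machine-readable cache-flush runbook. Provider entries
# are examples of public flush tools (verify current URLs before use);
# the structure and record name are illustrative.

FLUSH_RUNBOOK = [
    {
        "provider": "Google Public DNS",
        "method": "web form",
        "url": "https://developers.google.com/speed/public-dns/cache",
        "notes": "Flush one (name, record type) pair per submission.",
    },
    {
        "provider": "OpenDNS",
        "method": "web form",
        "url": "https://cachecheck.opendns.com/",
        "notes": "Inspect cached answers and request a refresh.",
    },
]

def flush_steps(record_name):
    """Render one human-readable step per provider for `record_name`."""
    return [
        f"{entry['provider']}: open {entry['url']} and flush {record_name}"
        for entry in FLUSH_RUNBOOK
    ]

for step in flush_steps("app.example-pd.com"):
    print(step)
```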

Improve Real User Monitoring

We leverage a combination of internal monitoring systems and external providers. During this outage, we used these monitoring systems to assess customer impact and determine how best to prioritize resolution steps. Unfortunately, most of our internal systems are designed as a view from within our infrastructure, and they did not adequately capture the end-to-end user experience, especially for our customers on the east and west coasts of the US. We will invest additional resources in global monitoring that takes an external, customer-experience view of our systems and overall service offering. This includes our Website, API, and Mobile experiences, as well as our Notification experience.
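The core idea of this kind of external monitoring can be sketched simply: probe each public endpoint from outside the infrastructure and classify what a user would see. The endpoints and thresholds below are hypothetical, and the HTTP client is injected as a callable so the classification logic can be shown (and tested) without a real network call.

```python
# Toy external-probe sketch, not a real monitoring agent. In practice
# `fetch` would be an HTTP client running at a remote vantage point;
# here it is any callable url -> HTTP status code.
import time

ENDPOINTS = ["https://example-pd.com/", "https://api.example-pd.com/health"]

def probe(fetch, endpoints=ENDPOINTS, slow_ms=2000):
    """Return {url: "ok" | "slow" | "error"} as seen from outside."""
    results = {}
    for url in endpoints:
        start = time.monotonic()
        try:
            status = fetch(url)
        except Exception:
            # DNS failure, timeout, connection refused, etc.
            results[url] = "error"
            continue
        elapsed_ms = (time.monotonic() - start) * 1000
        if status != 200:
            results[url] = "error"
        elif elapsed_ms > slow_ms:
            results[url] = "slow"
        else:
            results[url] = "ok"
    return results

# Stub fetcher standing in for a real client at a remote vantage point:
print(probe(lambda url: 200))
```

During a DNS outage, probes like this fail at the resolution step, which is exactly the customer-facing signal the purely internal view was missing.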

Improve Prioritization of Resolution Steps

At PagerDuty, we use a service-oriented architecture to support the many features our customers rely on. For the majority of our customer-facing incidents, only one part of our service is affected when a disruption occurs. With a central component like DNS unavailable, multiple components of our service were impacted at once. When bringing our services back up in the future, we need the ability to prioritize the most critical services that matter most to our customers.
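One way to encode that prioritization is a restoration order that respects both a criticality tier and service dependencies, so nothing is brought up before what it depends on. The service names, tiers, and dependency edges below are hypothetical, not our actual topology.

```python
# Sketch of dependency-aware, tier-ordered restoration. Services map
# name -> {"tier": int (lower = more critical), "deps": [names]}.
# All names and tiers here are illustrative.

def restore_order(services):
    """Return a restoration order: dependencies first, then by tier."""
    order, done = [], set()

    def visit(name, seen=()):
        if name in done:
            return
        if name in seen:
            raise ValueError(f"dependency cycle at {name}")
        for dep in sorted(services[name]["deps"],
                          key=lambda d: services[d]["tier"]):
            visit(dep, seen + (name,))
        done.add(name)
        order.append(name)

    for name in sorted(services, key=lambda n: services[n]["tier"]):
        visit(name)
    return order

services = {
    "notifications": {"tier": 0, "deps": ["events"]},
    "events":        {"tier": 0, "deps": []},
    "web-ui":        {"tier": 1, "deps": ["events"]},
    "reporting":     {"tier": 2, "deps": ["events"]},
}
print(restore_order(services))  # ['events', 'notifications', 'web-ui', 'reporting']
```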

Improve Multi-team Response Process

As called out in the previous section, we have multiple teams on-call continuously to keep PagerDuty working properly. While we leverage our own product for our people orchestration efforts, we did not have all of the supporting tooling in place for certain teams involved. We plan to implement processes and improve upon our best practices so that each team can address problems in its own services effectively.


This past Friday was a difficult day for nearly every on-call engineer. At PagerDuty, we take great pride in providing a service that we know thousands of customers rely on. We did not meet the high expectations that we set for ourselves, and we are taking critical steps to continuously enhance the reliability and availability of our systems. From this experience, I am confident we will provide an even more reliable service that will be there when our customers need us the most.

As always, if you have any questions or concerns, please do not hesitate to follow up with our Support team at

Posted Oct 27, 2016 - 00:26 UTC

We will be issuing a full postmortem and next steps here over the next few days.
Posted Oct 24, 2016 - 19:34 UTC
The issue with duplicate notifications has been resolved. If you are experiencing this problem, please reach out to us at
Posted Oct 21, 2016 - 23:38 UTC
We are aware of some customers receiving duplicate phone and SMS notifications. We are actively working on resolving this issue.
Posted Oct 21, 2016 - 22:20 UTC
Acknowledgements and resolutions should now be working correctly. If you are experiencing any issues, please reach out to us at
Posted Oct 21, 2016 - 20:24 UTC
At this time, notifications are no longer delayed. We are working on correcting the inability to acknowledge or resolve incidents via phone and SMS.
Posted Oct 21, 2016 - 20:00 UTC
We are still investigating issues related to customers not being able to acknowledge or resolve incidents properly. If you are having issues reaching any address, please flush your DNS cache.
Posted Oct 21, 2016 - 19:32 UTC
We're still investigating issues related to acknowledging and resolving incidents by phone and SMS.
Posted Oct 21, 2016 - 18:32 UTC
Notifications are delayed at this time. We are working to resolve the issue.
Posted Oct 21, 2016 - 17:37 UTC
We are still investigating issues related to the accessibility of our services due to DNS failures. Some customers may not be able to connect to our services. We are working on alternative measures.
Posted Oct 21, 2016 - 17:10 UTC
We are investigating an issue with the accessibility of PagerDuty.
Posted Oct 21, 2016 - 16:25 UTC