Delayed or Duplicate Notifications Affecting Some Customers
Incident Report for PagerDuty
Postmortem

Summary

On February 28th, 2017, between 20:15 UTC and 22:30 UTC, PagerDuty experienced a partial degradation of service. During this period, some customers' notifications were delayed. The PagerDuty API, web, and mobile apps remained available and completely functional throughout.

As a mitigation for the ongoing issue, some customers' inbound events were throttled during the degradation. During the event, 32% of total notifications were delivered with delays, with the average delay being 12 minutes. 3.2% of our user base was impacted. All late notifications were ultimately delivered and normal service was restored by 22:30 UTC. We apologize for any inconvenience this has caused.

What Happened?

Approximately two hours before the PagerDuty incident, Amazon Web Services began to suffer a service disruption in their US-East-1 region. While PagerDuty does not have infrastructure in US-East-1, many of our customers do, and that disruption caused a large increase in traffic to our systems.

During the same period, one of the third-party providers we use for voice and SMS delivery was suffering a service degradation of their own. Our systems automatically route notifications to secondary and tertiary providers in the event of problems with one provider. In this case, the system worked as designed but the extra volume triggered rate limiting on one of the backup providers.
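The failover behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PagerDuty's actual implementation: the Provider class, the deliver function, and the simple per-window message cap are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


class ProviderUnavailable(Exception):
    """Raised when a provider cannot accept a message right now."""


@dataclass
class Provider:
    # Hypothetical model of a voice/SMS provider (assumption for illustration).
    name: str
    healthy: bool = True
    rate_limit: Optional[int] = None  # max messages per window; None means no cap
    sent: int = 0

    def send(self, message: str) -> str:
        if not self.healthy:
            raise ProviderUnavailable(f"{self.name} is degraded")
        if self.rate_limit is not None and self.sent >= self.rate_limit:
            raise ProviderUnavailable(f"{self.name} is rate limiting")
        self.sent += 1
        return self.name


def deliver(message: str, providers: List[Provider]) -> Optional[str]:
    """Try each provider in priority order; return the name of the one that accepted.

    Returns None when every provider refuses, which means the message
    must be queued and retried later (that is, it is delayed).
    """
    for provider in providers:
        try:
            return provider.send(message)
        except ProviderUnavailable:
            continue  # fall through to the next provider in the list
    return None
```

With a degraded primary and a rate-limited secondary, traffic spills over to the tertiary provider. If no remaining provider can absorb the extra volume, deliver returns None and the notification waits for a retry.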

The combined effects of increased volume and an impaired ability to deliver messages caused an internal component to become unhealthy, exacerbating the delays. During this event, we learned that we were not able to scale the component elastically to improve its ability to handle the volume. As a mitigation step, we employed throttles on the inbound event processing system to reduce the rate of incoming data to the overloaded component.
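One common way to throttle inbound traffic is a token bucket. The sketch below is only an illustration of that general technique: this report does not say which throttling mechanism was used, and the TokenBucket class, its parameters, and the explicit clock argument are assumptions made for the example.

```python
class TokenBucket:
    """Admit events at a sustained rate while allowing short bursts.

    Hypothetical example: `rate` is tokens added per second and
    `capacity` is the maximum burst size. The caller passes the current
    time explicitly so the behavior is deterministic and easy to test.
    """

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous call

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed, up to the capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per admitted event; reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Events rejected by allow would be deferred or pushed back to the sender rather than forwarded, which reduces the load on the downstream component at the cost of delaying those events.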

What Are We Doing About It?

We have identified several steps we will be taking to make sure we can handle large-scale events such as this.

Specifically:
  • We are making immediate improvements to eliminate specific bottlenecks that were identified as having caused or contributed to the delays on Feb. 28.
  • We will be undertaking a review of our internal systems to identify components that cannot scale elastically on demand and ensure that, wherever possible, we give them that ability. Where this isn’t possible, we will proactively scale up the component to ensure we can handle additional load.
  • We will undertake capacity planning for this internal component to identify bottlenecks in the notification pipeline and remove them or ensure they can handle anticipated future volumes.
  • We will be working with our providers to ensure rate limits are not in place.
  • We will be bringing an extra provider for voice calls on-stream.

Again, we genuinely apologize if this event impacted your team’s incident visibility or response. The steps outlined above are intended to prevent this type of issue from recurring. If you have any questions or concerns, please contact us at support@pagerduty.com.

Posted Mar 07, 2017 - 21:24 UTC

Resolved
We are fully recovered and all systems are operational. During this incident, our web and mobile apps as well as our REST and Events APIs were fully operational; however, phone, SMS, email, and push notifications were delayed for 3.5% of our customers.
Posted Mar 01, 2017 - 01:11 UTC
Monitoring
We are continuing to monitor the issue.
Posted Feb 28, 2017 - 23:59 UTC
Update
We are continuing to monitor the issue.
Posted Feb 28, 2017 - 22:32 UTC
Update
Notification delivery time has recovered; we are continuing to monitor the issue.
Posted Feb 28, 2017 - 22:01 UTC
Update
We are still experiencing issues with delayed notifications. Our engineers are working towards recovery, and we’ll continue updating this page as new information becomes available.
Posted Feb 28, 2017 - 21:34 UTC
Update
Notifications are still being sent with a delay at this time. PagerDuty engineers are continuing their investigation and working to reduce and eliminate delays as quickly as possible.
Posted Feb 28, 2017 - 21:03 UTC
Update
There is still a delay sending notifications at this time. We're continuing our investigation and will provide updates as they become available.
Posted Feb 28, 2017 - 20:33 UTC
Update
Notifications are still delayed at this time. Our engineers are still investigating this issue, and we’ll continue to update this page as more information becomes available.
Posted Feb 28, 2017 - 20:13 UTC
Identified
In addition to delayed phone and SMS notifications, we are now also experiencing issues with delayed push and email notifications. We are continuing our investigation and working to resolve this issue as quickly as possible.
Posted Feb 28, 2017 - 19:56 UTC
Investigating
Some of our phone and SMS providers are sending PagerDuty customers delayed or duplicate notifications. Our engineers are investigating this issue and we will update as more information becomes available. We apologize for the inconvenience.
Posted Feb 28, 2017 - 19:51 UTC