On February 28th, 2017, between 20:15 UTC and 22:30 UTC, PagerDuty experienced a partial degradation of service. During this period, some customers' notifications were delayed. The PagerDuty API, web, and mobile apps remained available and fully functional throughout.
To mitigate the issue, we throttled some customers' inbound events during the degradation. During the event, 32% of all notifications were delivered late, with an average delay of 12 minutes, and 3.2% of our user base was impacted. All late notifications were ultimately delivered, and normal service was restored by 22:30 UTC. We apologize for any inconvenience this has caused.
Approximately two hours before the PagerDuty incident, Amazon Web Services began to suffer a service disruption in their US-East-1 region. While PagerDuty does not have infrastructure in US-East-1, many of our customers do, and that disruption caused a large increase in traffic to our systems.
During the same period, one of the third-party providers we use for voice and SMS delivery was suffering a service degradation of its own. Our systems automatically route notifications to secondary and tertiary providers when one provider has problems. In this case, the system worked as designed, but the extra volume triggered rate limiting on one of the backup providers.
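To illustrate the general idea of routing through an ordered list of providers, here is a minimal sketch. It is not PagerDuty's implementation: the exception types, the `send_with_failover` function, and the provider callables are all hypothetical names chosen for this example. It shows how a rate-limited backup simply becomes one more failed attempt in the chain.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error: the provider is degraded or unreachable."""


class RateLimited(Exception):
    """Hypothetical error: the provider rejected the send due to rate limiting."""


def send_with_failover(message, providers):
    """Try each provider in priority order.

    `providers` is a list of (name, send_callable) pairs, assumed to be
    ordered primary, secondary, tertiary. Returns a tuple of
    (name_of_provider_that_accepted_or_None, list_of_failed_attempts).
    """
    failures = []
    for name, send in providers:
        try:
            send(message)
            return name, failures
        except (ProviderUnavailable, RateLimited) as exc:
            # Record the failure and fall through to the next provider.
            failures.append((name, type(exc).__name__))
    # Every provider failed; the caller would need to queue and retry later.
    return None, failures
```

If every provider in the list fails, the message has to wait for a retry, which is one way delivery delays can build up under load.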
The combined effects of increased volume and an impaired ability to deliver messages caused an internal component to become unhealthy, exacerbating the delays. During this event, we learned that we were not able to scale the component elastically to improve its ability to handle the volume. As a mitigation step, we employed throttles on the inbound event processing system to reduce the rate of incoming data to the overloaded component.
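One common way to throttle inbound traffic is a token bucket. The sketch below is illustrative only and makes no claim about how the throttles described above were actually built. The `TokenBucket` class and its `rate` and `capacity` parameters are assumptions for this example, and the caller passes in the current time so the behavior is easy to reason about.

```python
class TokenBucket:
    """Illustrative token-bucket throttle (hypothetical, not PagerDuty's code).

    `rate` is the number of tokens added per second and `capacity` is the
    maximum burst size. Each admitted event consumes one token.
    """

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now):
        """Return True if an event arriving at time `now` may proceed."""
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Out of tokens: the event is throttled (rejected or deferred).
        return False
```

Lowering `rate` for the affected traffic reduces how quickly events reach a downstream component, at the cost of delaying or rejecting the excess.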
We have identified several steps we will be taking to make sure we can handle large-scale events such as this.
Again, we genuinely apologize if this event impacted your team’s incident visibility or response. The steps outlined above are intended to prevent this type of issue from recurring. If you have any questions or concerns, please contact us at firstname.lastname@example.org.