Delay in the delivery of some webhooks
Incident Report for PagerDuty
Postmortem

Summary

From 20:50 UTC on May 6th until 02:58 UTC on May 7th, we experienced increased error rates in our outbound webhook delivery system. A small number of customers may have experienced delayed deliveries or an increased number of duplicate deliveries. The incident was partially remediated by 22:53 UTC and fully remediated by 02:58 UTC. The majority of customers were unaffected by this incident.

What Happened

A change in traffic patterns introduced a stream of large webhooks into our webhook delivery system. Due to the volume, at 20:50 UTC, our webhook delivery systems began to experience increased error rates and we started our investigation. By 21:56 UTC, some of our key metrics slipped below our internal targets and triggered a major incident response. At this point we had identified the source of the increased traffic.

At 22:15 UTC, we disabled the source of the increased traffic and our engineers worked to clear the problematic webhooks from the system. Our analysis determined that our system needed more time to process these larger webhooks. A change to allow this additional processing time was deployed at 22:53 UTC.
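As an illustration only, one way a delivery system can allow more time for larger webhooks is to scale the processing timeout with payload size, up to a cap. The report does not describe how the additional time was actually granted, so the function name, parameters, and all numeric values below are hypothetical.

```python
def delivery_timeout_seconds(
    payload_bytes: int,
    base_seconds: float = 5.0,      # hypothetical baseline timeout
    seconds_per_mib: float = 2.0,   # hypothetical extra time per MiB of payload
    max_seconds: float = 60.0,      # hypothetical upper bound
) -> float:
    """Return a timeout that grows with payload size but never exceeds a cap."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    mib = payload_bytes / (1024 * 1024)
    return min(base_seconds + seconds_per_mib * mib, max_seconds)
```

The cap matters in a sketch like this: without it, a single very large payload could hold a worker for an unbounded time and delay everything queued behind it.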

Our monitors showed improved stability metrics, with several still elevated but within acceptable ranges. Our engineers continued to clear the backlog of unnecessary webhooks to return everything to full-speed operation.

At 01:00 UTC, we switched from a mostly manual to an automated approach in order to expedite the required remediation of the backlog. We began development of a change to our queuing infrastructure to automatically remove the problematic webhooks.

At 02:36 UTC this change was deployed to our production environments. By 02:40 UTC the remaining metrics were improving, and by 02:58 UTC the system had fully recovered.
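As a rough sketch of what automatically removing problematic webhooks from a queue can look like, the example below filters a backlog by source and payload size. The actual queuing infrastructure and removal criteria are not described in this report, so the types, names, and matching rule here are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedWebhook:
    # Hypothetical shape of a queued delivery; not an actual schema.
    source_id: str
    payload_bytes: int


def purge_backlog(queue, blocked_sources, max_payload_bytes):
    """Drop oversized webhooks from blocked sources; keep all others in order.

    Returns the filtered queue and the number of items removed.
    """
    kept = deque()
    removed = 0
    for item in queue:
        oversized = item.payload_bytes > max_payload_bytes
        if item.source_id in blocked_sources and oversized:
            removed += 1
        else:
            kept.append(item)
    return kept, removed
```

Matching on both source and size, rather than on source alone, keeps normal-sized deliveries from the same source in the queue.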

What We Are Doing About It

The infrastructure change put in place to resolve this incident remains in place and will prevent very large, sudden increases in traffic from causing service degradation in the future.

We are working to improve our metrics and monitoring to be able to more quickly identify customer impact during an incident.

While this change in traffic pattern was a significant deviation from our normal operation, it was not outside the bounds of what we expect our webhook delivery system to be able to handle. We are working to improve this system’s ability to scale and provide the level of service that each and every one of you expects.

We love when more and more customers get value out of our webhook capabilities and we sincerely apologize for this degradation in performance. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted May 13, 2021 - 22:43 UTC

Resolved
Normal integration webhook functionality has been restored.
Posted May 06, 2021 - 23:17 UTC
Monitoring
We’ve taken remedial action and we are seeing signs of recovery. We are continuing to monitor.
Posted May 06, 2021 - 23:07 UTC
Investigating
A small portion of outbound webhooks and integrations are delayed. We have identified the cause, and we are pursuing a resolution.
Posted May 06, 2021 - 22:30 UTC
This incident affected: Webhooks.