From 20:50 UTC on May 6th until 02:58 UTC on May 7th, we experienced increased error rates in our outbound webhook delivery system. A small number of customers may have experienced delayed deliveries or an increased number of duplicate deliveries. The incident was partially remediated by 22:53 UTC and fully remediated by 02:58 UTC. The majority of customers were unaffected by this incident.
A change in traffic patterns introduced a stream of large webhooks into our webhook delivery system. Due to the volume, at 20:50 UTC our webhook delivery system began to experience increased error rates, and we started our investigation. By 21:56 UTC, some of our key metrics had slipped below our internal targets and triggered a major incident response. By this point we had identified the source of the increased traffic.
At 22:15 UTC, we disabled the source of the increased traffic, and our engineers worked to clear the problematic webhooks from the system. Our analysis determined that our system needed more time to process these larger webhooks. We deployed a change to allow for this at 22:53 UTC.
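As an illustration of what "more time to process these larger webhooks" can look like, here is a minimal sketch of a delivery timeout that grows with payload size. It is not the change we deployed: the function name delivery_timeout_seconds and every number in it (base time, per-MiB allowance, upper cap) are hypothetical values chosen for the example.

```python
def delivery_timeout_seconds(
    payload_bytes: int,
    base_seconds: float = 5.0,      # hypothetical floor for any delivery
    seconds_per_mib: float = 2.0,   # hypothetical extra allowance per MiB of payload
    max_seconds: float = 60.0,      # hypothetical cap so one delivery cannot hold a worker forever
) -> float:
    """Return a processing timeout that scales linearly with payload size, up to a cap."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    payload_mib = payload_bytes / (1024 * 1024)
    return min(base_seconds + payload_mib * seconds_per_mib, max_seconds)
```

The cap matters as much as the scaling: without it, a stream of very large payloads could occupy workers for arbitrarily long and starve normal deliveries.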
Our monitors showed improved stability metrics, with several still elevated but within acceptable ranges. Our engineers continued to clear the backlog of unnecessary webhooks to return the system to full-speed operation.
At 01:00 UTC, we switched from a mostly manual approach to an automated one in order to clear the backlog faster. We began developing a change to our queuing infrastructure to automatically remove the problematic webhooks.
At 02:36 UTC, this change was deployed to our production environments. By 02:40 UTC, the remaining metrics were improving, and by 02:58 UTC the system had fully recovered.
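The automated removal described above can be pictured as a single pass over the queue that drops items matching a rule and keeps everything else in order. The sketch below is illustrative only and does not reflect our actual queuing infrastructure: the QueuedWebhook type, the purge_backlog function, and the rule itself (a blocked source combined with a size threshold) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Set


@dataclass(frozen=True)
class QueuedWebhook:
    """Hypothetical shape of a queued delivery; field names are assumptions."""
    webhook_id: str
    source: str
    payload_bytes: int


def purge_backlog(
    queue: Deque[QueuedWebhook],
    blocked_sources: Set[str],
    max_payload_bytes: int,
) -> List[QueuedWebhook]:
    """Remove oversized webhooks from blocked sources, in place.

    Returns the removed items so they can be logged or audited.
    The relative order of the remaining items is preserved.
    """
    kept: Deque[QueuedWebhook] = deque()
    removed: List[QueuedWebhook] = []
    while queue:
        item = queue.popleft()
        if item.source in blocked_sources and item.payload_bytes > max_payload_bytes:
            removed.append(item)
        else:
            kept.append(item)
    queue.extend(kept)
    return removed
```

Requiring both conditions (source and size) keeps the rule narrow, so large webhooks from unrelated sources and small webhooks from the blocked source are still delivered.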
The infrastructure change made to resolve this incident remains in place and will prevent very large, sudden increases in traffic from causing service degradation in the future.
We are working to improve our metrics and monitoring so that we can identify customer impact more quickly during an incident.
While this change in traffic pattern was a significant deviation from our normal operation, it was not outside the bounds of what we expect our webhook delivery system to be able to handle. We are working to improve this system’s ability to scale and to provide the level of service that each and every one of you expects.
We love it when more and more customers get value out of our webhook capabilities, and we sincerely apologize for this degradation in performance. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.