On July 20th, from 03:15 UTC until 18:45 UTC, PagerDuty experienced elevated traffic from Atlassian Jira Cloud servers. Processing this traffic placed significant additional load on our infrastructure and impaired our web application's broader functionality. Because of the nature of the traffic, we were unable to rate limit the incoming requests and were forced to temporarily block the servers sending them to protect our infrastructure. Once the block was in place, PagerDuty web application functionality was restored.
As a result of the increased traffic, there was a disruption of service for customers using the PagerDuty platform.
The increased traffic starved our web application of resources, and our platform struggled to handle every request. The incident response team identified that the traffic was coming from one of our integration partners and, at 06:25 UTC, took the drastic step of completely blocking that traffic until we could either handle it appropriately or stop it at the source. By temporarily blocking the traffic we restored site functionality for all customers while we worked with our partner and mutual customers to rectify the situation.
A patch to our web application was developed to prevent the traffic from slowing the application down, and our integration partner was able to identify the cause of the increased traffic. At 15:27 UTC, once we were confident the issue would not recur, traffic from Atlassian's Jira Cloud servers was gradually re-enabled.
Event ingestion and notification dispatching were not affected during this incident.
We are revamping the way this critical integration interacts with the PagerDuty platform, introducing a more modern architecture that will allow more targeted rate limiting.
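Targeted rate limiting of this kind is commonly built as a per-source token bucket, so that a spike from one integration exhausts only that integration's budget rather than the whole platform's capacity. The sketch below illustrates the idea in Python; all names, rates, and the `allow_request` helper are illustrative assumptions, not PagerDuty's actual implementation:

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Simple token bucket: refills at `rate` tokens/second, bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    updated: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        # Start full so a source can burst immediately.
        self.tokens = self.capacity

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per integration source; a flood from a single partner
# (e.g. "jira-cloud") is throttled without affecting other traffic.
buckets: dict[str, TokenBucket] = {}


def allow_request(integration_id: str, rate: float = 50.0, burst: float = 100.0) -> bool:
    bucket = buckets.setdefault(integration_id, TokenBucket(rate, burst))
    return bucket.allow()
```

With a scheme like this, requests over the budget can be rejected with an HTTP 429 response instead of being absorbed until the application is resource starved, which is the failure mode described above.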
We are also investigating other integrations to identify whether the same pattern of traffic could cause a similar issue.
Finally, we are making some small changes to the integration that will improve its performance and help our incident response team better identify network traffic issues.
PagerDuty systems are built to handle increased traffic and typically do so without any customer impact. The traffic that caused this incident hit a section of our architecture that has historically handled a very predictable volume of traffic, so a large, unexpected increase was able to have this unanticipated impact.
We know our customers depend on the reliability of both our web application and our partner integrations, and we sincerely apologize for the degradation in performance of both during this incident. For any questions, comments, or concerns, please contact us at email@example.com.