Web/mobile UI and REST API issues
Incident Report for PagerDuty
Postmortem

Summary

On July 20th, starting at 03:15 UTC and lasting until 18:45 UTC, PagerDuty experienced elevated traffic from Atlassian Jira Cloud servers. Processing this traffic placed a greater load on our infrastructure, which impacted our web application's broader functionality. Due to the nature of the traffic, we were unable to rate limit the incoming requests and were forced to temporarily block the servers sending them in order to protect our infrastructure. Once the block was in place, PagerDuty web application functionality was restored.

What happened?

As a result of the increased traffic, there was a disruption of service for customers using the PagerDuty platform.

The increased traffic caused our web application to become resource starved, and our platform struggled to handle every request. The incident response team identified that the traffic was coming from one of our integration partners and, at 06:25 UTC, took the dramatic step of completely blocking that traffic until we could either handle it appropriately or stop it. By temporarily blocking the traffic, we were able to restore site functionality to all customers while we worked with our partners and mutual customers to rectify the situation.

A patch to our web application was developed to prevent the traffic from slowing the application down, and our integration partner was able to identify the cause of the increased traffic. At 15:27 UTC, when we were confident the issue wouldn't happen again, the traffic from Atlassian's Jira Cloud servers was gradually re-enabled.

During this incident, event ingestion and notification dispatching were not affected.

What are we doing about this?

We are working hard to revamp the way this critical integration interacts with the PagerDuty platform. Introducing a more modern architecture to the integration will allow us to apply more targeted rate limiting.
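
As an illustration only, targeted rate limiting is commonly built on a token bucket kept per traffic source. The sketch below is a generic example and not PagerDuty's implementation: the class names, the `source` key (for instance, an integration or account identifier), and the rate and capacity values are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Allows roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = None     # time of the previous call, if any

    def allow(self, now):
        # Refill in proportion to the time elapsed, capped at capacity.
        if self.updated is not None:
            elapsed = max(0.0, now - self.updated)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerSourceLimiter:
    """Keeps one bucket per source so a noisy source is limited on its own."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, source, now=None):
        # `source` is a hypothetical key identifying where a request came from.
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(source)
        if bucket is None:
            bucket = self.buckets[source] = TokenBucket(self.rate, self.capacity)
        return bucket.allow(now)


limiter = PerSourceLimiter(rate=1.0, capacity=2)
for request_number in range(3):
    # The third request from the same source in the same instant is rejected.
    print(request_number, limiter.allow("integration-a", now=0.0))
```

Because each source has its own bucket, a burst from one source exhausts only that source's tokens, and requests from other sources continue to be accepted.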

We are also investigating other integrations to identify whether the same pattern of traffic could cause a similar issue.

Finally, there are some small changes we are making to the integration that will allow it to be more performant, and will allow our incident response team to better identify network traffic issues.

PagerDuty systems are built to handle increased traffic and typically do so without any customer impact. The traffic that caused this incident affected a section of our architecture that has historically handled a very predictable amount of traffic, and a large, unexpected increase there had this unanticipated impact.

We know our customers depend on the reliability of both our web application and our partner integrations, and we sincerely apologize for the degradation in performance of both during this incident. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Aug 02, 2021 - 01:42 UTC

Resolved
The Jira Cloud integration is functioning properly again. This incident has been resolved.
Posted Jul 20, 2021 - 19:10 UTC
Update
Atlassian traffic is fully re-enabled. We will continue to observe for any more issues.
Posted Jul 20, 2021 - 16:06 UTC
Update
We have confirmed that we still require the temporary fix, which affects only customers with the Jira Cloud integration. Customers using this integration will currently not see any traffic from Jira Cloud to PagerDuty. We are continuing to work on a permanent solution.
Posted Jul 20, 2021 - 06:11 UTC
Update
We are currently working on a longer-term solution to the issue and are continuously monitoring for symptoms.
Posted Jul 20, 2021 - 05:24 UTC
Monitoring
We have deployed the temporary fix and are observing an improvement in the previously reported symptoms. The fix will increasingly affect traffic from customers using our Jira integration.
Posted Jul 20, 2021 - 04:49 UTC
Update
We are investigating the cause of the issue and believe it to be a very high number of requests to one endpoint from a specific integration. We are in the process of rolling out a temporary fix to address the current symptoms.
Posted Jul 20, 2021 - 04:10 UTC
Investigating
We are currently experiencing issues with our web UI, mobile app, and REST API and are actively investigating the root cause.
Posted Jul 20, 2021 - 03:36 UTC
This incident affected: Integrations (Jira Cloud (US), Jira Cloud (EU)).