Events API 500 errors
Incident Report for PagerDuty
Postmortem

Summary

Between 00:35 UTC and 15:37 UTC on Thursday, April 28th, the Events API intermittently returned 500s due to the exhaustion of connections to upstream services responsible for handling these requests.

What Happened?

On April 7th, we deployed a change to the Events API service that unintentionally bypassed the rate limiting logic within that service. As a result, requests that the Events API would previously have rate limited were passed on to an upstream service, which then performed the rate limiting instead.
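To illustrate why it matters where a rate limit is applied, here is a minimal sketch of a token bucket check that runs before any upstream call is made. It is not the Events API's actual implementation, which this report does not describe. The names `TokenBucket`, `handle_event`, and `forward` are hypothetical, and the 429 and 202 status codes are assumptions chosen for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_event(limiter: TokenBucket, forward: Callable[[], int]) -> int:
    """Reject over-limit requests locally so they never reach upstream.

    `forward` is a hypothetical stand-in for the upstream call; 429 is an
    assumed status code for a rate-limited request.
    """
    if not limiter.allow():
        return 429
    return forward()
```

With a check like this in place, an over-limit request costs no upstream connection. If the check is skipped, every request reaches the upstream service, even those that the upstream service will go on to reject.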

On April 27th, traffic to the Events API increased threefold and remained at that level. Because the change introduced on April 7th bypassed the rate limiting logic, this in turn increased the number of upstream requests. Beginning at 00:35 UTC on April 28th, the Events API service intermittently had trouble connecting to one of its upstream services because connections were being reset. During our investigation we determined that the increase in traffic had caused connections between the Events API and one particular upstream service to accumulate to the point where new connections could not be established.

As a result, roughly 0.078% of requests between 00:35 UTC and 15:37 UTC were rejected by the Events API with a 500 status response. Rejected requests are not processed, so these failed events could not create or update incidents. Change events and the REST API were not impacted.
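The connection build-up described above can be illustrated with a small sketch of a bounded connection budget that fails fast when it is exhausted and exposes a utilization figure that an alert could watch. This is an assumption-laden illustration, not a description of PagerDuty's services or network configuration. `BoundedUpstream`, `UpstreamPoolExhausted`, and `max_connections` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class UpstreamPoolExhausted(Exception):
    """Raised when no connection slot to the upstream service is free."""


class BoundedUpstream:
    """Cap concurrent upstream connections and report how full the cap is."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.in_use = 0

    @contextmanager
    def connection(self):
        # Fail fast instead of queueing: an exhausted pool is reported
        # immediately rather than surfacing later as a connection reset.
        if not self._slots.acquire(blocking=False):
            raise UpstreamPoolExhausted("no free upstream connection slots")
        with self._lock:
            self.in_use += 1
        try:
            yield
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()

    def utilization(self) -> float:
        """Fraction of the connection budget currently in use (0.0 to 1.0)."""
        with self._lock:
            return self.in_use / self.max_connections
```

Sampling `utilization()` periodically and alerting well before it reaches 1.0 is one way to notice that connections are accumulating before requests start to fail.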

What Are We Doing About This?

We have since reverted the unintentional change, so rate limits are once again applied in the Events API service. This has reduced the number of connections between the services involved. We are also putting additional monitoring in place to detect potential connection exhaustion earlier, and we are revising our network configurations to prevent the accumulation of connections and the resulting connection resets.

We apologize for the impact these failed events may have had on you and your teams. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted May 05, 2022 - 22:24 UTC

Resolved
We have been observing improvements in the behaviour of the Events API and are currently treating this incident as resolved.
Posted Apr 28, 2022 - 07:54 UTC
Update
We are continuing to explore solutions to address the current issue with the Events API elevated error rate, and will be providing further updates.
Posted Apr 28, 2022 - 07:33 UTC
Update
We are currently working on a solution to address the elevated error rate from the Events API. We will share further updates as they become available.
Posted Apr 28, 2022 - 06:50 UTC
Investigating
We are currently experiencing elevated 500 errors in the PagerDuty Events API. We are investigating and will follow up shortly when we have more information.
Posted Apr 28, 2022 - 06:21 UTC
This incident affected: Events API (Events API (US)).