Between 00:35 UTC and 15:37 UTC on Thursday, April 28th, the Events API intermittently returned 500 errors because connections to the upstream services responsible for handling these requests had been exhausted.
On April 7th, a change was deployed to the Events API service that unintentionally bypassed the rate-limiting logic within that service. As a result, requests that would previously have been rejected locally were instead forwarded to an upstream service, which then performed the rate limiting.
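To illustrate the general pattern, the following is a minimal sketch, not our actual implementation, of in-service rate limiting using a token bucket in Python. The `forward_to_upstream` call and the specific rates are hypothetical; the point is that requests rejected by the local limiter never generate upstream traffic or upstream connections.

```python
import threading
import time

class TokenBucket:
    """Minimal token-bucket limiter: requests beyond the configured rate
    are rejected locally instead of being forwarded upstream."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at burst capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

limiter = TokenBucket(rate_per_sec=100, burst=200)  # illustrative values only

def handle_event(event):
    if not limiter.allow():
        return 429  # rejected locally; no upstream request or connection is made
    return forward_to_upstream(event)  # hypothetical call to the upstream service
```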
On April 27th, traffic to the Events API increased threefold and persisted at that level. Because the April 7th change had bypassed the in-service rate limiting, this increase translated directly into additional upstream requests. Beginning at 00:35 UTC on April 28th, the Events API service intermittently experienced issues connecting to one of its upstream services due to connections being reset. During our investigation we determined that the increase in traffic had caused the number of connections between the Events API and one particular upstream service to accumulate to the point where new connections could not be established. As a result, roughly 0.078% of requests between 00:35 UTC and 15:37 UTC were rejected by the Events API with a 500 status response. Rejected requests are not processed, so these failed events could not create or update incidents. Change events and the REST API were not impacted.
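As a further illustration of keeping connection counts bounded, the sketch below uses Python's urllib3 with a fixed-size, blocking connection pool. The hostname, path, and pool size are placeholders, not details of our infrastructure; the idea is that callers reuse a capped set of connections rather than opening an unbounded number of new ones.

```python
import urllib3

# A bounded, blocking connection pool: at most `maxsize` connections are
# opened to the upstream host, and callers wait for a free connection
# rather than opening new ones without limit.
pool = urllib3.HTTPSConnectionPool(
    host="events-upstream.internal",  # placeholder hostname
    port=443,
    maxsize=50,    # hard cap on concurrent connections to this upstream
    block=True,    # wait for a free connection instead of creating more
    retries=False,
)

def send_upstream(payload: bytes):
    # Placeholder path; reuses pooled connections for every request.
    return pool.request("POST", "/events", body=payload)
```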
We have since reverted the unintentional change, and rate limits are once again applied within the Events API service, which has reduced the number of connections between the involved services. We are also putting additional monitoring in place to detect potential connection exhaustion earlier, and revising our network configurations to prevent the accumulation of connections and the subsequent connection resets. We apologize for the impact these failed events may have had on you and your teams. For any questions, comments, or concerns, please reach out to support@pagerduty.com.