On October 2nd, from 21:10 UTC to 22:10 UTC, PagerDuty experienced a major incident that degraded event ingestion via the Events API.
During this period some event submissions were rejected with an HTTP 50X response.
A change deployed to the Events API service severely impacted the service over time. The issue caused the Events API service to prematurely close the connection from the load balancer before returning a response to the client for the majority of requests. Eventually this caused all of the servers in the cluster to be marked as unhealthy by the load balancers. During the incident, the Events API responded to Events API clients with an HTTP 50X response for roughly 95% of requests; the remaining 5% were served by a subset of the fleet that had not received the problematic version. Ultimately there were almost no healthy servers available to accept events, and new requests were rejected by the load balancers. To restore service, we rolled back the relevant changes to the Events API service.
We are currently addressing multiple contributing factors for this issue. The steps that are planned or already in progress are:
We will do everything we can to learn from this event and make the improvements necessary to uphold the high standard of availability we have to serve the needs of our customers.
Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to firstname.lastname@example.org.