Event Ingestion Delays
Incident Report for PagerDuty
Postmortem

Summary

On October 2nd, 2020, from 21:10 UTC to 22:10 UTC, PagerDuty experienced a major incident that degraded event ingestion via the Events API.

During this period, some event submissions were rejected with an HTTP 50X response.

What Happened

A change deployed to the Events API service severely degraded the main Events API service over time. The resulting issue caused the service to close the connection from the load balancer prematurely, before returning a response to the client, for the majority of requests. Eventually this caused all of the servers in the cluster to be marked as unhealthy by the load balancers.

During the incident, the Events API returned HTTP 50X responses to Events API clients for roughly 95% of requests; the remaining 5% were served by a subset of the fleet that had not received the problematic version. Ultimately there were almost no healthy servers available to accept events, and the load balancers rejected new requests. To restore service, we rolled back the relevant changes to the Events API service.
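The failure mode described above, in which repeated health check failures remove servers from rotation until none remain, can be illustrated with a toy model. This is only a sketch and does not describe PagerDuty's actual load balancer configuration: the `Backend` and `LoadBalancer` classes and the `unhealthy_threshold` value of 3 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    consecutive_failures: int = 0
    healthy: bool = True


@dataclass
class LoadBalancer:
    backends: list
    # Assumed value for illustration; the report does not state a threshold.
    unhealthy_threshold: int = 3

    def record_health_check(self, backend, passed):
        """Update a backend's state from one health check result."""
        if passed:
            backend.consecutive_failures = 0
            backend.healthy = True
            return
        backend.consecutive_failures += 1
        if backend.consecutive_failures >= self.unhealthy_threshold:
            backend.healthy = False

    def route(self):
        """Return the first healthy backend, or None if the request is rejected."""
        for backend in self.backends:
            if backend.healthy:
                return backend
        return None


# Two hypothetical servers that both run the faulty version.
lb = LoadBalancer(backends=[Backend("events-1"), Backend("events-2")])
for _ in range(lb.unhealthy_threshold):
    for backend in lb.backends:
        lb.record_health_check(backend, passed=False)

# With no healthy backends left, the load balancer has nowhere to send requests.
rejected = lb.route() is None
```

In this model a single server that keeps passing its health checks continues to receive traffic, which corresponds to the small share of requests that were still served during the incident.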

What We Are Doing About This

We are currently addressing multiple contributing factors for this issue. The steps that are planned or already in progress are:

  • Monitoring changes: We are making alerting on failed health checks for the Events API service more robust.
  • Faster deploys: We have identified improvements to our deployment strategy for the Events API service that better serve both emergency and nominal operations and allow faster rollbacks.
  • Error visibility: We are improving our logging to give better error visibility in scenarios like this one.
  • Deployment health check: We are also changing the way our deployments perform validation to ensure they explicitly fail when they encounter these types of errors.
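As an illustration of the last item, a deployment validation step that fails explicitly might look like the sketch below. It is a minimal example under stated assumptions, not PagerDuty's actual tooling: the function `validate_deployment`, the `DeploymentValidationError` exception, and the 1% error budget are hypothetical. A `None` entry stands for a probe whose connection was closed before any response arrived.

```python
class DeploymentValidationError(Exception):
    """Raised when a newly deployed version fails validation."""


def validate_deployment(status_codes, max_error_ratio=0.01):
    """Check probe results from freshly deployed servers.

    status_codes: HTTP status codes from validation probes; None means
        the connection closed before a response was returned.
    max_error_ratio: assumed error budget (hypothetical value).
    Returns the observed error ratio, or raises if validation fails.
    """
    if not status_codes:
        # No data is treated as a failure rather than a silent pass.
        raise DeploymentValidationError("no probe responses were collected")

    errors = sum(1 for code in status_codes if code is None or code >= 500)
    ratio = errors / len(status_codes)
    if ratio > max_error_ratio:
        raise DeploymentValidationError(
            f"{errors} of {len(status_codes)} probes failed "
            f"(ratio {ratio:.2%} exceeds {max_error_ratio:.2%})"
        )
    return ratio
```

Raising an exception, instead of returning a flag that a caller could ignore, means the deploy pipeline stops unless the failure is handled deliberately.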

We will do everything we can to learn from this event and make the improvements necessary to uphold the high standard of availability we have to serve the needs of our customers.

Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Oct 08, 2020 - 18:04 UTC

Resolved
This incident has been resolved.
Posted Oct 02, 2020 - 22:26 UTC
Monitoring
Event ingestion has returned to normal levels and we are monitoring the results.
Posted Oct 02, 2020 - 22:23 UTC
Update
We are continuing to work on a fix for this issue.
Posted Oct 02, 2020 - 22:19 UTC
Identified
We have identified a potential fix, which we are currently deploying. We are seeing some initial improvements.
Posted Oct 02, 2020 - 22:18 UTC
Investigating
We are currently experiencing issues with event ingestion. We're investigating.
Posted Oct 02, 2020 - 21:43 UTC
This incident affected: Events API and Notification Delivery.