Issues affecting Events v1 API
Incident Report for PagerDuty
Postmortem

Summary

Between 19:05 UTC and 20:13 UTC on April 23rd, 2021, PagerDuty experienced an incident that caused event ingestion to operate in a degraded state.

During this period, some invalid event submissions were rejected with an HTTP 500 response when they should have been rejected with an HTTP 400 response.

What Happened

A change deployed to the Events API service caused invalid v1 generic events to be rejected at an earlier stage of processing than they should have been. As a result, some rejections returned an HTTP 500 response code rather than the expected HTTP 400 response code.
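
This report does not describe the code involved. As a purely hypothetical illustration of this class of failure, the sketch below shows how a validation check that runs before the layer responsible for translating validation errors into HTTP 400 can surface as an HTTP 500 instead. Every name here (`ValidationError`, `validate_event`, `handle_correctly`, `handle_with_early_rejection`) and the validation rule itself are assumptions made for illustration, not PagerDuty's implementation.

```python
class ValidationError(Exception):
    """Raised when an event payload is invalid (hypothetical)."""


def validate_event(event: dict) -> None:
    # Assumed rule for illustration: an event needs a non-empty service_key.
    if not event.get("service_key"):
        raise ValidationError("service_key is required")


def handle_correctly(event: dict) -> int:
    """Validation runs inside the layer that maps its errors to HTTP 400."""
    try:
        validate_event(event)
    except ValidationError:
        return 400
    return 200


def handle_with_early_rejection(event: dict) -> int:
    """Validation also runs at an earlier stage, outside the 400 mapping."""
    try:
        # Nothing at this stage translates ValidationError into a 400.
        validate_event(event)
        return handle_correctly(event)
    except Exception:
        # Stand-in for a framework's generic "unhandled error" response.
        return 500
```

With an empty payload, `handle_correctly({})` returns 400 while `handle_with_early_rejection({})` returns 500. Valid payloads return 200 from both, which is one reason a change of this kind can pass checks that only exercise valid requests.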

To restore the expected behavior, the change to the Events API service was rolled back.

What We Are Doing About This

In addition to rolling back the offending change, we are working on addressing the following factors that contributed to this issue:

  • Improved monitoring: We are improving the Events API system health checks so that they better capture true system health, and we are adding monitoring that alerts us sooner to unexpected system behavior so that we can roll back any pending changes.
  • Additional testing: We are extending test coverage of our API contract to more cases to ensure that the Events API does not produce an unexpected response.
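
As a rough sketch of what contract coverage of this kind can look like, the example below runs a table of invalid request bodies against a handler and reports any that do not receive a 4xx status. The handler `handle_event`, its validation rules, the `INVALID_BODIES` table, and the helper `contract_violations` are all hypothetical stand-ins, not PagerDuty's service or test suite.

```python
import json


def handle_event(body: str) -> int:
    """Hypothetical stand-in for the endpoint under test; returns an HTTP status."""
    try:
        event = json.loads(body)
    except json.JSONDecodeError:
        return 400
    # Assumed validation rules, for illustration only.
    if not isinstance(event, dict) or not event.get("service_key"):
        return 400
    if event.get("event_type") not in {"trigger", "acknowledge", "resolve"}:
        return 400
    return 200


# Invalid submissions that the contract says must be rejected with a 4xx.
INVALID_BODIES = [
    "",                                               # empty body
    "not json",                                       # malformed JSON
    "[]",                                             # wrong top-level type
    "{}",                                             # missing required fields
    '{"service_key": "abc"}',                         # missing event_type
    '{"service_key": "abc", "event_type": "bogus"}',  # unknown event_type
]


def contract_violations(handler, bodies):
    """Return (body, status) pairs where an invalid body did not get a 4xx."""
    violations = []
    for body in bodies:
        status = handler(body)
        if not 400 <= status < 500:
            violations.append((body, status))
    return violations
```

A test would assert that `contract_violations(handle_event, INVALID_BODIES)` is empty. Checking the status class for every invalid case, rather than only that the request was rejected, is what would catch a rejection that has silently changed from 400 to 500.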

We will do everything we can to learn from this event and make the improvements necessary to uphold the high standard of availability we have to serve the needs of our customers.

Finally, we’d like to apologize for any impact this had on our customers. If you have any further questions, please contact support@pagerduty.com.

Posted May 04, 2021 - 22:33 UTC

Resolved
We have deployed a fix and have fully recovered.
Posted Apr 23, 2021 - 20:16 UTC
Identified
We are aware of issues affecting some Events v1 API ingestion and we are currently working on a resolution.
Posted Apr 23, 2021 - 20:11 UTC
This incident affected: Events API (Events API (US)).