Email Integrations failing to create incidents
Incident Report for PagerDuty
Postmortem

Summary

On May 24, 2022, between 1:00 PM UTC and 3:35 PM UTC, PagerDuty experienced degradation in processing email events in both the US and EU regions. We stopped processing email events briefly during this window, preventing emails from triggering incidents.

The incident was a direct result of changes that we shipped to one of the critical services in our event ingestion pipeline. Because calls to this service happen before events are enqueued, the affected email events failed and could not be retried. During this time, customers could not trigger any incidents on our platform by email. A rollback of the change was kicked off immediately, and once the rollback had fully rolled out, we had completely recovered from the incident.

What Happened

As part of our ongoing efforts to make the event ingestion pipeline at PagerDuty more resilient to event storms, we've been making changes to rate-limit customer accounts and routing keys more effectively. On the day of the incident, we shipped a change that would validate incoming events' routing keys before running the rate-limit checks and subsequently accepting the events. The shipped validation logic did not correctly validate email routing keys. As a result, we started seeing failures in our pipeline processing email events. The email events were dropped because the service couldn't establish the validity of the email routing keys.
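The failure mode described above can be sketched in a few lines. This is an illustration only, not PagerDuty's code: the key formats (a 32-character hex string for API events, an email address for email events), the Event type, and the function names validate_api_only, validate_by_source, and ingest are all assumptions made for the example. It shows how a validator that knows only one key format rejects every email routing key, and how a rejection that happens before the enqueue step leaves nothing behind to retry.

```python
import re
from collections import deque
from dataclasses import dataclass

# Hypothetical key formats, chosen only for illustration.
API_KEY_RE = re.compile(r"[0-9a-f]{32}")
EMAIL_KEY_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


@dataclass
class Event:
    source: str       # "api" or "email"
    routing_key: str


def validate_api_only(event: Event) -> bool:
    """Faulty validator: treats every routing key as an API-style key."""
    return API_KEY_RE.fullmatch(event.routing_key) is not None


def validate_by_source(event: Event) -> bool:
    """Validator that picks the expected key format from the event source."""
    pattern = EMAIL_KEY_RE if event.source == "email" else API_KEY_RE
    return pattern.fullmatch(event.routing_key) is not None


def ingest(event, validate, queue, rate_limit_ok=lambda e: True) -> bool:
    """Validate, then rate-limit, then enqueue.

    A rejected event never reaches the queue, so no downstream
    consumer has a copy of it to retry.
    """
    if not validate(event):
        return False
    if not rate_limit_ok(event):
        return False
    queue.append(event)
    return True


# Hypothetical email routing key.
email_event = Event("email", "alerts@example.pagerduty.com")

dropped_queue = deque()
dropped = ingest(email_event, validate_api_only, dropped_queue)    # rejected

accepted_queue = deque()
accepted = ingest(email_event, validate_by_source, accepted_queue)  # enqueued
```

With the faulty validator the email event is rejected and its queue stays empty; with the source-aware validator the same event is enqueued. A unit test that feeds at least one key of each supported format through the validator would exercise this path.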

After receiving some customer reports and verifying them against our monitoring tools, we quickly established the connection between the service rollout and the incident. Our engineers immediately kicked off our rollback procedures to revert the build and return to the last stable version of the service. After the rollback was complete, email event processing resumed, and we had fully recovered.

What We Are Doing About This

Following the incident, our teams conducted a thorough investigation into the factors leading up to it and identified several action items to ensure incidents like this don't happen in the future. The action items include the following:

  • Added monitoring to alert us to any anomalies in processing email events
  • Enhanced our test suite for the services in question
  • Investing in improving our automated canary analysis checks to prevent faulty builds from hitting production
  • Tweaking our rollback configurations for the services in question to allow for a faster rollback of faulty deployments
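
As an illustration of the canary analysis item above, the sketch below compares failure rates per event type between a baseline build and a canary build. It is a hypothetical example, not PagerDuty's canary tooling: the Counts type, the canary_regressions function, the thresholds, and the traffic numbers are all assumptions. The source does not say why the faulty build passed existing checks; the example only shows the general point that a low-volume event type can fail completely while the aggregate failure rate barely moves.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    accepted: int
    failed: int

    @property
    def failure_rate(self) -> float:
        total = self.accepted + self.failed
        return self.failed / total if total else 0.0


def canary_regressions(baseline, canary, max_increase=0.05, min_events=20):
    """Return event types whose canary failure rate exceeds the baseline
    rate by more than max_increase.

    Event types with fewer than min_events canary events are skipped
    rather than judged on too little data.
    """
    regressed = []
    for event_type, counts in canary.items():
        if counts.accepted + counts.failed < min_events:
            continue
        base = baseline.get(event_type, Counts(0, 0))
        if counts.failure_rate - base.failure_rate > max_increase:
            regressed.append(event_type)
    return sorted(regressed)


# Made-up traffic: email is about 1% of events and fails entirely on the canary.
baseline = {"api": Counts(9900, 100), "email": Counts(99, 1)}
canary = {"api": Counts(9900, 100), "email": Counts(0, 100)}

regressed = canary_regressions(baseline, canary)
```

In this made-up data the overall canary failure rate rises by roughly one percentage point, which stays under the 5-point threshold, while the per-type comparison flags the email event type.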

We apologize for our failure to process these events and the impact on you and your teams. As always, we stand by our commitment to providing the industry's most reliable and resilient platform. If you have any questions, please reach out to support@pagerduty.com.

Posted Jun 01, 2022 - 12:53 UTC

Resolved
This incident has been resolved.
Posted May 24, 2022 - 15:35 UTC
Update
We continue to roll out our fix while noticing a steady recovery in email processing.
Posted May 24, 2022 - 15:32 UTC
Monitoring
We have identified the issue and a fix is underway. We are seeing signs of recovery.
Posted May 24, 2022 - 15:15 UTC
Investigating
We are investigating potential issues where email integrations are not creating incidents within PagerDuty.
Posted May 24, 2022 - 15:05 UTC
This incident affected: Notification Delivery (Notification Delivery (US), Notification Delivery (EU)).