On May 24th, between 1:00 PM and 3:35 PM UTC, PagerDuty experienced degraded processing of email events in both the US and EU regions. We briefly stopped processing email events during this window, which prevented emails from triggering incidents.
The incident was a direct result of changes that we shipped to one of the critical services in our event ingestion pipeline. Because calls to this service happen before events are enqueued, the failed email events could not be retried. During this time, customers could not trigger any incidents on our platform by email. We immediately kicked off a rollback of the change, and once the rollback completed, we had fully recovered from the incident.
As part of our ongoing efforts to make the event ingestion pipeline at PagerDuty more resilient to event storms, we've been making changes to rate-limit customer accounts and routing keys more effectively. On the day of the incident, we shipped a change that would validate incoming events' routing keys before running the rate-limit checks and subsequently accepting the events. The shipped validation logic did not correctly validate email routing keys. As a result, we started seeing failures in our pipeline processing email events. The email events were dropped because the service couldn't establish the validity of the email routing keys.
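The ordering described above (validate the routing key, then run the rate-limit check, then accept the event) can be illustrated with a minimal sketch. This is not PagerDuty's implementation: the key formats, the `Event` type, the `RateLimiter` class, and the `ingest` function are all hypothetical names invented for illustration. The point it shows is that a validator placed ahead of every other step needs a rule for each kind of routing key, including email, because anything it rejects never reaches the queue.

```python
import re
from dataclasses import dataclass

# Hypothetical key formats, invented for this sketch only.
API_KEY_RE = re.compile(r"[0-9a-f]{32}")
EMAIL_KEY_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


@dataclass
class Event:
    channel: str      # "api" or "email" (assumed channel names)
    routing_key: str


def is_valid_routing_key(event):
    """Validate the key with the rule that matches the event's channel."""
    if event.channel == "api":
        return API_KEY_RE.fullmatch(event.routing_key) is not None
    if event.channel == "email":
        return EMAIL_KEY_RE.fullmatch(event.routing_key) is not None
    return False


class RateLimiter:
    """Toy per-key counter; a real limiter would use a time window."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def allow(self, key):
        seen = self.counts.get(key, 0)
        if seen >= self.limit:
            return False
        self.counts[key] = seen + 1
        return True


def ingest(event, limiter, queue):
    """Validate first, then rate-limit, then enqueue."""
    if not is_valid_routing_key(event):
        # Rejected before enqueueing, so nothing is left to retry later.
        return "rejected"
    if not limiter.allow(event.routing_key):
        return "rate_limited"
    queue.append(event)
    return "accepted"
```

A validator that only knew the API key pattern would return "rejected" for every email event in this sketch, which is why a test suite for such a change would want at least one case per routing key type.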
After receiving some customer reports and verifying them against our monitoring tools, we quickly established the connection between the service rollout and the incident. Our engineers immediately kicked off our rollback procedures to revert the build and go back to the last stable version of the service. After the rollback was complete, email event processing resumed, and we were in full recovery.
Following the incident, our teams conducted a thorough investigation into the contributing factors and identified several action items to prevent similar incidents in the future. The action items include the following:
We apologize for our failure to process these events and the impact on you and your teams. As always, we stand by our commitment to providing the industry's most reliable and resilient platform. If you have any questions, please reach out to firstname.lastname@example.org