Starting on March 13th, 2018 at 19:22 UTC, PagerDuty suffered a degradation in its event processing service. The direct impact was limited to events for email-based integrations: incident triggering and other incident actions performed through email integrations were delayed.
A newly introduced service within the event processing pipeline, which handles email events, was not able to properly handle a particular unexpected error that originated downstream of it. The error was raised when a specific type of invalid event was submitted through the service. This caused a processor to exit, and due to the configuration of the service at the time, the failure eventually led to the whole service being restarted.
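The failure mode described above, where one invalid event takes down a processor, can be illustrated with a minimal sketch. This is not PagerDuty's actual code: the names `InvalidEventError`, `deliver_downstream`, and `process_events` are hypothetical, and the "invalid event" here is simply an empty string. The sketch shows one common mitigation, which is to catch errors per event and set the offending event aside so the processor keeps running.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


class InvalidEventError(Exception):
    """Hypothetical error raised by a downstream component for a bad event."""


@dataclass
class ProcessResult:
    processed: List[str] = field(default_factory=list)
    # Each quarantined entry is (original event, description of the error).
    quarantined: List[Tuple[str, str]] = field(default_factory=list)


def deliver_downstream(event: str) -> str:
    """Hypothetical downstream call that rejects empty events."""
    if not event.strip():
        raise InvalidEventError("empty email event")
    return event.upper()


def process_events(
    events: Iterable[str],
    handler: Callable[[str], str] = deliver_downstream,
) -> ProcessResult:
    """Process events one by one, isolating any failure to a single event."""
    result = ProcessResult()
    for event in events:
        try:
            result.processed.append(handler(event))
        except Exception as exc:
            # Assumption for illustration: any error, expected or not, is
            # recorded and the loop continues instead of exiting the processor.
            result.quarantined.append((event, repr(exc)))
    return result
```

With this structure, calling `process_events(["alert a", "", "alert b"])` processes the two valid events and quarantines the empty one, rather than stopping at the first failure. Whether to quarantine, retry, or drop such events is a design choice that depends on the pipeline.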
Furthermore, our primary deployment tools for this service, as well as some of our system diagnostic tools, were not functioning at the time. PagerDuty engineers had to use alternative methods to work around this, which further delayed resolution.
The engineering team was able to deploy a fix by 20:29 UTC, and the event processing pipeline returned to a steady state of ingestion and processing for most customers by 20:48 UTC. A small number of customers continued to experience issues until 07:32 UTC on March 14th.
We will be taking measures to provide more isolation of email processing service components in order to lessen the impact of similar issues in the future. Additionally, we are working on improvements to our metrics, visibility and monitoring for more effective incident response.
We apologize if this outage affected your team’s ability to receive alerts in a timely manner. As always, if you have any questions or concerns you may contact us at firstname.lastname@example.org.