Email integration delays
Incident Report for PagerDuty
Postmortem

Summary

Starting on March 13th, 2018 at 19:22 UTC, PagerDuty suffered a degradation in its event processing service. The direct impact was limited to events for email-based integrations: incident triggering and other incident actions performed through email integrations were delayed.

What Happened?

A newly introduced service within the event processing pipeline, which handles email events, was not able to properly handle a particular unexpected error downstream of it. The error was raised when a specific type of invalid event was submitted through the service. This caused a processor to exit, and due to the configuration of the service at the time, the failure eventually led to the whole service being restarted.
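
As an illustration only, the failure mode described above (one invalid event causing a processor to exit) is commonly avoided by isolating errors per event. The following is a minimal sketch under stated assumptions, not PagerDuty's actual implementation: the names process_events, example_handler, InvalidEventError and ProcessingResult are hypothetical, and events are modeled as plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


class InvalidEventError(Exception):
    """Hypothetical error raised downstream for an event that cannot be processed."""


@dataclass
class ProcessingResult:
    # Events that were handled successfully, in order.
    processed: List[str] = field(default_factory=list)
    # (event, error description) pairs set aside for later inspection.
    dead_letters: List[Tuple[str, str]] = field(default_factory=list)


def process_events(events: Iterable[str], handle: Callable[[str], None]) -> ProcessingResult:
    """Handle each event independently so one bad event cannot stop the loop."""
    result = ProcessingResult()
    for event in events:
        try:
            handle(event)
        except Exception as exc:
            # Assumption: any per-event failure is recorded and skipped
            # rather than allowed to terminate the processor.
            result.dead_letters.append((event, repr(exc)))
            continue
        result.processed.append(event)
    return result


def example_handler(event: str) -> None:
    """Hypothetical downstream step that rejects blank events."""
    if not event.strip():
        raise InvalidEventError("empty email event")
```

In this sketch, calling process_events(["a", "  ", "b"], example_handler) handles the two valid events and sets the blank one aside instead of exiting. Catching every exception is a deliberate simplification; a real service would likely distinguish retryable errors from permanently invalid input.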

Furthermore, our primary deployment tools for this service, as well as some of our system diagnostic tools, were not functioning at the time. PagerDuty engineers had to use alternative methods to work around this, which further delayed resolution.

The engineering team was able to deploy a fix by 20:29 UTC, and the event processing pipeline returned to a steady state of ingestion and processing for most customers by 20:48 UTC. A small number of customers continued to experience issues until 07:32 UTC on March 14th.

What are we doing about this?

We will be taking measures to provide more isolation of email processing service components in order to lessen the impact of similar issues in the future. Additionally, we are working on improvements to our metrics, visibility and monitoring for more effective incident response.

We apologize if this outage affected your team’s ability to receive alerts in a timely manner. As always, if you have any questions or concerns you may contact us at support@pagerduty.com.

Posted Mar 20, 2018 - 20:30 UTC

Resolved
Our systems have recovered.
Posted Mar 13, 2018 - 20:45 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 13, 2018 - 20:40 UTC
Update
Notification delivery has now been confirmed as fully operational, although incident triggering is still affected.
Posted Mar 13, 2018 - 20:29 UTC
Update
We are currently experiencing delays in notification delivery in addition to email event processing. Our engineering team is currently taking action.
Posted Mar 13, 2018 - 20:11 UTC
Investigating
We are currently experiencing delays in event processing for email integrations. All other types of integrations are unaffected.
Posted Mar 13, 2018 - 19:47 UTC