On March 23rd, 2018 at 15:40 UTC, the part of PagerDuty’s inbound integration event processing service that handles inbound email for email integrations suffered a performance degradation that lasted 74 minutes. During this time, email events were delayed by more than five minutes, which in turn delayed the triggering and resolving of incidents.
A portion of our event processing cluster experienced network issues. Our email processing service was configured to retry processing and transmitting events fewer times than other components of the service before restarting its constituent hosts. As a result, many email processing hosts restarted, whereas the other components retried enough times to ride out the network issues. These repeated restarts delayed email processing and ultimately produced a backlog of email events.
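The effect of a smaller retry budget can be illustrated with a minimal sketch. This is not PagerDuty's actual code: the function names, the `TransientNetworkError` type, and the retry counts are all hypothetical, and a `False` return value simply stands in for "give up and restart the host".

```python
from typing import Callable


class TransientNetworkError(Exception):
    """Hypothetical stand-in for a temporary network failure."""


def process_with_retries(send: Callable[[str], None], event: str, max_retries: int) -> bool:
    """Try to send `event`; return True on success, False once the retry budget is spent.

    In this sketch, False stands for "give up and restart the host".
    """
    for _attempt in range(max_retries + 1):  # the first try plus max_retries retries
        try:
            send(event)
            return True
        except TransientNetworkError:
            continue
    return False


def make_flaky_sender(failures_before_success: int) -> Callable[[str], None]:
    """Build a sender that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def send(event: str) -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientNetworkError(event)

    return send


# Same simulated network blip (three consecutive failures), two illustrative retry budgets.
low_budget_survives = process_with_retries(make_flaky_sender(3), "email-event", max_retries=2)
high_budget_survives = process_with_retries(make_flaky_sender(3), "email-event", max_retries=5)
```

With the same simulated blip, the smaller budget is exhausted before the network recovers, while the larger budget outlasts it and the event goes through without a restart.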
To recover, PagerDuty engineers temporarily doubled the number of hosts in the email processing service. With this adjustment in place, the backlog of stuck email events could be processed faster, and thus event processing was able to catch up with new event submissions.
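The reason extra hosts help can be shown with a rough, back-of-the-envelope sketch. A backlog only shrinks at the rate by which processing capacity exceeds the arrival of new events, so doubling capacity can cut the drain time by much more than half. The function name and every number below are invented for illustration and are not figures from this incident.

```python
import math


def minutes_to_drain(backlog: float, arrival_rate: float, per_host_rate: float, hosts: int) -> float:
    """Estimate minutes needed to clear `backlog` events.

    Rates are in events per minute. The backlog shrinks only by the surplus of
    processing capacity over new arrivals; with no surplus it never drains.
    """
    surplus = per_host_rate * hosts - arrival_rate
    if surplus <= 0:
        return math.inf
    return backlog / surplus


# Illustrative numbers only: 6000 queued events, 100 new events per minute,
# and each host handling 15 events per minute.
before = minutes_to_drain(backlog=6000, arrival_rate=100, per_host_rate=15, hosts=8)
after = minutes_to_drain(backlog=6000, arrival_rate=100, per_host_rate=15, hosts=16)
```

In this made-up scenario, eight hosts leave a surplus of only 20 events per minute, while sixteen hosts leave a surplus of 140, so the estimated drain time drops by a factor of seven rather than two.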
We are taking steps to make the email processing hosts more resilient to the type of network issue that we experienced, so that they can continue to make forward progress when such issues occur, as our other services do.
Additionally, since most of the recovery time was spent processing the backlog of events that accumulated during the service degradation, we will investigate improved methods for processing very large backlogs of events.
We regret if this affected your team’s ability to receive alerts in a timely manner. As always, if you have any questions or concerns, feel free to contact us at firstname.lastname@example.org.