Event Processing Delays
Incident Report for PagerDuty
Postmortem

Summary

On Aug 24, 2017, from 04:11 to 05:10 UTC, our inbound event processing service suffered a performance degradation. During this time, some received events took abnormally long (between 5 and 30 minutes) to translate into actions such as triggering and resolving incidents.

What Happened?

At around 04:00 UTC, in a component of the service responsible for data storage, hosts began running out of memory while processing a significant influx of very large events. This was due to an issue in the JSON parsing library used in this part of our infrastructure.
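
As an illustration only, and not a description of PagerDuty's actual fix, one common guard against this class of failure is to cap the payload size before handing the bytes to a JSON decoder. The names `MAX_EVENT_BYTES`, `EventTooLargeError`, and `decode_event`, as well as the 512 KiB limit, are hypothetical; the report does not state what size counts as "very large".

```python
import json

# Hypothetical limit; the report does not say how large the events were.
MAX_EVENT_BYTES = 512 * 1024


class EventTooLargeError(ValueError):
    """Raised when an inbound event exceeds the configured size cap."""


def decode_event(raw: bytes, max_bytes: int = MAX_EVENT_BYTES) -> dict:
    """Decode a JSON event, rejecting oversized payloads before parsing."""
    # Check the size first so the decoder never allocates memory
    # for a payload that is already known to be too large.
    if len(raw) > max_bytes:
        raise EventTooLargeError(
            f"event is {len(raw)} bytes, limit is {max_bytes}"
        )
    event = json.loads(raw)
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    return event
```

A size cap of this kind bounds the input to the decoder, but it does not bound the decoder's own memory overhead, which is why replacing the library can still be necessary.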

Ordinarily, new hosts would have been provisioned automatically to compensate. However, the image used for provisioning hosts had been corrupted due to an accidental upload earlier that day, and so hosts could not be re-provisioned. This caused the service to operate at reduced capacity.

After investigation by the engineering team, the storage service was reverted to an earlier stable version, and PagerDuty engineers worked to bring the pipeline back to full speed. By 05:10 UTC, our event processing service had fully recovered.

What Are We Doing About This?

We have evaluated and implemented a replacement for the JSON library, which used a significant amount of memory when decoding large events. We are also working towards putting additional checks in place around host image deployment and scaling of the affected data storage service. Additionally, we are working on methods to recover faster from errors in this part of our event processing service.
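
One possible form of a host image deployment check, sketched here as an assumption rather than a description of what PagerDuty built, is to compare an image's SHA-256 digest against a known-good value before the image is made available for provisioning. The function names and the existence of a recorded expected digest are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large images are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_intact(path: str, expected_sha256: str) -> bool:
    """Return True if the image at `path` matches the expected digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.lower())
```

In a deployment pipeline, a check like this would run before an uploaded image replaces the one used for automatic provisioning, so that a corrupted upload is rejected instead of blocking new hosts.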

We sincerely apologize for the inconvenience this may have caused. If you have questions or concerns, please contact us at support@pagerduty.com.

Posted Aug 29, 2017 - 15:13 UTC

Resolved
Our event processing service has recovered.
Posted Aug 24, 2017 - 05:09 UTC
Identified
Our engineers have identified the issue affecting processing of events. They have taken action and our systems are recovering.
Posted Aug 24, 2017 - 04:59 UTC
Investigating
We are currently experiencing an issue in event processing. Our engineers are aware and taking action to correct this.
Posted Aug 24, 2017 - 04:46 UTC