On August 24, 2017, from 04:11 to 05:10 UTC, our inbound event processing service suffered a performance degradation. During this time, some received events took abnormally long (between 5 and 30 minutes) to translate into actions such as triggering and resolving incidents.
At around 04:00 UTC, in a component of the service responsible for data storage, hosts began running out of memory while processing a significant influx of very large events. This was due to an issue in the JSON parsing library used in this part of our infrastructure.
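One common safeguard against this class of failure is to bound event size before handing the payload to the JSON decoder. The sketch below is purely illustrative and is not our production code; the `MAX_EVENT_BYTES` limit and `decode_event` helper are hypothetical names chosen for this example.

```python
import json

# Hypothetical limit; real services would tune this to observed event sizes.
MAX_EVENT_BYTES = 512 * 1024


def decode_event(raw: bytes) -> dict:
    """Reject oversized payloads before they reach the JSON parser,
    so a burst of very large events cannot exhaust host memory."""
    if len(raw) > MAX_EVENT_BYTES:
        raise ValueError(f"event too large: {len(raw)} bytes")
    return json.loads(raw)
```

A guard like this turns an unbounded memory problem into a fast, explicit rejection that upstream components can retry or report.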
Ordinarily, new hosts would have been provisioned automatically to compensate. However, the image used for provisioning hosts had been corrupted due to an accidental upload earlier that day, and so hosts could not be re-provisioned. This caused the service to operate at reduced capacity.
After investigation by the engineering team, the storage service was reverted to an earlier stable version, and PagerDuty engineers worked to bring the pipeline back to full speed. By 05:10 UTC, our event processing service had fully recovered.
We have evaluated and implemented a replacement for the JSON library that used a significant amount of memory when decoding larger events. We are also working to put additional checks in place around host image deployment and scaling of the affected data storage service. Additionally, we are working on methods to recover faster from errors in this part of our event processing service.
We sincerely apologize for the inconvenience this may have caused. If you have questions or concerns, please contact us at firstname.lastname@example.org.