Event Ingestion Issue
Incident Report for PagerDuty
Postmortem

Summary

On March 5, 2020, between 19:00 UTC and 19:52 UTC, we experienced an incident related to event processing.

During this time, events sent via the Events API were accepted but not immediately processed, which delayed incident creation and subsequent notifications by up to 44 minutes. This affected all customers who sent events during this window. Incidents created using the REST API were not affected.

What happened

During routine maintenance, two services in the event processing pipeline lost the ability to communicate with one another, so events could not flow through the pipeline. The network addresses of the downstream services had changed, and the upstream components did not recognize the change. We continued to accept new events, but processing stopped downstream from the ingestion point. The correct addresses were present in the upstream components’ configuration files, but the automated reload/restart step that was meant to pick up the configuration changes failed.

Our engineers forced a manual restart of the upstream components to pick up the configuration changes and processing resumed. All events that were received during the incident were eventually processed.
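
The failure mode described above, where the configuration on disk is correct but the running process never picks it up, can be illustrated with a minimal sketch. It is not PagerDuty's actual tooling. It assumes a hypothetical service that can report a SHA-256 digest of the configuration it has loaded, and the names config_digest, reload_is_stale and reload_with_fallback are invented for this example. The idea is to verify a reload instead of trusting that it succeeded, and to fall back to a restart when the loaded configuration is still stale.

```python
import hashlib
from typing import Callable


def config_digest(text: str) -> str:
    """Return a SHA-256 hex digest of a configuration file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reload_is_stale(on_disk_config: str, loaded_digest: str) -> bool:
    """True when the running process has not loaded the on-disk config."""
    return config_digest(on_disk_config) != loaded_digest


def reload_with_fallback(
    on_disk_config: str,
    get_loaded_digest: Callable[[], str],  # hypothetical: asks the service what it loaded
    reload: Callable[[], None],            # hypothetical: automated reload step
    restart: Callable[[], None],           # hypothetical: full restart
) -> str:
    """Reload, verify the result, and restart if the reload did not take effect.

    Returns "reloaded" or "restarted" depending on which step succeeded.
    """
    reload()
    if not reload_is_stale(on_disk_config, get_loaded_digest()):
        return "reloaded"

    # The reload reported nothing, but the process still runs the old config.
    restart()
    if not reload_is_stale(on_disk_config, get_loaded_digest()):
        return "restarted"

    raise RuntimeError("service is still running a stale configuration")
```

A check of this kind turns a silent reload failure into an explicit signal that can trigger a restart or an alert, rather than leaving it to surface later as stalled processing.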

What are we doing about this?

The underlying configuration issue was fixed, tested and deployed successfully to production the following day.

During the incident, remediation was slowed by the need to page in another team to execute a command that the service-owning team did not have the privileges to run. We will change permissions so that the service-owning team can take such actions themselves in the future. We are also working to improve recovery time for similar incidents.

Finally, we will implement additional testing processes across engineering for similar configuration changes.

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Mar 13, 2020 - 21:15 UTC

Resolved
Event processing is now restored. We have fully recovered.
Posted Mar 05, 2020 - 19:51 UTC
Monitoring
We are starting to see signs of recovery, and will be continuing our mitigation actions to fully restore event processing.
Posted Mar 05, 2020 - 19:39 UTC
Identified
We're currently experiencing an issue with event processing, meaning that notifications may be delayed. Our Engineering team is actively implementing a mitigation plan to resolve the issue.
Posted Mar 05, 2020 - 19:33 UTC
This incident affected: Events API.