On March 5, 2020, between 19:00 and 19:52 UTC, we experienced an incident related to event processing.
During this time, events sent via the Events API were accepted but not immediately processed, resulting in delays of up to 44 minutes in incident creation and subsequent notifications. This affected all customers who sent events during this window. Incidents created via the REST API were not impacted.
During routine maintenance, two services in the event processing pipeline lost the ability to communicate with one another. As a result, events could not flow successfully through the pipeline. While we continued to accept new events, processing stopped downstream of the ingestion point: the network addresses of the downstream services had changed, and the upstream components were not aware of the change. Although the correct addresses were present in the upstream components' configuration files, the automated reload/restart step that was meant to pick up the configuration changes failed.
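The missing safeguard here is a post-reload verification step: after triggering a reload, compare the addresses in the configuration file against the addresses the running process is actually using, and fail loudly on any mismatch rather than assuming the reload took effect. The sketch below illustrates this idea; the function name and addresses are hypothetical and do not reflect the actual services or tooling involved.

```python
# Minimal sketch of post-reload verification (illustrative names/addresses,
# not the actual services involved in the incident).

def stale_addresses(config_addresses, runtime_addresses):
    """Return configured addresses the running process has not picked up.

    An empty set means the reload took effect; a non-empty set means the
    process is still operating on stale configuration.
    """
    return set(config_addresses) - set(runtime_addresses)

# A reload that silently failed: the config file points at the new
# downstream address, but the running process still holds the old one.
missing = stale_addresses(
    ["10.0.0.6:9000"],  # new downstream address in the config file
    ["10.0.0.4:9000"],  # address the running process is still using
)
if missing:
    print(f"reload did not take effect; stale addresses: {sorted(missing)}")
```

Had a check like this run as part of the automated reload step, the mismatch would have been surfaced immediately instead of manifesting as stalled event processing.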
Our engineers manually restarted the upstream components to pick up the configuration changes, and processing resumed. All events received during the incident were eventually processed.
The underlying configuration issue was fixed, tested, and deployed to production the following day.
During the incident, remediation was slowed by the need to page in another team to execute a command that the service-owning team did not have the privileges to run. Permissions changes will be made so that the service-owning team can take such actions themselves in the future. Work is also underway to reduce recovery time for similar incidents.
Finally, we will be implementing additional testing processes across engineering for similar configuration changes going forward.
We sincerely apologize for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.