On December 11th, 2017, from 19:10 UTC to 21:45 UTC, we experienced a degradation in our ability to process events. As a result, incident creation, and thus notifications, were delayed for some customers. A subset of the delayed events was also dropped.
We received a number of events that caused heavy concurrent resource contention in one of our downstream services. As a result, our event processing pipeline began halting periodically in short bursts.
Once we had identified the problem, we isolated these events from the processing pipeline to relieve the pressure of the event backlog. Unfortunately, a number of other events that were unrelated to the cause of the issue were placed in the same isolation. When we decided to drop the isolated events, the benign events were dropped along with the problematic ones.
We will be adding more flexibility to our ability to isolate and fail only the problematic events. In the longer term, we will also be working on replacing our downstream services and removing the bottleneck that lowered our processing throughput.
We would like to reiterate our regret for the service interruption. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org