On February 27th, beginning at 4:24 AM UTC, the Events API's data processing pipeline slowed down due to an issue in one of its component microservices. Our engineers worked to identify the cause, and at 6:20 AM UTC took final measures to prevent the issue from recurring. The event processing pipeline fully recovered at 6:27 AM UTC.
During the incident, as much as 8.0% of notifications were significantly delayed because incident triggering was itself delayed, and approximately 0.5% of the events ingested during this period were dropped as a side effect of the necessary remedial actions. The incident affected only service-level integrations that use the Events API; other integrations, such as email-based and Global Event Routing integrations, were unaffected.
The microservice in our event processing pipeline responsible for evaluating service-level event rules experienced high CPU load. This was caused by a bug that manifested when processing events in a certain unexpected format. Our engineers worked to keep the pipeline flowing while searching for the problematic events; during the initial investigation, a fraction of the traffic was shunted to an offline queue for later reprocessing.
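The shunting approach described above follows a common dead-letter-queue pattern: events that cannot be evaluated safely are diverted to a side queue rather than allowed to stall the main pipeline. The sketch below illustrates that pattern only; all names (`evaluate_rules`, `routing_key`, the in-memory queue) are hypothetical stand-ins, not our actual implementation.

```python
from collections import deque
from typing import Optional

# Deferred events awaiting later reprocessing (illustrative; a real
# pipeline would use a durable queue, not an in-process deque).
offline_queue = deque()

def evaluate_rules(event: dict) -> str:
    """Stand-in for the service-level rule evaluation step.

    Raises ValueError for events in an unexpected format, standing in
    for the bug that caused high CPU load on malformed input.
    """
    if "routing_key" not in event:
        raise ValueError("unexpected event format")
    return f"routed:{event['routing_key']}"

def process(event: dict) -> Optional[str]:
    """Process one event; shunt unprocessable events to the offline queue."""
    try:
        return evaluate_rules(event)
    except ValueError:
        # Defer the event instead of blocking the pipeline on it.
        offline_queue.append(event)
        return None
```

The key design choice is that the main path never waits on a problematic event: it is set aside immediately, keeping latency bounded for well-formed traffic.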
After pipeline ingestion was believed to have been isolated from the problematic events, our engineers attempted to re-process the deferred events. However, the presence of additional problematic events in the offline queue resulted in a resurgence of the original issue. At the time, internal tooling was incapable of filtering out the problematic events, and the decision was made to drop the remaining deferred events to prevent further impact to the service.
As part of our ongoing commitment to improving the operational resilience of our Events API, we are decommissioning the system affected by this issue and replacing it with our newer, more general-purpose event automation rule service. The newer service has been extended to perform the same processing operations as the older rule evaluation service, in addition to global event rules and routing. Customers should not experience any functional difference, and our processing pipeline as a whole will no longer be susceptible to this issue.
Additionally, we are making improvements to our operational tooling to allow more fine-grained control over the reprocessing of events from the offline queue, as well as giving us the ability to more quickly identify the scope of customer impact should it again be deemed necessary to drop any events in this queue.
We understand how critical our platform is for our customers, and we stand by our commitment to providing the most reliable and resilient platform in the industry. We regret the impact this incident may have had on you and your organization.