On May 18th from 17:30 UTC until 23:30 UTC, we experienced increased error rates and degraded performance for Service Directory pages and services depending on Service Search functionality.
A spike in indexing requests to the search cluster powering the Service Directory caused the degradation, which manifested as slow requests and elevated error rates. One factor discovered during the incident that contributed to the increased indexing load was a mass resolution of old incidents.
Because multiple services were exhibiting issues at the same time, it took some time to isolate the underlying problem.
To remediate the problem, engineers pushed out changes to handle incident updates in larger batches. Engineers also enacted some load-shedding to try to alleviate load on the impacted data store. Lastly, a larger cluster was rolled out that is better able to handle this higher-than-normal traffic. These changes were completed and pushed to production by 23:30 UTC, at which point the issue was resolved.
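As a rough illustration of why larger batches reduce indexing load, the sketch below groups individual incident updates into bulk requests. It is not the code that was deployed: the function names, the `send_bulk` callback, the update shape, and the batch size of 500 are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List


def batched(updates: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size updates, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[dict] = []
    for update in updates:
        batch.append(update)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch


def index_in_batches(
    updates: Iterable[dict],
    send_bulk: Callable[[List[dict]], None],  # hypothetical bulk-index client call
    batch_size: int = 500,                    # assumed value, not from the incident
) -> int:
    """Send updates to the search cluster in bulk; return the request count."""
    requests = 0
    for batch in batched(updates, batch_size):
        send_bulk(batch)
        requests += 1
    return requests


if __name__ == "__main__":
    sent: List[List[dict]] = []
    resolved = ({"incident_id": i, "status": "resolved"} for i in range(1050))
    count = index_in_batches(resolved, sent.append, batch_size=500)
    print(count, [len(b) for b in sent])
```

Under these assumptions, a mass resolution of 1,050 incidents becomes 3 indexing requests instead of 1,050, which is the kind of reduction that larger batches are meant to provide.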
There will be follow-up work to reduce internal load on the service that powers the Service Directory so that it can better handle increased incident update rates and general Service Directory traffic.
We are also testing our recovery process and adding more monitoring around the Service Directory and the search cluster that powers it.
We would like to express our sincere regret for this incident.