Degraded website performance.
Incident Report for PagerDuty
Postmortem

On May 18th from 17:30 UTC until 23:30 UTC, we experienced increased error rates and degraded performance for Service Directory pages and services depending on Service Search functionality.

A spike in indexing requests to the search cluster powering the Service Directory caused the service degradation, which manifested as slow requests and elevated error rates. One factor discovered during the incident that contributed to the increased indexing load was a mass resolution of old incidents.

Because multiple services were exhibiting issues, it took some time to isolate the problem.

To remediate the problem, engineers pushed out changes to handle incident updates in larger batches. Engineers also enacted some load-shedding to try to alleviate load on the impacted data store. Lastly, a larger cluster was rolled out that is better able to handle this higher-than-normal traffic. These changes were completed and pushed to production by 23:30 UTC, at which point the issue was resolved.
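
As a rough illustration of the batching idea described above, the following Python sketch buffers per-incident index updates and sends them to the search cluster as one bulk request instead of one request per update. It is not PagerDuty's implementation: the class name `BatchingIndexer`, the `send_bulk` callback, and the `max_batch` default are all hypothetical, and collapsing repeated updates to the same incident is an added assumption rather than something stated in this report.

```python
from typing import Callable, Dict, List


class BatchingIndexer:
    """Buffers incident updates and flushes them as a single bulk request.

    Hypothetical sketch: `send_bulk` stands in for whatever client call
    submits a list of documents to the search cluster.
    """

    def __init__(self, send_bulk: Callable[[List[dict]], None], max_batch: int = 500):
        self._send_bulk = send_bulk
        self._max_batch = max_batch
        # Keyed by incident ID so repeated updates to one incident collapse
        # into a single pending document (an assumption, not from the report).
        self._pending: Dict[str, dict] = {}

    def add(self, incident_id: str, doc: dict) -> None:
        """Queue an update; flush automatically once the batch is full."""
        self._pending[incident_id] = doc
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Send everything queued so far as one bulk request."""
        if not self._pending:
            return
        batch = list(self._pending.values())
        self._pending.clear()
        self._send_bulk(batch)


if __name__ == "__main__":
    batches: List[List[dict]] = []
    indexer = BatchingIndexer(batches.append, max_batch=3)
    for n in range(7):
        indexer.add(f"INC{n}", {"id": f"INC{n}", "status": "resolved"})
    indexer.flush()  # send the final partial batch
    print([len(b) for b in batches])
```

With a scheme like this, a mass resolution of old incidents would reach the search cluster as a small number of bulk requests rather than one indexing request per incident. In practice a time-based flush would also be needed so that a partially filled batch is not held indefinitely.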

There will be follow-up work to reduce internal load on the service that powers the Service Directory so that it can better handle increased incident update rates and general Service Directory traffic.

We are testing our recovery process and adding more monitoring around these services.

We would like to express our sincere regret for this incident.

Posted Jun 10, 2021 - 21:15 UTC

Resolved
This incident has been resolved.
Posted May 18, 2021 - 23:27 UTC
Update
We are seeing a return to normal behavior in services page performance.
Posted May 18, 2021 - 23:26 UTC
Update
We are still investigating possible fixes for the slow load times for the services page.
Posted May 18, 2021 - 22:10 UTC
Update
We are still investigating possible fixes for the slow load times for the services page.
Posted May 18, 2021 - 21:01 UTC
Update
The services page is loading slowly. We are continuing to investigate potential fixes. Event-to-notification processing is operating as expected.
Posted May 18, 2021 - 19:36 UTC
Update
We are continuing to investigate potential fixes.
Posted May 18, 2021 - 18:36 UTC
Update
We are continuing to investigate potential fixes.
Posted May 18, 2021 - 18:06 UTC
Investigating
We are experiencing slow load times for Service Directory pages.
Posted May 18, 2021 - 17:41 UTC
This incident affected: Web Application.