On November 15th, 2022, from 06:55 UTC to 15:05 UTC, the service powering the Visibility Console encountered frequent timeout errors from one of its dependencies. As a result, customers in the US region were either unable to load the Visibility Console or experienced extremely long load times when trying to view its content and dashboards.
The Visibility Console is powered by a service that itself depends on other PagerDuty services. One of these indirect dependencies was running on a shared node where a separate service was consuming large amounts of both CPU and network bandwidth. Instances of other services running on this shared node also saw elevated response times from their own internal dependencies.
The slow responses from the Visibility Console’s indirect dependency caused the service powering the Visibility Console to respond slowly in turn, which made the Visibility Console unresponsive for users.
At 13:10 UTC, PagerDuty began an incident response process after receiving customer reports of the Visibility Console not loading. During the same time range in which the Visibility Console was unable to load, an update to our mobile app increased load on the Visibility Console’s indirect dependency. An earlier incident caused by a library update in another service also increased memory pressure on our shared infrastructure. These simultaneous increases in pressure on our infrastructure delayed our ability to identify and remediate the cause of the Visibility Console’s issues.
At 14:38 UTC, after examining the metrics of the services involved, responders suspected that the slow responses were the result of an infrastructure issue. They restarted the indirect dependency at 14:50 UTC and the direct dependency at 15:04 UTC. Restarting both services resolved the incident, and users were able to load the Visibility Console beginning at 15:05 UTC. The incident was closed at 15:32 UTC after responders verified that the Visibility Console continued to load and that the service metrics were within normal bounds.
Following this incident, we conducted a thorough post-mortem investigation, which identified the events that contributed to this failure. Our engineering teams have worked diligently to address these findings and ensure that we are protected from such incidents going forward. The corrective actions included the following:
We sincerely apologize for the interruptions to the Visibility Console that you or your teams experienced and for the impact they had. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to firstname.lastname@example.org.