Issue with visibility console
Incident Report for PagerDuty
Postmortem

Overview

On November 15th, 2022, from 06:55 UTC to 15:05 UTC, the service powering the Visibility Console encountered frequent timeout errors from one of its dependencies. As a result, customers in the US region were either unable to load the Visibility Console or experienced extremely long load times when trying to view its content or dashboard.

What Happened

The Visibility Console is powered by a service which itself depends on other PagerDuty services. One of these indirect dependencies was running on a shared node, where another separate service was consuming high amounts of both CPU and network bandwidth. Instances of other services running on this shared node also exhibited elevated response times from their own internal dependencies.

The slow responses from the Visibility Console’s indirect dependency caused the service powering the Visibility Console to respond slowly in turn, which made the Visibility Console unresponsive for users.

At 13:10 UTC, PagerDuty began an incident response process after receiving customer reports of the Visibility Console not loading. During the same time range in which the Visibility Console was unable to load, an update to our mobile app increased load on the Visibility Console’s indirect dependency. An earlier incident caused by a library update in another service also increased memory pressure on our shared infrastructure. These simultaneous increases in pressure on our infrastructure delayed our ability to identify and remediate the cause of the Visibility Console’s issues.

At 14:38 UTC, after examining the metrics of the services involved, responders suspected that the slow responses were the result of an infrastructure issue. They restarted the indirect dependency at 14:50 UTC and the direct dependency at 15:04 UTC. Restarting both services resolved the incident, and users were able to load the Visibility Console beginning at 15:05 UTC. The incident was closed at 15:32 UTC after responders verified that the Visibility Console continued to load and the service metrics were within normal bounds.

What We Are Doing About This

Following this incident, we conducted a thorough post-mortem investigation, which identified the events that contributed to this failure. Our engineering teams have worked diligently to address these findings and ensure that we are protected from such incidents going forward. The corrective actions included the following:

  • We have removed the node with anomalous network performance from our infrastructure.
  • We are scheduling tests with increased load on our underlying hosts, in order to identify other areas where a service is not well isolated from increased infrastructure load.
  • We are improving the monitoring for both services involved in the incident, so that we can detect these issues before they result in outages of the Visibility Console.
  • We are documenting that when a service is seeing slow response times from one of its dependencies, but that dependency’s response times to other callers are within expected bounds, this indicates an infrastructure issue, and rescheduling the services to other nodes is likely to be an effective remediation step.
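
As an illustration only, the diagnostic rule in the last item above could be sketched roughly as follows. This is not PagerDuty's tooling: the function name, the caller names, the latency samples, and the 500 ms threshold are all hypothetical, and the sketch assumes that per-caller latency samples for a dependency are already available.

```python
from statistics import median


def looks_like_infrastructure_issue(latencies_by_caller, affected_caller, threshold_ms):
    """Return True when only `affected_caller` sees slow responses.

    `latencies_by_caller` maps each caller of one dependency to a list of
    response times in milliseconds (hypothetical data shape).
    """
    affected_median = median(latencies_by_caller[affected_caller])

    # Median latency seen by every other caller that has samples.
    other_medians = [
        median(samples)
        for caller, samples in latencies_by_caller.items()
        if caller != affected_caller and samples
    ]
    if not other_medians:
        # Nothing to compare against, so the rule cannot be applied.
        return False

    affected_is_slow = affected_median > threshold_ms
    others_are_normal = all(m <= threshold_ms for m in other_medians)
    return affected_is_slow and others_are_normal


# Hypothetical samples: one caller is slow, the others are within bounds.
samples = {
    "console-backend": [2100, 2500, 1900],
    "mobile-api": [120, 140, 110],
    "reporting": [150, 130, 160],
}

if looks_like_infrastructure_issue(samples, "console-backend", threshold_ms=500):
    print("Suspect the node; consider rescheduling the services to other nodes.")
```

If every caller were slow, the function would return False, which points toward a problem in the dependency itself rather than in the node hosting one of its instances.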

We sincerely apologize for the interruptions to the Visibility Console that you or your teams experienced and for the impact they had. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Nov 22, 2022 - 22:18 UTC

Resolved
We have resolved an incident where all PagerDuty customers in the US service region experienced issues with the visibility console loading content slowly for the incidents and services modules. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Nov 15, 2022 - 15:28 UTC
Investigating
We are monitoring improvement in an incident where the visibility console is loading content slowly for the incidents and services modules in the US service region. We have deployed a fix, and we expect systems to continue to improve. We will provide further updates within 20 minutes.
Posted Nov 15, 2022 - 15:16 UTC
Update
We are continuing to investigate an incident where the visibility console is loading content slowly for the incidents and services modules in the US service region. We will provide further updates within 20 minutes.
Posted Nov 15, 2022 - 14:52 UTC
Update
We are continuing to investigate an incident where the visibility console is loading content slowly for the incidents and services modules in the US service region. We will provide further updates within 20 minutes.
Posted Nov 15, 2022 - 14:30 UTC
Update
We are continuing to investigate an incident where the visibility console is loading content slowly for the incidents and services modules in the US service region. We will provide further updates within 20 minutes.
Posted Nov 15, 2022 - 14:15 UTC
Update
We are continuing to investigate an incident where the visibility console is loading content slowly for the incidents and services modules in the US service region. We will provide further updates within 20 minutes.
Posted Nov 15, 2022 - 13:57 UTC
Identified
We are investigating an incident where the content of the visibility console in the US service region is loading slowly. Impacted customers may see delays in loading the content of the incidents and services modules. We will provide further updates within 20 minutes.
Posted Nov 15, 2022 - 13:42 UTC
Investigating
We are investigating potential issues with the visibility console not showing any content. On confirmation, we will update you with further impact and severity within 15 minutes.
Posted Nov 15, 2022 - 13:24 UTC
This incident affected: Web Application (Web Application (US)).