Issue with Listing Services
Incident Report for PagerDuty
Postmortem

Summary

On January 18, 2023, between 15:45 UTC and 20:11 UTC, PagerDuty experienced degradation of the Service Directory and Visibility Console in the US Service Region.

During this time, customers in the US Service Region would have noticed a slower user experience and occasional errors when loading the Service Directory. Customers would also have been unable to load the Visibility Console, or would have waited longer than usual for its dashboard content to appear.

What Happened

Both the Visibility Console and the Service Directory depend on an underlying service that holds Technical Service information. This service's datastore lost a node and was unable to restore it automatically. As a result, more load fell on the remaining nodes in the cluster, and the service's overall performance degraded.

At 15:47 UTC, PagerDuty began its incident response process after detecting the increase in failures. At 16:28 UTC, attempts to replace the impacted node began. At 17:49 UTC, a parallel effort was initiated to spin up a new cluster and transition the service over to it. That cluster finished syncing the latest data at 20:14 UTC and was ready for cutover, but it was not needed because the initial effort to replace the impacted node had completed successfully at 19:52 UTC. At 20:11 UTC, users were able to load the Service Directory and Visibility Console without issue. The incident was closed at 20:28 UTC after responders verified that the Service Directory and Visibility Console continued to load and that the service metrics were within normal bounds.

What We Are Doing About This

Following this incident, we conducted a thorough incident review, which identified the events that contributed to this failure. Our engineering teams have worked diligently to address these findings and ensure that we are protected from such incidents going forward. The corrective actions included the following:

  • We have scaled up the datastore so that a future node failure has less impact on the remaining nodes and on the service.
  • We are scheduling load tests against our datastore to confirm that it can withstand increased load under similar circumstances.
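
To illustrate why scaling up the datastore reduces the impact of a node failure, the following is a minimal sketch of the headroom arithmetic, assuming load is spread evenly across healthy nodes. The function names, the 80% utilization ceiling, and all numbers are hypothetical and are not taken from PagerDuty's systems.

```python
def per_node_utilization(total_load: float, node_capacity: float, healthy_nodes: int) -> float:
    """Fraction of its capacity each healthy node uses, assuming an even spread of load."""
    if healthy_nodes <= 0:
        raise ValueError("at least one healthy node is required")
    return total_load / (node_capacity * healthy_nodes)


def survives_node_loss(
    total_load: float,
    node_capacity: float,
    nodes: int,
    lost: int = 1,
    max_utilization: float = 0.8,  # hypothetical safety ceiling
) -> bool:
    """True if the remaining nodes stay at or below the ceiling after `lost` nodes fail."""
    remaining = nodes - lost
    if remaining <= 0:
        return False
    return per_node_utilization(total_load, node_capacity, remaining) <= max_utilization


# Hypothetical cluster: 2000 requests/s total, each node handles 1000 requests/s.
small_cluster_ok = survives_node_loss(2000, 1000, nodes=3)  # 2 nodes left, 100% utilized
large_cluster_ok = survives_node_loss(2000, 1000, nodes=5)  # 4 nodes left, 50% utilized
```

Under these assumed numbers, a three-node cluster is pushed to full utilization by a single node loss, while a five-node cluster keeps comfortable headroom. A load test of the kind described above would check the same property against real traffic rather than an even-spread model.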

We sincerely apologize for the interruptions to the Service Directory and Visibility Console that you or your teams experienced, and for the impact they had. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Jan 30, 2023 - 23:24 UTC

Resolved
We have resolved an incident where PagerDuty customers in the US service region experienced issues with listing services on their account. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Jan 18, 2023 - 20:27 UTC
Monitoring
We are monitoring improvement in an incident with listing services. We have deployed a fix, and we expect systems to continue to improve. We will provide an update within 20 minutes.
Posted Jan 18, 2023 - 20:18 UTC
Update
We are investigating an incident where PagerDuty customers in the US service region are experiencing issues with listing services on their account. Impacted customers may see slowness or timeouts when listing services. We will provide further updates within 20 minutes.
Posted Jan 18, 2023 - 19:52 UTC
Update
We are investigating an incident where PagerDuty customers in the US service region are experiencing issues with listing services on their account. Impacted customers may see slowness or timeouts when listing services. We will provide further updates within 20 minutes.
Posted Jan 18, 2023 - 19:31 UTC
Update
We are investigating an incident where PagerDuty customers in the US service region are experiencing issues with listing services on their account. Impacted customers may see slowness or timeouts when listing services. We will provide further updates within 20 minutes.
Posted Jan 18, 2023 - 19:11 UTC
Update
We are investigating an incident where PagerDuty customers in the US service region are experiencing issues with listing services on their account. Impacted customers may see slowness or timeouts when listing services. We will provide further updates within 20 minutes.
Posted Jan 18, 2023 - 18:47 UTC
Update
We are continuing to investigate an incident where PagerDuty customers in the US service region are experiencing issues with listing services on their account. We will provide further updates within 20 minutes.
Posted Jan 18, 2023 - 18:27 UTC
Identified
We are investigating an incident where PagerDuty customers in the US service region are experiencing issues with listing services on their account. Impacted customers may see slowness or timeouts when listing services. We will provide further updates within 20 minutes.
Posted Jan 18, 2023 - 18:10 UTC
This incident affected: Services (Services (US)).