On January 18, 2023, between 15:45 UTC and 20:11 UTC, PagerDuty experienced degradation of the Service Directory and Visibility Console in the US Service Region.
During this time, customers in the US Service Region would have noticed a slower user experience as well as occasional errors when attempting to load the Service Directory. Customers would also have been unable to load the Visibility Console, or would have faced longer wait times when trying to view its content and dashboards.
Both the Visibility Console and the Service Directory depend on an underlying service containing Technical Service information. This service’s datastore experienced a partial node loss, and the cluster was unable to restore the impacted node automatically. As a result, more load was placed on the remaining nodes in the cluster, and the overall performance of the service degraded.
At 15:47 UTC, PagerDuty began its incident response process after detecting the increase in failures. At 16:28 UTC, attempts to replace the impacted node began. At 17:49 UTC, a parallel effort was initiated to spin up a new cluster and transition the service over to it. The initial effort to replace the impacted node completed successfully at 19:52 UTC, and by 20:11 UTC users were able to load the Service Directory and Visibility Console without issue. The new cluster finished syncing the latest data at 20:14 UTC and was ready to be cut over, but the cutover was not needed. The incident was closed at 20:28 UTC after responders verified that the Service Directory and Visibility Console continued to load and that service metrics were within normal bounds.
Following this incident, we conducted a thorough incident review, which identified the events that contributed to this failure. Our engineering teams have worked diligently to address these findings and ensure that we are protected from such incidents going forward. The corrective actions included the following:
We sincerely apologize for the interruptions to the Service Directory and Visibility Console that you or your teams experienced and for the impact they had. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry. If you have any questions, please reach out to firstname.lastname@example.org.