Issues in Alerts Displaying in Web UI
Incident Report for PagerDuty
Postmortem

Summary

On August 9, 2019 at 19:00 UTC, PagerDuty experienced a service degradation that impacted the ability to view Alert Details via the Alerts page and on the Incident Details page. During this time, information about alerts was either delayed or unavailable until the backlog had been fully processed.

The backfill was completed and functionality was fully restored on August 10, 2019 at 10:31 UTC.

What happened

After running a simulation in our test environment, we proceeded with a planned cluster upsizing in production, but we underestimated its performance impact. Once the upsizing process starts, it cannot be interrupted. During the operation, only half of the compute resources were available to service customer requests. A secondary cluster was not available to absorb the increased demand for CPU resources.

To remediate the issue, we pursued two options in parallel. We began spinning up a new, upsized cluster from a snapshot. This takes time, so we also investigated how we could reduce the impact of the ongoing upsize operation. We throttled the data transfer that the operation was performing and reduced the number of processes performing it. Neither of these changes was enough to bring the cluster to a healthy state, so we pivoted our efforts towards preparing the new cluster.

While our new cluster was being restored, we began to write incoming alert data into it so that we could redirect traffic to allow customers to see their most recent alerts. In the meantime, we continued the snapshot restore, and backfilled the remaining data.

What are we doing about this?

We’re implementing a new disaster recovery strategy for the service that powers the Alert Details and Incident Details pages. We’re also reevaluating our Elasticsearch cluster’s configuration and shard sizing. We’re still investigating why the backfill process proceeded much more slowly than expected.

We know that our customers rely on PagerDuty to provide up-to-date and accurate information. We apologize for this degradation, and we will do our best to make sure that this does not happen again. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Aug 16, 2019 - 20:47 UTC

Resolved
We have finished backfilling the rest of the alert details that were previously impacted. We are in full recovery.
Posted Aug 10, 2019 - 10:30 UTC
Update
As part of our remediation process, we have deployed a fix so all new incoming alert details are now being displayed in the alerts table in the Web UI. We are in the process of backfilling the rest of the alert details that were previously impacted.
Posted Aug 09, 2019 - 22:20 UTC
Update
We are continuing to process backlogged alert details that are used for displaying alerts in our Web UI. We will provide an update once this remediation action is complete.
Posted Aug 09, 2019 - 20:56 UTC
Identified
We have identified the issue and are currently performing remediation steps to restore the display of alerts in the alerts table in the Web UI. As a workaround, users can access alert information on the Incident Dashboard page. We will provide an update in the next hour.
Posted Aug 09, 2019 - 19:46 UTC
Investigating
We are currently experiencing an issue where alert information being displayed in the Web UI may not be up to date. We are currently investigating this issue. We can confirm that alerts are being ingested into our system, and this is restricted to a display issue.
Posted Aug 09, 2019 - 19:20 UTC