On August 9, 2019 at 19:00 UTC, PagerDuty experienced a service degradation that affected the ability to view Alert Details on the Alerts page and on the Incident Details page. During this time, information about alerts was either delayed or unavailable until the backlog had been fully processed.
The backfill was completed and functionality was fully restored on August 10, 2019 at 10:31 UTC.
After running a simulation in our test environment, we proceeded with a planned cluster upsizing in production, but we underestimated the performance impact the operation would have there. Once the upsizing process starts, it cannot be interrupted. During the operation, only half of the compute resources were available to service customer requests, and no secondary cluster was available to absorb the increased demand for CPU resources.
To remediate the issue, we pursued two options in parallel. We began spinning up a new, upsized cluster from a snapshot. Because this takes time, we also investigated how we could reduce the impact of the ongoing upsize operation. We throttled the data transfer that the operation was performing and reduced the number of processes performing the transfer. Neither change was enough to bring the cluster back to a healthy state, so we pivoted our efforts towards preparing the new cluster.
While the new cluster was being restored, we began writing incoming alert data into it so that we could redirect traffic to it and let customers see their most recent alerts. In the meantime, we continued the snapshot restore and backfilled the remaining data.
We’re implementing a new disaster recovery strategy for the service that powers the Alert Details and Incident Details pages. We’re also reevaluating our ElasticSearch cluster’s configuration and shard sizing. We’re still investigating why the backfill process proceeded much more slowly than expected.
We know that our customers rely on PagerDuty to provide up-to-date and accurate information. We apologize for this degradation, and we will do our best to make sure that this does not happen again. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org.