On July 13th, 2020, between 17:02 UTC (10:02 PDT, 13:02 EDT) and 20:07 UTC (13:07 PDT, 16:07 EDT), PagerDuty experienced a major incident that rendered the User Search and listing page non-functional. Additionally, users were unable to add responders to incidents or re-assign incidents.
While we were updating some old Elasticsearch clusters, a misconfiguration was applied that meant the clusters had to be removed manually. While executing the manual delete, the engineer inadvertently removed the live production Elasticsearch cluster. The team was notified almost immediately via monitoring.
The Terraform scripts for the Elasticsearch clusters requiring removal had an old configuration that did not conform to current PagerDuty standards. When upgrading the Terraform configuration to the current standard, the team accidentally misconfigured a setting, and this misconfiguration prevented the clusters from being removed by script. This resulted in a request to another team to remove the clusters manually. The names of the old and new clusters were very similar, and the team mistakenly deleted the live cluster.
We created a new Elasticsearch cluster with a different name and successfully backfilled it as per our process. We re-pointed our service to this new cluster and the functionality was restored.
What are we doing about this?
We’ve identified areas for improvement that will help make incidents of this nature less likely and quicker to resolve in the future, such as:
Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to firstname.lastname@example.org.