Issues with Users Service
Incident Report for PagerDuty
Postmortem

Summary

On July 13th, 2020, between 17:02 UTC (10:02 PDT, 13:02 EDT) and 20:07 UTC (13:07 PDT, 16:07 EDT), PagerDuty experienced a major incident that rendered the User Search and listing page non-functional. Additionally, users were unable to add responders to incidents or reassign incidents.

What Happened

While the team was updating some old Elasticsearch clusters, a misconfiguration was applied that required the clusters to be removed manually. While executing the manual delete, the engineer inadvertently removed the live production Elasticsearch cluster. Monitoring notified the team almost immediately.

Contributing Factors

The Terraform scripts for the Elasticsearch clusters requiring removal had an old configuration that did not conform to current PagerDuty standards. When upgrading the Terraform configuration to the current standard, the team accidentally misconfigured a setting, and this misconfiguration prevented the clusters from being removed with scripts. This resulted in a request to another team to remove the clusters manually. The names of the old and new clusters were very similar, and the team mistakenly deleted the live cluster.
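One way to reduce the risk of a look-alike name being deleted by hand is to put a small guard in front of any manual delete. The following is a minimal sketch only, not PagerDuty's actual tooling: the `DeletionGuard` class, the registry of protected clusters, and the cluster names are all hypothetical and exist purely for illustration.

```python
from __future__ import annotations

import difflib
from dataclasses import dataclass, field


class DeletionRefused(Exception):
    """Raised when a manual delete request fails a safety check."""


@dataclass
class DeletionGuard:
    # Hypothetical registry of clusters currently serving production traffic.
    protected: set[str] = field(default_factory=set)

    def similar_protected(self, requested: str) -> list[str]:
        """Return protected names that look confusingly close to the request."""
        return difflib.get_close_matches(
            requested, sorted(self.protected), n=3, cutoff=0.8
        )

    def check(self, requested: str, typed_confirmation: str) -> str:
        """Return the name to delete, or raise if the request looks unsafe."""
        if requested in self.protected:
            raise DeletionRefused(f"{requested!r} is a live cluster")
        if typed_confirmation != requested:
            raise DeletionRefused("typed confirmation does not match the request")
        return requested


# Hypothetical names: an old cluster and a live cluster that differ by one character.
guard = DeletionGuard(protected={"users-search-v2"})
print(guard.similar_protected("users-search-v1"))
print(guard.check("users-search-v1", "users-search-v1"))
```

The similarity check does not block the delete by itself; in this sketch it is meant to surface a warning so that the operator sees which live clusters have names close to the one being removed.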

Resolution

We created a new Elasticsearch cluster with a different name and successfully backfilled it as per our process. We re-pointed our service to this new cluster, and functionality was restored.

What are we doing about this?

We’ve identified areas of improvement that will make incidents of this nature less likely and quicker to resolve in the future:

  1. Make sure that all our resources are up to date with Terraform configuration so that we do not require manual changes in the future.
  2. Make the User Search page, incident reassignment and adding responders to an incident more resilient in the event of another outage to this service.
  3. Investigate creating common infrastructure/configuration for Elasticsearch backups for standardization. 
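As an illustration of the resilience goal in item 2, a lookup path can degrade gracefully when its search backend is unavailable. This is a hedged sketch under stated assumptions, not a description of PagerDuty's design: the function `search_with_fallback`, the two backends, and the sample data are all hypothetical.

```python
from typing import Callable, List, Tuple

Lookup = Callable[[str], List[str]]


def search_with_fallback(
    query: str, primary: Lookup, fallback: Lookup
) -> Tuple[List[str], bool]:
    """Try the primary search backend; on a connection failure, use the fallback.

    Returns the results plus a flag saying whether the degraded path was used,
    so that callers can tell users the results may be incomplete.
    """
    try:
        return primary(query), False
    except (ConnectionError, TimeoutError):
        return fallback(query), True


# Hypothetical backends: a search cluster that is down, and a slower
# exact-prefix lookup against a small in-memory list of user names.
def search_cluster(query: str) -> List[str]:
    raise ConnectionError("search cluster unavailable")


USERS = ["alice", "alina", "bob"]


def prefix_lookup(query: str) -> List[str]:
    return [name for name in USERS if name.startswith(query)]


results, degraded = search_with_fallback("ali", search_cluster, prefix_lookup)
print(results, degraded)
```

The trade-off in this sketch is that the fallback offers weaker matching than a full search index, but actions that depend on finding a user can still proceed during an outage.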

Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Aug 13, 2020 - 22:24 UTC

Resolved
We are fully recovered. The functionality of the Users page, searching for users, requesting individual users as responders, and reassigning incidents to users has all returned to normal.
Posted Jul 13, 2020 - 20:06 UTC
Monitoring
We have deployed a fix and are seeing signs of recovery.
Posted Jul 13, 2020 - 19:48 UTC
Update
We are still pursuing several options for remediation of the issue with accessing the Users page and adding responders.
Posted Jul 13, 2020 - 19:27 UTC
Identified
We have identified the issue and are currently remediating. We expect full restoration of service shortly.
Posted Jul 13, 2020 - 18:05 UTC
Investigating
Our users page and users search are currently down. Individual users cannot be added to responder requests. We are currently investigating a fix.
Posted Jul 13, 2020 - 17:54 UTC
This incident affected: REST API and Web Application.