On Thursday, November 29, 2018, from 17:32 UTC until 18:56 UTC, PagerDuty experienced an 84-minute incident that caused errors for customers who use our Advanced Permissions feature. Advanced Permissions allows greater control over the level of access your users have in your PagerDuty account. During this incident, no data was exposed to users who should not have access to it, because no requests could be served. This impacted the PagerDuty Web Application, Mobile Apps, and REST API. Notification delivery and inbound integrations that used the Events API were unaffected.
Initially, an automated health check discovered that a single instance of the Permissions Service cluster was not properly serving traffic. To resolve this failed instance, a redeployment of the service was executed by the on-call engineer across all hosts of the cluster at the same time. After the redeployment, all of the instances of the cluster were unable to connect to their database. The connections were failing due to invalid credentials because the hosts were unable to retrieve their database credentials from our credential manager. This caused the Permissions Service to return errors for all requests.
Due to the issue affecting the Permissions Service, all services that require evaluating permissions to determine user access level began responding with error messages as well. For customers using Advanced Permissions, this applied to user-initiated operations performed through PagerDuty’s web-facing and mobile interfaces. However, acknowledging and resolving incidents via SMS and voice replies were not affected and remained functional.
The Permissions Service in question is set to be replaced in the short term with a rewritten version. Because of this, we deferred migrating the component to our preferred deployment and container-hosting stack, which is already in use for the majority of our production infrastructure. Along with other benefits unrelated to this incident, this stack automates the steps necessary to connect to our credential manager and enforces canary deployments, which halt a deployment across the cluster if the first host fails to deploy or fails health checks once deployed.
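To make the canary behavior described above concrete, here is a minimal sketch in Python. It is not PagerDuty's deployment tooling: the function name `rolling_deploy` and the `deploy` and `is_healthy` callbacks are hypothetical stand-ins for whatever a real stack provides. It only illustrates the idea that hosts are updated one at a time and the rollout stops at the first failure, so a bad release touches only the first host.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],      # hypothetical: returns True if the deploy step succeeded
    is_healthy: Callable[[str], bool],  # hypothetical: post-deploy health check
) -> Tuple[List[str], Optional[str]]:
    """Deploy host by host; stop at the first host that fails.

    Returns (hosts updated successfully, first failed host or None).
    The first host acts as the canary: if it fails to deploy or fails
    its health check, no other host is touched.
    """
    updated: List[str] = []
    for host in hosts:
        if not deploy(host) or not is_healthy(host):
            # Halt the rollout; the remaining hosts keep running the old release.
            return updated, host
        updated.append(host)
    return updated, None
```

Under this sketch, a release that cannot pass its health check is attempted on one host only, whereas redeploying to every host at the same time exposes the whole cluster to the same fault.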
Since the migration of this component to our modern stack was deferred, it retained something of a ‘special snowflake’ status within our infrastructure. Because of that status, this component was missed when necessary security certificates were updated for our secure credential store. While running, each host already had the database credentials in memory and was able to operate without issue until it was restarted as part of the deployment action taken to remediate the initial failure. Once the restart occurred, the credentials were purged from memory and, lacking correct certificates, the hosts were unable to retrieve them again from the secure store.
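The failure mode described above can be reproduced in a toy model. The sketch below is purely illustrative: `CredentialStore` and `Host` are hypothetical classes, and comparing certificate strings is a deliberate simplification of real certificate validation. It shows why a host with a stale certificate keeps working while it is running and only fails after a restart.

```python
class CredentialStore:
    """Toy secure store that releases the database password only to a trusted certificate."""

    def __init__(self, trusted_cert: str, db_password: str) -> None:
        self.trusted_cert = trusted_cert
        self._db_password = db_password

    def fetch(self, client_cert: str) -> str:
        if client_cert != self.trusted_cert:
            raise PermissionError("certificate not accepted by credential store")
        return self._db_password


class Host:
    """Toy service host that keeps the database password in memory only."""

    def __init__(self, client_cert: str) -> None:
        self.client_cert = client_cert
        self.db_password = None

    def start(self, store: CredentialStore) -> None:
        # A (re)start purges in-memory state, then fetches credentials again.
        self.db_password = None
        self.db_password = store.fetch(self.client_cert)

    def can_serve(self) -> bool:
        return self.db_password is not None


store = CredentialStore(trusted_cert="cert-v1", db_password="example-password")
host = Host(client_cert="cert-v1")
host.start(store)

# The store's certificates are updated, but this host is missed.
store.trusted_cert = "cert-v2"
still_serving = host.can_serve()  # the cached password is still in memory

try:
    host.start(store)  # the restart drops the cache and the fetch is rejected
except PermissionError:
    pass
serving_after_restart = host.can_serve()
```

In this model the certificate mismatch stays latent for as long as the host runs, which is consistent with the problem surfacing only when the redeployment restarted every host.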
Due to the nature of the failure, we were unable to roll back to a known-good state and restore service quickly. The remediation time was slightly extended because it was not immediately apparent to the responders that the problem was in the connection to the secure credential store, and initial focus was on the database user and grants themselves. Once we shifted focus to the credential store, additional time was spent understanding the nature of the connectivity issue.
We were ultimately able to provide valid credentials to the Permissions Service hosts and restore database connectivity, returning normal service to all users.
Since the incident, we have completed migration of the legacy Permissions Service to our modern deployment and hosting stack, thus removing its ‘special snowflake’ status. This automated and restored its connection to the secure credential store in a way that will prevent the same issue from happening again. The deployment process also enforces canary deployments so that a failure in one host will stop deployments to the others. We are also wrapping up final testing on the replacement service, which was architected to better support caching of permissions settings by its client applications so that failures in the Permissions component will not result in failure of the overall request. The new component will also include circuit breakers and fallbacks to provide access to a subset of Web and Mobile functionality sufficient for incident response even when full permissions data is unavailable.
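For readers unfamiliar with the circuit breaker and fallback pattern mentioned above, the following is a minimal Python sketch of the general technique, not the replacement service's actual design. The class name `CircuitBreaker`, its parameters, and the thresholds are all assumptions chosen for illustration. After a number of consecutive failures the breaker stops calling the failing dependency for a cool-down period and serves a fallback instead, then lets a trial call through once the period has elapsed.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Wrap a call to a dependency; serve a fallback while the dependency is failing."""

    def __init__(
        self,
        call: Callable[..., Any],
        fallback: Callable[..., Any],
        max_failures: int = 3,        # assumed threshold, for illustration only
        reset_after: float = 30.0,    # assumed cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def __call__(self, *args: Any) -> Any:
        # Open state: skip the dependency until the cool-down has elapsed.
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after:
            return self.fallback(*args)
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open) the breaker and restart the cool-down.
                self.opened_at = self.clock()
            return self.fallback(*args)
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a permissions lookup in such a breaker and supply a fallback that returns cached or reduced permissions, so that a request degrades to limited functionality instead of failing outright.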
We understand how important and critical our platform is for our customers. We apologize for the impact this incident had on you and your teams. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry.