On June 8th, between 6:00 PM and 9:00 PM UTC, PagerDuty experienced an incident with Google Single Sign-On (SSO) in both the US and EU regions. For much of this window, customers who use Google SSO to sign in to PagerDuty were unable to do so. We determined that the incident was a direct result of a code upgrade we had just deployed. We immediately began rolling back the change, and by the time the rollback completed, we had fully recovered from the incident.
As part of our ongoing efforts to keep our code base up to date, we have been upgrading parts of our system that use older technology. On the day of the incident, we shipped an upgrade that, unbeknownst to us, affected how Google SSO worked. We started receiving customer reports of being unable to sign in to PagerDuty using Google SSO and established that our recent change had caused the issue. Once we reverted the code in question, we verified that signing in to PagerDuty using Google SSO worked again.
Following the incident, our teams conducted a thorough investigation into the contributing factors and identified several action items to ensure incidents like this one don't happen in the future. The action items include the following:
We apologize for our failure and the impact on you and your teams. As always, we stand by our commitment to providing the industry's most reliable and resilient platform. If you have any questions, please reach out to email@example.com.