On January 25th, 2022, between 21:39 and 23:03 UTC, PagerDuty experienced a SEV-1 outage in the US service region. Between 21:39 and 21:59 UTC, US customers may have experienced 5xx errors when accessing our Web, Mobile, and API platforms. Event ingestion was interrupted during this window, and 5% of notifications were sent more than 5 minutes after event ingestion. Some EU customers may have experienced issues attempting to authorize OAuth applications such as the PagerDuty Mobile apps; previously authorized applications continued to function as expected. Once the issue was identified (13 minutes after it began, 9 minutes after the major incident was called), the change was reverted and signs of recovery appeared immediately. Event ingestion and notification delivery resumed, although some events were dropped due to rate limiting as systems worked through their backlog. The incident call continued from 22:00 to 23:03 UTC to triage lingering issues, at which point full recovery was declared.
In an effort to improve scalability and isolate a portion of our product for fault tolerance, an application was migrated to a separate, isolated infrastructure that required network peering with our production environment. On January 25th at 21:39 UTC a routing change was deployed to enable this peering which inadvertently redirected the majority of production network traffic to the wrong internal network.
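To illustrate how a single routing change can capture traffic it was never meant to touch, here is a minimal sketch using Python's standard `ipaddress` module. The CIDR ranges and route names are hypothetical, not PagerDuty's actual configuration; the point is only that a destination CIDR broader than intended will also match production subnets.

```python
import ipaddress

# Hypothetical ranges (assumptions for illustration only).
production_subnet = ipaddress.ip_network("10.1.0.0/16")  # assumed production range
isolated_subnet = ipaddress.ip_network("10.9.0.0/16")    # assumed new isolated range

# Intended route: only the isolated service's range goes over the peering link.
intended_route = ipaddress.ip_network("10.9.0.0/16")
# Misconfigured route: a broader CIDR that also covers production.
deployed_route = ipaddress.ip_network("10.0.0.0/8")

def route_matches(route: ipaddress.IPv4Network, subnet: ipaddress.IPv4Network) -> bool:
    """Return True if traffic destined for `subnet` would match `route`."""
    return subnet.subnet_of(route)

print(route_matches(intended_route, production_subnet))  # False: production unaffected
print(route_matches(deployed_route, production_subnet))  # True: production redirected
```

Under this kind of misconfiguration, production traffic destined for `10.1.0.0/16` would be sent over the peering link along with the intended service traffic.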
Once the issue was triaged and identified, the change was immediately reverted and the correct route restored. The public status page was updated 12 minutes after the outage was detected. Core PagerDuty functionality was restored immediately after the revert at 21:59 UTC, but full recovery was gradual as services caught up on backlogged requests.
At PagerDuty we strongly believe in a full-service ownership model that gives development teams autonomy. As part of this, we allow teams to unblock themselves by making controlled changes to other teams’ areas of ownership (also known as the “away team” model). This model extends to provisioning and managing portions of our infrastructure, such as network connectivity, for new and existing services. While we have leaned toward developer autonomy and speed to customer value as a baseline philosophy, we recognize that portions of our infrastructure, like core networking, need to be treated with heightened sensitivity, and we will no longer allow other teams to make changes to the module that caused this outage. We are strengthening controls on any modification to core networking functionality and configuration to ensure that changes to key infrastructure components are rigorously reviewed and tested by SMEs prior to deployment. We also found that, for this newly provisioned service, some of the defaults we established were not sufficient to prevent this change from having customer impact. We will be implementing safeguards to make this type of change safer to deploy whenever we create a new service.
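One form such a safeguard could take is an automated pre-deployment check that rejects any proposed route whose destination overlaps a protected production range, forcing SME review before the change can ship. The sketch below is a hypothetical example, again using Python's `ipaddress` module; the protected CIDRs and function names are assumptions, not a description of PagerDuty's actual tooling.

```python
import ipaddress

# Hypothetical list of production ranges that away-team changes must not touch.
PROTECTED_CIDRS = [ipaddress.ip_network("10.1.0.0/16")]

def validate_route(destination: str) -> None:
    """Reject a proposed route if it overlaps any protected production network."""
    dest = ipaddress.ip_network(destination)
    for protected in PROTECTED_CIDRS:
        if dest.overlaps(protected):
            raise ValueError(
                f"route {dest} overlaps protected network {protected}; "
                "requires SME review before deployment"
            )

validate_route("10.9.0.0/16")   # passes: isolated service range only
# validate_route("10.0.0.0/8")  # would raise: the broad CIDR covers production
```

A check like this runs in seconds in CI and turns a silent, high-blast-radius misconfiguration into a loud, reviewable failure.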
For any questions, comments, or concerns, please reach out to firstname.lastname@example.org.