On August 5, beginning at 12:02 UTC, PagerDuty experienced a degradation across multiple services that lasted 38 minutes and caused delays in incident creation, notifications, and incident detail display.
A relatively new piece of our infrastructure, responsible for internal communication in some of our data processing pipelines, experienced a networking regression shortly after it was upgraded. The regression caused the service to disconnect from its peers, which halted replication and message forwarding.
Once the problem had been identified, PagerDuty engineers manually restarted the service, and the backlog of messages that had accumulated upstream in the data pipeline was then processed successfully.
To recover more quickly from similar incidents, we have automated the process of restarting the service and put monitoring in place for the condition that led to this incident. We have also confirmed that the networking regression is fixed in a newer version of the datastore, and will be upgrading the affected system.
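As an illustration of the remediation described above, the automated recovery can be thought of as a watchdog loop: probe the service's connection state, and trigger a restart when the probe fails. The sketch below is hypothetical; `check` and `restart` stand in for whatever health probe and restart hook an environment actually provides, and are not PagerDuty's internal tooling.

```python
def watchdog(check, restart, max_restarts=3):
    """Restart a service while its health check fails.

    `check` is a callable returning True when the service is healthy
    (e.g. its replication link is connected); `restart` performs one
    restart attempt. Capped at `max_restarts` so a persistent fault
    escalates to a human instead of restart-looping forever.
    Returns the number of restarts performed.
    """
    restarts = 0
    while not check() and restarts < max_restarts:
        restart()
        restarts += 1
    return restarts
```

In practice a loop like this would run on a schedule (or be driven by the same monitoring that alerts on the disconnect condition), so the restart happens within the check interval rather than after manual diagnosis.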