From 20:36 UTC to 22:26 UTC on October 18, 2017, one of PagerDuty’s datacenters suffered a network degradation. This resulted in delayed notifications for 20% of customers, as well as 5xx errors from our events endpoint for approximately 15% of customers.
At 20:36 UTC, our monitoring systems detected an increase in TCP retransmission rates, as well as network interruptions for both ingress and egress traffic, in the affected datacenter. Approximately 10 minutes later, we initiated our incident response process.
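Monitoring like this typically works by sampling kernel TCP counters and alerting when the retransmission rate climbs. As a minimal sketch (not PagerDuty's actual monitoring), the rate can be derived from the `OutSegs` and `RetransSegs` counters exposed in `/proc/net/snmp` on Linux; the parsing helpers below are illustrative:

```python
def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines of /proc/net/snmp-style text into a dict.

    The first Tcp: line holds counter names, the second holds values.
    """
    tcp_lines = [l for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    headers = tcp_lines[0].split()[1:]
    values = [int(v) for v in tcp_lines[1].split()[1:]]
    return dict(zip(headers, values))


def retransmission_rate(prev, curr):
    """Fraction of TCP segments retransmitted between two counter samples."""
    sent = curr["OutSegs"] - prev["OutSegs"]
    retrans = curr["RetransSegs"] - prev["RetransSegs"]
    return retrans / sent if sent else 0.0
```

In practice an agent would take a sample on each scrape interval (for example, by reading `/proc/net/snmp` directly) and alert when the rate between consecutive samples exceeds a chosen threshold.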
At 21:15 UTC, our internal networking metrics showed that there were no longer any active networking problems in the affected datacenter. With the network recovered, our backlog of incoming events and outgoing notifications started to drain as well.
At 22:26 UTC, the remaining backlog, which affected 20% of our customers, was fully processed and all systems were operational.
During the networking event, we found that some of our production systems, as well as some of our internal tooling, need improvement to be resilient against this type of network degradation. While we do test network degradation on a per-host basis, we had not been testing it at the datacenter level. For our events endpoint, we will be implementing changes so that the loss of an entire datacenter does not prevent requests from reaching our other datacenters. For the internal tooling that prevented us from taking certain recovery actions during the networking event, we will be investing time to run these tools in either a multi-datacenter or failover model.
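One common way to keep an ingestion endpoint reachable through a datacenter loss is client-side failover across per-datacenter endpoints. The sketch below illustrates that general pattern only; the endpoint names and the injected `send` callable are hypothetical, not PagerDuty's actual events API:

```python
class AllEndpointsFailed(Exception):
    """Raised when every datacenter endpoint rejected or dropped the request."""


def send_with_failover(payload, endpoints, send):
    """Try each datacenter endpoint in order; return the first success.

    `send(endpoint, payload)` should return an HTTP status code, or raise
    OSError on a connection-level failure. A 5xx response or connection
    error triggers failover to the next endpoint instead of being
    surfaced to the caller.
    """
    failures = []
    for endpoint in endpoints:
        try:
            status = send(endpoint, payload)
        except OSError as exc:  # connection refused, timeout, etc.
            failures.append((endpoint, exc))
            continue
        if 500 <= status < 600:  # server-side error: fail over
            failures.append((endpoint, status))
            continue
        return endpoint, status
    raise AllEndpointsFailed(failures)
```

With this shape, a degraded datacenter returning 5xx (as happened here) would cost the client one extra round trip rather than a failed event submission.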
During the recovery period, we also discovered a bottleneck on one of our databases. We will be looking into vertically scaling this database to ensure that we have adequate capacity to respond as quickly as possible when backlogs build up upstream.
We sincerely apologize for any inconvenience this caused. Please contact us at firstname.lastname@example.org if you have any questions.