Between February 9th, 2018 06:44 UTC and February 10th, 2018 04:49 UTC, PagerDuty experienced two separate, but related, service degradations in the delivery of all types of notifications (incident notifications, on-call handoff notifications, responder requests and subscriber updates).
The first incident lasted for approximately three hours. The second incident did not occur until 12 hours after the first one; during the time between the first and second incident, there were no evident delays in notification delivery. The second incident lasted for approximately one hour. The rest of the PagerDuty platform was unaffected, and was functional throughout the duration of the timeline mentioned above.
On February 8th, 2018, PagerDuty initiated a procedure for upgrading the Cassandra cluster for the backend service responsible for delivering notifications to our customers. This action was part of a larger program to move all of our Cassandra clusters onto a current version. Prior to undertaking this upgrade we had successfully upgraded a number of our other Cassandra clusters supporting different parts of our platform. Just under half of our deployed fleet of Cassandra nodes had completed the maintenance without customer impact or downtime. All upgrades followed the same procedure. The procedure necessitated a two-stage process, starting with an upgrade of the entire cluster to Cassandra 2.0 and then following up with an upgrade to Cassandra 2.2. On the cluster in question, and prior to the incident, the first stage had completed successfully. The team continued to monitor the health of the cluster and after a period of observation without incident, the second stage was scheduled for the next day.
Prior to the second stage upgrade, we observed rising CPU usage across the Cassandra cluster. As the usage increased, both reads and writes to the cluster began to slow down, eventually causing delays in notification processing and delivery. On February 9th, 2018 at 06:44 UTC, the team immediately began incident response to triage, assess and reduce the service degradation. We deployed load-shedding techniques to reduce the pressure on the cluster and free up resources for notification delivery. Eventually the cluster was able to keep up, and we had fully resolved the incident by 11:30 UTC.
Although the incident had been resolved, and we were delivering notifications to our customers as expected, the Cassandra cluster was still experiencing high CPU load. We engaged with our Cassandra consultants, as well as with our internal database specialist. Based on their advice and evidence gathered in prior upgrades to our other Cassandra clusters, we believed the problem was caused by something specific to Cassandra 2.0 and that moving to Cassandra 2.2 would return the load to normal. With that in mind, the team decided to move forward with the upgrades.
On February 10th at 01:00 UTC, the team commenced the second stage of the Cassandra upgrade procedure, upgrading from 2.0 to 2.2. We started by upgrading one node completely and running the internal data migration step before upgrading the remainder of the cluster. The software upgrade took a matter of minutes; however, upgrading the internal data structures took longer than estimated. The decision made at that time was to let the migration step continue overnight and to have the team resume the cluster upgrade the next morning.
Before wrapping things up for the night, a decision was made to re-enable On-Call Handoff notifications which had been disabled during the first incident to reduce load on the cluster in question. Although the cluster was still experiencing higher than normal load, the team believed that the application could handle the additional traffic. This turned out to be incorrect and on February 10th, 2018 at 03:46 UTC, the sudden increase in load caused pressure on the Cassandra cluster, and was responsible for the second instance of service degradation in the delivery of notifications. We had fully resolved this incident by 04:49 UTC.
Many remediation steps were discussed during the first incident. There was some evidence that the AWS US-West-2 region experienced a networking blip around the beginning of the incident, and we believed it may have been a contributing factor. Once it had subsided, we restarted nodes in the cluster to see if that would bring them back to normal load levels. The application responsible for notification delivery will normally retry sending failed notifications within a 2-hour window; the team investigated whether reducing that window to 15 minutes would reduce load on the cluster as well. The team also employed a number of load-shedding techniques from our Cassandra runbooks on the nodes experiencing the highest load.
After enabling the delivery of On-Call Handoff notifications before the second incident, and noticing the huge spike in traffic, the team immediately disabled them again. This prevented the delivery of new On-Call Handoff notifications, but the backlog of 7,000 notifications that had accumulated between enabling and disabling them was still queued to be sent out. We decided against load shedding (i.e. dropping those 7,000 On-Call Handoff notifications) and concluded that the system should be able to burn through them on its own.
Throughout the first incident, the incident response team tried various maneuvers to reduce the CPU usage of the Cassandra cluster and resolve the service degradation around notification delivery. Many of these involved changing application-level and Cassandra-level configs based on previous experiences. We also disabled the delivery of On-Call Handoff notifications to decrease the number of in-flight notifications.
As noted before, our notification delivery services will attempt to resend a failed notification for up to 2 hours. Around 8:44 UTC on Feb 9th (during the first incident), these services started shedding load themselves, as they stopped retrying failing notifications older than 2 hours. At this point we noticed that the cluster began to show signs of recovery. It was able to churn through the smaller backlog much more easily than before. We were in full recovery (delivering notifications as quickly as before) within the next 30 minutes.
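The retry cut-off described above can be illustrated with a minimal sketch. It is not PagerDuty's implementation: the names `RETRY_WINDOW`, `should_retry` and `prune_backlog` are hypothetical, and the only detail taken from the text is the 2-hour limit after which a failing notification is no longer retried.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 2-hour limit comes from the text; everything else is illustrative.
RETRY_WINDOW = timedelta(hours=2)


def should_retry(created_at, now, window=RETRY_WINDOW):
    """Return True while a failed notification is still within its retry window."""
    return now - created_at <= window


def prune_backlog(backlog, now, window=RETRY_WINDOW):
    """Split a backlog of (created_at, payload) pairs into (retryable, expired)."""
    retryable, expired = [], []
    for created_at, payload in backlog:
        if should_retry(created_at, now, window):
            retryable.append((created_at, payload))
        else:
            expired.append((created_at, payload))
    return retryable, expired


# Hypothetical backlog observed at 08:44 UTC on February 9th, 2018.
now = datetime(2018, 2, 9, 8, 44, tzinfo=timezone.utc)
backlog = [
    (datetime(2018, 2, 9, 6, 40, tzinfo=timezone.utc), "older than 2 hours"),
    (datetime(2018, 2, 9, 8, 0, tzinfo=timezone.utc), "still retryable"),
]
retryable, expired = prune_backlog(backlog, now)
```

Once notifications begin to age out of the window, each retry pass has fewer items to write back to the datastore, which is consistent with the smaller backlog described above.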
As highlighted in the previous section, the resolution of the second incident involved disabling On-Call Handoff notifications once more. This prevented the delivery of new On-Call Handoff notifications, so that the services dedicated to notification delivery could focus on processing incident notifications. We let the service process its backlog, which took approximately 40 minutes.
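As a rough illustration of this kind of per-type switch, the sketch below rejects new notifications of a disabled type while letting the already-queued backlog drain. The class `NotificationQueue`, its methods and the type names are hypothetical and are not taken from PagerDuty's services.

```python
from collections import deque


class NotificationQueue:
    """Toy queue with a per-type kill switch (hypothetical names throughout)."""

    def __init__(self):
        self.disabled_types = set()
        self.pending = deque()

    def disable(self, kind):
        self.disabled_types.add(kind)

    def enable(self, kind):
        self.disabled_types.discard(kind)

    def enqueue(self, kind, payload):
        """Accept a notification unless its type is currently disabled."""
        if kind in self.disabled_types:
            return False
        self.pending.append((kind, payload))
        return True

    def drain(self):
        """Process everything already queued, including previously accepted items."""
        processed = []
        while self.pending:
            processed.append(self.pending.popleft())
        return processed


queue = NotificationQueue()
queue.enqueue("on_call_handoff", "shift starts at 09:00")  # accepted before the switch
queue.disable("on_call_handoff")
accepted_handoff = queue.enqueue("on_call_handoff", "shift ends at 17:00")
accepted_incident = queue.enqueue("incident", "database CPU high")
processed = queue.drain()
```

In this sketch, disabling a type only stops new work of that type from entering the queue; items accepted earlier are still delivered, which mirrors the choice to let the existing backlog be processed rather than dropped.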
After resolving the second incident, the team continued with updating the cluster from 2.0 to 2.2 outside of business hours. As expected, the first upgraded Cassandra node showed great signs of improvement on all levels. After the rest of the cluster was upgraded, load was significantly reduced across the cluster. This phase of the upgrade was completed without incident. Following the upgrade, the team also vertically scaled the cluster to provide further CPU usage improvements, as well as headroom for future work.
We understand how important and critical our platform is for our customers. We apologize for the impact this incident had on you and your teams. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. For any questions, comments, or concerns, please reach out to email@example.com.
All times noted below are in UTC, unless stated otherwise
February 8th, 2018
20:15: The team began upgrading the Cassandra cluster to 2.0
20:55: The software upgrade was applied to all 8 nodes in the cluster. The cluster was now running Cassandra 2.0
22:00: Rolling internal data migration of each node (one at a time) had commenced
February 9th, 2018
03:26: Cluster-wide completion of the internal data migration
06:44: Kicked off incident response regarding the service degradation of notification delivery
07:23: Rolling restart of the upgraded Cassandra cluster, starting with all nodes in Microsoft Azure, followed by the US-West-1 and US-West-2 regions
08:00: Noticing metrics starting to trend in the right direction. First signs of slight recovery; however, delivery of notifications is still degraded. Team continues to apply load-shedding techniques while ensuring notifications remain queued for delivery
08:10: Responders make a decision to stop delivering On-Call Handoff notifications to customers.
09:30: Delays are still ongoing, but have decreased to approximately 10 minutes (i.e. if an incident was triggered, worst case you would receive your notification 10 minutes afterwards).
10:03: We have (almost) recovered from the delays in the delivery of notifications. On-Call Handoff notifications are still disabled. The team continues to monitor the system to make sure we have recovered completely.
11:37: The incident has been marked as resolved
February 10th, 2018
01:00: The team reconvenes and continues to upgrade the Cassandra cluster from 2.0 to 2.2, starting with one node out of the 8-node cluster.
01:30: The upgraded node is now on 2.2, and the cluster looks healthy. We continue with upgrading the SSTables on that same node.
03:32: Team realizes that upgrading the SSTables will take a very long time (longer than estimated). Responders and DBA believe it best to leave the upgrade running overnight. Before leaving, On-Call Handoff notifications were re-enabled as there was consensus that the cluster could deal with the added load.
03:46: Team realizes that the Cassandra cluster, in fact, could not handle the added load, as notification delivery becomes degraded again.
03:54: We disable On-Call Handoff notifications again, and let the applications consume through the added backlog.
04:16: The incident has been marked as resolved.
February 11th, 2018
17:11: Commenced upgrading the other 7 nodes in the cluster from 2.0 to 2.2
17:52: The software upgrade was applied to the rest of the cluster. All nodes are now on Cassandra 2.2. Load is significantly reduced across the cluster.
18:15: The team decided to re-enable On-Call Handoff notifications. The system handles the additional backlog without incident.
18:25: We begin the internal data migration to complete the upgrade.
February 12th, 2018