During a period beginning at 20:05 UTC on Tuesday, September 19 and ending at 02:30 UTC on Thursday, September 21, PagerDuty experienced an issue that caused delays in the delivery of notifications and webhooks.
The issue was reported as two separate events in our communications because at one point we had successfully mitigated its impact, and it then recurred later. This report covers both periods, which stem from the same cause.
During this time, the length of delays fluctuated, peaking at over one hour for some customers, yet periodically returning to within normal delivery times. PagerDuty’s web and mobile apps and REST APIs remained functional and available throughout.
Degraded performance of one of our Cassandra database clusters caused delays outside tolerance limits to the delivery of notifications and the dispatching of webhooks. The degradation in performance was triggered during the replacement of a failed virtual machine in the cluster. This maintenance was unplanned, as the failure of the host was unexpected.
The procedure used to replace the failed node triggered a chain reaction of load on other nodes in the cluster, which hampered this cluster’s ability to do its primary job of processing notifications.
Specifically, the node was moved to a new location in the Cassandra token ring after being added to the cluster, rather than being inserted into its desired location initially. As a result, it initiated streaming operations with many other nodes in the cluster, which consumed significant computing resources on it and the other nodes. We now believe that the move operation greatly increased the number of nodes involved over the number that would have been involved, had the node been inserted directly into the desired location.
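To illustrate why a move can involve more nodes than a direct insert, here is a toy model of a token ring. It is a simplified sketch, not Cassandra's actual placement or streaming logic: the ring size, node names, token positions, replication factor of 3, and the idea of counting a node as "touched" when the set of keys it stores changes are all assumptions made for illustration.

```python
from bisect import bisect_left

RING_SIZE = 160  # hypothetical token space: keys are 0..159
RF = 3           # assumed replication factor


def replicas(ring, key, rf=RF):
    """Nodes holding `key`: the first node whose token is >= key
    (wrapping around the ring), then the next rf-1 nodes clockwise."""
    tokens = sorted(ring)
    start = bisect_left(tokens, key) % len(tokens)
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(rf)]


def held_keys(ring):
    """Map each node to the set of keys it stores as a replica."""
    held = {node: set() for node in ring.values()}
    for key in range(RING_SIZE):
        for node in replicas(ring, key):
            held[node].add(key)
    return held


def nodes_touched(before, after):
    """Nodes whose stored key set differs between two ring layouts."""
    b, a = held_keys(before), held_keys(after)
    return {n for n in set(b) | set(a) if b.get(n, set()) != a.get(n, set())}


# Seven surviving nodes (token -> name); the failed node used to sit at token 60.
base = {0: "n0", 20: "n1", 40: "n2", 80: "n4", 100: "n5", 120: "n6", 140: "n7"}

# Option A: insert the replacement directly at the desired token.
direct = nodes_touched(base, {**base, 60: "new"})

# Option B: add the replacement elsewhere (token 150), then move it to 60.
added = {**base, 150: "new"}
moved = {**base, 60: "new"}
two_step = nodes_touched(base, added) | nodes_touched(added, moved)

print("direct insert touches:", sorted(direct))
print("add then move touches:", sorted(two_step))
```

In this toy layout the direct insert changes what four nodes store, while the add-then-move path changes what seven nodes store, because the neighbors of both the temporary and the final location have to exchange data.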
Work was carried out in parallel by two groups of engineers. One group performed emergency measures to attempt to restore the performance of our systems, while a second group prepared and tested a new, scaled-up cluster to replace the degraded cluster.
Our initial reaction to notification times falling out of line with our target was to attempt to heal the Cassandra cluster and restore performance.
The working hypothesis during the issue, supported by data in our dashboards, was that the impacted nodes and the new nodes were synchronizing their data. This operation consumed resources on the affected nodes, degrading performance. Until that operation completed, it would continue to have a negative impact on the performance of the cluster. As such, mitigation strategies focused on expediting the synchronization operation.
The team increased the bandwidth allocated to streaming data and reduced the amount allowed for compaction tasks. We also attempted to stop compaction on the affected hosts altogether so that they could devote more resources to streaming activities.
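As a rough sketch of what adjustments of this kind can look like, the snippet below builds the corresponding `nodetool` invocations. The report does not say which commands or values were actually used, so the helper names (`throttle_commands`, `apply`) and the numbers are hypothetical; the `nodetool` subcommands themselves are standard Cassandra tooling, but units and behavior should be checked against the version in use.

```python
import subprocess


def throttle_commands(stream_mbit_s, compaction_mb_s, stop_compaction=False):
    """Build nodetool argv lists that favor streaming over compaction.

    stream_mbit_s:   streaming throughput cap (megabits per second)
    compaction_mb_s: compaction throughput cap (megabytes per second)
    """
    cmds = [
        ["nodetool", "setstreamthroughput", str(stream_mbit_s)],
        ["nodetool", "setcompactionthroughput", str(compaction_mb_s)],
    ]
    if stop_compaction:
        # Prevent new automatic compactions, then stop any running ones.
        cmds.append(["nodetool", "disableautocompaction"])
        cmds.append(["nodetool", "stop", "COMPACTION"])
    return cmds


def apply(cmds, dry_run=True):
    """Print the commands, or run them on the local node if dry_run is False."""
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Hypothetical values: raise the streaming cap, squeeze compaction.
apply(throttle_commands(400, 8, stop_compaction=True))
```

The dry-run default is deliberate: on a cluster that is already struggling, it is safer to review the exact commands before running them node by node.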
We made attempts to isolate the hosts from production traffic so that they could devote their computing power to completing the streaming operations. Specifically, we tried to make sure the struggling nodes were not also acting as coordinators for the cluster.
During the first few hours, we saw what we believed to be positive changes in the health of the cluster. The delivery of notifications resumed flowing within our targets and the metrics on the cluster were trending positively. In hindsight, we probably discounted the role that reduced overall volume was playing, as it was getting late in the day across North America. Nevertheless, we were reluctant to make any additional changes that could re-destabilize the cluster while things were trending positively. We declared the issue resolved.
The next morning when the business day began again in North America, the increased volume again pushed our delivery times beyond our targets. The Cassandra cluster had still not returned to normal load levels despite the previous trend and still exhibited high load on three nodes. We then resumed efforts to heal the cluster.
Throughout this process, load and other important metrics fluctuated. Each change we made required a cycle of applying the change, waiting while we watched for the intended effects, determining if the effects were indeed occurring, and re-evaluating our options. Occasionally a node would return to normal load levels while another node would degrade at the same time. This pattern was repeated a number of times during the course of our remediation efforts, often coinciding with changes we had made.
Another theory emerged, based on metrics in our dashboards, that the nodes were experiencing an above-normal number of garbage collections in the Java Virtual Machine. This would likely indicate memory pressure was too high, leading to thrash as the JVM continually tried to preserve heap size. We restarted one of the problem nodes with increased heap size. This involves the node rejoining the cluster after restarting, which is itself a time-consuming operation. The time spent in this change/restart/measure cycle was a significant factor in how long the issue took to resolve.
After a few rounds of increasing heap sizes and further tweaking of bandwidth settings, the cluster finally seemed to settle down. The long-running synchronization had finally completed. Requests were being serviced within normal time frames. Load was still above normal on a few nodes though streaming activity appeared to have stopped.
At nearly the same time, the new cluster (see below) was deemed ‘ready for use’, so we had to decide whether to close out the response and remain on the old, problematic cluster, or make the potentially risky call to flip to the new one. At that point, the situation had been ongoing for over 24 hours with only a slight break overnight.
While the primary team worked to restore performance as detailed above, a second team built a new cluster on significantly scaled up hardware.
Since the cluster in question deals with only in-flight notification data and does not act as a long-term data store, it was decided it would be feasible to ‘flip’ over to a new cluster all at once as we wouldn’t need to migrate a large dataset as part of the operation. Additionally, we could use larger hosts and better configuration settings - changes that could not have been easily made to the running cluster without potentially triggering another issue like this or lengthening this one.
Building the new cluster required us to provision 18 new hosts across our 3 data centers, install the required software and place the desired configurations on them. We tested the cluster in our staging environment, including performing ‘practice’ flips in that staging environment, to make sure the process would work. This was an iterative process and required a few rounds of testing and re-testing before we were confident.
Following that, we performed additional checks including comparing the schemas of the two clusters and going line-by-line through the configurations to ensure everything was configured correctly.
Once those checks were completed we declared the new cluster ready.
By approximately 23:30 UTC on Wednesday, September 20, our emergency measures, combined with time, had returned notifications to flowing normally, though we continued to monitor the situation carefully.
A number of configuration changes had been made to the cluster. Some changes were applied across the cluster when it was safe to do so, but others (in particular the heap size changes) were applied only to the troubled nodes. To continue with the existing cluster would have meant normalizing the configuration across the cluster, restarting processes and, in short, risking further destabilization.
Cautiously optimistic, we were not prepared to declare that things were back to normal, only to return to a degraded state later. Given all of that, our team felt that a permanent cutover to the new cluster provided a much better long-term outlook for the stability of the system and decided to proceed with that operation.
The cutover to the new cluster occurred successfully at approximately 02:20 UTC Thursday, September 21 and PagerDuty has been operating normally since that time.
To perform the cutover, we stopped the upstream processes that fed the service associated with this Cassandra cluster. This allowed the process to “drain” its in-flight notifications, while new events were enqueued in the upstream component.
The cutover procedure we used caused an approximately nine-minute delay in notifications while the pipeline was stopped, but it enabled all notifications to be delivered. We are not aware of any in-flight notifications that were dropped.
Once we had flipped the entire fleet of application hosts to the new cluster, we restarted the processing in the upstream component and notifications began to flow normally again.
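The pause, drain, flip, and resume sequence described above can be sketched as a small in-memory simulation. This is only an illustration of the ordering, assuming a single upstream queue and a single service; the `Pipeline` class, its method names, and the cluster labels are hypothetical and do not describe PagerDuty's actual components.

```python
from collections import deque


class Pipeline:
    """Toy notification pipeline backed by one named cluster at a time."""

    def __init__(self, cluster):
        self.cluster = cluster     # name of the active backing cluster
        self.upstream = deque()    # events buffered ahead of the service
        self.in_flight = deque()   # notifications inside the service
        self.paused = False
        self.delivered = []        # (event, cluster that handled it)

    def submit(self, event):
        """Accept a new event; it waits upstream while the feed is paused."""
        self.upstream.append(event)
        if not self.paused:
            self._pump()

    def _pump(self):
        """Hand every buffered upstream event to the service."""
        while self.upstream:
            self.in_flight.append(self.upstream.popleft())

    def pause(self):
        """Stop feeding the service; new events keep queueing upstream."""
        self.paused = True

    def drain(self):
        """Deliver everything currently in flight via the active cluster."""
        while self.in_flight:
            self.delivered.append((self.in_flight.popleft(), self.cluster))

    def switch(self, new_cluster):
        """Flip to the new cluster; only safe once nothing is in flight."""
        assert not self.in_flight, "drain before switching clusters"
        self.cluster = new_cluster

    def resume(self):
        """Restart the upstream feed and release the backlog."""
        self.paused = False
        self._pump()


p = Pipeline("old-cluster")
p.submit("alert-1")
p.submit("alert-2")
p.pause()              # stop the upstream feed
p.submit("alert-3")    # arrives mid-cutover and is held upstream
p.drain()              # in-flight work finishes on the old cluster
p.switch("new-cluster")
p.resume()
p.drain()              # the backlog is delivered via the new cluster
print(p.delivered)
```

Events that arrive during the pause are delayed rather than lost, which matches the trade-off described above: a short delivery delay in exchange for not dropping in-flight notifications.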
Our testing and preparation paid off and the flip to the new cluster worked perfectly out of the gate.
Our new Cassandra cluster has been operating normally since we undertook the efforts described above. It is configured so manual management of the Cassandra token ring is not necessary. This means the events that triggered this service degradation are no longer possible.
As a follow-up, we’ve begun reviewing our Cassandra maintenance procedures, as well as the configuration of our other Cassandra clusters to apply all of the lessons learned across our organization and technology stack.
The new cluster was built with the intention that it permanently replace the previous, troubled cluster, and it has now done so. It is operating on much larger hardware instances (four times the CPU cores and memory of the previous cluster), which reduces the possibility that it would enter a highly loaded state similar to the one that triggered the past issues.
This matter has been closed since 03:30 UTC Thursday, Sept 21.
We also did an analysis of prior postmortems involving our Cassandra clusters and related issues - we wanted to understand if any systemic issues needed to be addressed within our engineering organization or technology stack.
Based on that analysis, there will be additional work. In this postmortem, however, we’re leaving out discussion of larger-scope technical roadmap items because we want to take the time to be deliberate about any changes.
We understand how important and critical our platform is for our customers. We apologize for the impact this may have had on you and your teams. As always, we stand by our commitment to providing the most reliable and resilient platform in the industry.
September 18, 2017 - 16:13: PagerDuty engineers are notified of a failing host and provision a replacement.
September 19, 2017 - 18:58: The on-call engineer completes the final cleanup tasks in the run-book, including moving the new node into the location in the Cassandra token ring previously occupied by the failing node.
September 19, 2017 - 19:30: Degraded performance of the Cassandra cluster as a result of continued high load on three of the 16 nodes begins to impact the timely delivery of notifications. Work begins on mitigation of the effects.
September 20, 2017 - 07:03: After observing normal delivery times and the apparent return to health of the Cassandra cluster, we end response efforts and notify customers that service has returned to normal.
September 20, 2017 - 11:37: Notification delays again cross internal thresholds and customers are notified of delays in notification processing. Mitigation efforts resume.
September 20, 2017 - 13:53: Second response team begins provisioning a new, larger Cassandra cluster.
September 20, 2017 - 21:00: Notifications are being delivered (via the old Cassandra cluster) within our expected targets but the response team remains concerned about the health of the cluster. Observation and monitoring continue.
September 21, 2017 - 02:00: Verification and further testing of the new cluster and cutover plan are completed and the final decision is made to proceed with the cutover.
September 21, 2017 - 02:21: The cutover is initiated.
September 21, 2017 - 02:30: The cutover is completed.
September 21, 2017 - 02:30 to 03:30: Our teams continue to monitor the health of our systems.
September 21, 2017 - 03:30: We wind down the response and notify our customers that our service has resumed normal operations.