On September 13th at 7:40 UTC, we experienced an issue with Notification Delivery and Event Ingestion. Approximately 10% of notifications were delayed and our REST API intermittently returned 5XX error responses for 2% of requests.
PagerDuty historically ran on a cluster of three Galera master databases. That topology was a good architectural choice at lower traffic volumes, but as our traffic grew significantly over the past few years, it became clear that we would need to move away from it for PagerDuty to scale. Earlier this year, we started working on a migration plan for our databases.
As part of the migration, we planned to route all traffic to a single master database before the final migration step. Our early benchmarks indicated that the existing master database had the necessary capacity and would be able to handle the extra workload.
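We can't share our internal benchmark suite here, but a capacity check of this kind boils down to measuring query latency and throughput under a representative workload. The sketch below is a minimal, hypothetical harness: it uses Python's built-in sqlite3 as a stand-in for a real master database, and the `benchmark_queries` function and `incidents` table are illustrations, not our production tooling or schema.

```python
import sqlite3
import time

def benchmark_queries(conn, query, params_list):
    """Run each query once and report throughput plus latency percentiles."""
    latencies = []
    for params in params_list:
        start = time.perf_counter()
        conn.execute(query, params).fetchall()
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
    latencies.sort()
    return {
        "qps": len(latencies) / (sum(latencies) / 1000.0),
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[int(len(latencies) * 0.95)],
    }

# Stand-in workload: an in-memory table instead of a production master.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, service TEXT)")
conn.executemany("INSERT INTO incidents (service) VALUES (?)",
                 [("svc-%d" % (i % 10),) for i in range(1000)])

stats = benchmark_queries(
    conn,
    "SELECT COUNT(*) FROM incidents WHERE service = ?",
    [("svc-3",)] * 500,
)
print(stats)
```

A real capacity test would replay production-like traffic against the actual database and compare tail latencies, not just averages, against headroom targets.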
Our DBAs began the final preparation step before the move to the new database topology. Shortly after it completed, we observed a sharp increase in CPU utilization and database response times on the master database. We initially attributed these anomalies to database warm-up, but when they took longer than usual to normalize, we proactively started our internal incident response to mitigate the issue.
We reverted the migration procedure to restore capacity, and isolated the cause of the issue: transport encryption work was being assigned to a single CPU core, which created a bottleneck.
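This kind of bottleneck has a distinctive signature in per-core CPU statistics: one core pinned near saturation while the others stay mostly idle, even though aggregate CPU utilization looks moderate. As an illustration (not our actual tooling), the sketch below parses two Linux /proc/stat snapshots and flags that pattern; the function names and thresholds are hypothetical.

```python
def parse_proc_stat(text):
    """Extract per-core (busy, total) jiffy counters from /proc/stat text."""
    cores = {}
    for line in text.splitlines():
        # Per-core lines are "cpu0", "cpu1", ...; skip the aggregate "cpu " line.
        if line.startswith("cpu") and not line.startswith("cpu "):
            name, *fields = line.split()
            vals = [int(v) for v in fields]
            idle = vals[3] + vals[4]  # idle + iowait columns
            cores[name] = (sum(vals) - idle, sum(vals))
    return cores

def per_core_utilization(before, after):
    """Busy fraction per core between two /proc/stat snapshots."""
    util = {}
    for name in before:
        busy = after[name][0] - before[name][0]
        total = after[name][1] - before[name][1]
        util[name] = busy / total if total else 0.0
    return util

def hot_single_core(util, hot=0.9, cool=0.5):
    """True when exactly one core is saturated while the rest are mostly idle,
    the signature of single-threaded work (e.g. TLS pinned to one core)."""
    hot_cores = [u for u in util.values() if u >= hot]
    return len(hot_cores) == 1 and all(u >= hot or u <= cool
                                       for u in util.values())

# Synthetic snapshots: cpu0 saturated between samples, cpu1 nearly idle.
before = "cpu0 100 0 100 800 0 0 0 0 0 0\ncpu1 100 0 100 800 0 0 0 0 0 0\n"
after  = "cpu0 600 0 500 850 0 0 0 0 0 0\ncpu1 120 0 110 950 0 0 0 0 0 0\n"
util = per_core_utilization(parse_proc_stat(before), parse_proc_stat(after))
print(util, hot_single_core(util))
```

In practice, tools like mpstat expose the same per-core view; the lesson is to alert on per-core saturation rather than only on host-level CPU averages, which can mask a single-threaded hot spot.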
We are taking several steps to prevent this from recurring. We will replace the transport encryption subsystem on our database servers with a more efficient one; we are currently running benchmarks and consulting with internal teams to identify the best path forward.
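The core of such a comparison is a throughput microbenchmark of the candidate implementations under identical payloads. The sketch below is purely illustrative: it uses stdlib zlib compression at two levels as stand-ins for the current and candidate encryption subsystems, since a real harness would call the actual TLS/cipher implementations instead.

```python
import time
import zlib

def throughput_mb_s(fn, payload, seconds=0.2):
    """Apply fn to payload repeatedly for a fixed window; report MB/s processed."""
    processed = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        fn(payload)
        processed += len(payload)
    return processed / seconds / 1e6

payload = b"x" * 65536
# zlib levels stand in for two encryption subsystems under comparison;
# these are hypothetical placeholders, not the implementations we evaluated.
current = throughput_mb_s(lambda d: zlib.compress(d, 9), payload)
candidate = throughput_mb_s(lambda d: zlib.compress(d, 1), payload)
print("current: %.1f MB/s, candidate: %.1f MB/s" % (current, candidate))
```

A production benchmark would also measure per-core CPU cost at a fixed throughput, since the original bottleneck was a single saturated core rather than total capacity.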
In parallel, our developers are working to eliminate unnecessary network throughput and load on our Galera masters as an additional risk-mitigation measure.