Notification Delay
Incident Report for PagerDuty
Postmortem

Summary

On September 13th at 7:40 PM UTC, we experienced an issue with Notification Delivery and Event Ingestion. Approximately 10% of notifications were delayed, and our REST API intermittently returned 5XX error responses for 2% of requests.

What Happened?

PagerDuty historically ran on a cluster of three Galera Master databases. This was a good architectural choice for small volumes of traffic, but as our traffic grew significantly over the past few years, it became clear that we would need to move away from that database topology to enable PagerDuty to scale. Earlier this year, we started working on a migration plan for our databases.

As part of the migration, we planned to move all traffic to a single Master Database before the final migration step. From our early benchmarks, we believed that the existing Master Database had the necessary capacity and would be able to handle the extra workload.

Our DBAs started the last preparation procedure necessary before migrating to our new database topology. Shortly after this procedure completed, we observed a sharp increase in CPU utilization and database response times on the Master Database. These anomalies were initially attributed to database warm-up. However, our team observed that they were taking longer than usual to normalize, so we proactively started our internal incident response to mitigate the issue.

We reverted the migration procedure to restore capacity, and isolated the cause of the issue to transport encryption work being assigned to a single CPU core, which created a bottleneck.
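
This kind of symptom is easy to miss because a host's overall CPU average can look healthy while one core is pinned. For illustration only, here is a minimal sketch (assuming the third-party psutil package; not the tooling used during this incident) of how per-core sampling makes that signature visible:

    # Minimal sketch: spot a single saturated core that an overall CPU
    # average would hide. Assumes the third-party `psutil` package; for
    # illustration only, not the tooling used during this incident.
    import psutil

    SATURATION_THRESHOLD = 90.0  # percent; arbitrary cut-off for this sketch

    def check_per_core_utilization(interval: float = 1.0) -> None:
        per_core = psutil.cpu_percent(interval=interval, percpu=True)
        overall = sum(per_core) / len(per_core)
        print(f"overall CPU: {overall:.1f}%")
        for core, pct in enumerate(per_core):
            if pct >= SATURATION_THRESHOLD:
                # One hot core alongside a modest overall average is the
                # signature of work (e.g. transport encryption) confined
                # to a single core.
                print(f"core {core} is saturated at {pct:.1f}%")

    if __name__ == "__main__":
        check_per_core_utilization()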

What Are We Doing About This?

We are taking a few steps to prevent this from recurring. We will swap the transport encryption subsystem on our database servers for a more efficient one. We are currently running benchmarks and consulting with internal teams to identify the best path forward.
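
For illustration only, the following minimal sketch shows the general shape of such a benchmark: comparing the encryption throughput of two AEAD ciphers. It assumes the third-party cryptography package, and the ciphers and payload size are arbitrary choices; this report does not name the subsystems actually under evaluation.

    # Minimal benchmark sketch: compare encryption throughput of two AEAD
    # ciphers. Assumes the third-party `cryptography` package; the ciphers
    # and payload size are arbitrary illustrative choices.
    import os
    import time

    from cryptography.hazmat.primitives.ciphers.aead import AESGCM, ChaCha20Poly1305

    PAYLOAD = os.urandom(64 * 1024)  # 64 KiB of random data per operation
    ITERATIONS = 2000

    def throughput_mb_s(cipher) -> float:
        # Reusing one nonce is unsafe for real traffic but harmless here,
        # since we only measure the cost of the encrypt operation.
        nonce = os.urandom(12)
        start = time.perf_counter()
        for _ in range(ITERATIONS):
            cipher.encrypt(nonce, PAYLOAD, None)
        elapsed = time.perf_counter() - start
        return (len(PAYLOAD) * ITERATIONS) / elapsed / 1e6

    if __name__ == "__main__":
        aes = AESGCM(AESGCM.generate_key(bit_length=256))
        chacha = ChaCha20Poly1305(ChaCha20Poly1305.generate_key())
        print(f"AES-256-GCM:       {throughput_mb_s(aes):8.1f} MB/s")
        print(f"ChaCha20-Poly1305: {throughput_mb_s(chacha):8.1f} MB/s")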

In parallel, as an additional risk-mitigation measure, our developers are working to eliminate unnecessary network throughput and load on our Galera Masters.

Posted Oct 11, 2018 - 20:52 UTC

Resolved
Our engineers have confirmed this issue has been resolved at this time.
Posted Sep 13, 2018 - 21:04 UTC
Monitoring
We have deployed a remediation and are monitoring the issue to ensure full recovery.
Posted Sep 13, 2018 - 20:51 UTC
Identified
We have identified a root cause and are currently working to remediate the issue.
Posted Sep 13, 2018 - 20:30 UTC
Investigating
We are currently experiencing small delays in notification delivery as well as intermittent errors from our API. Our engineering teams are actively investigating the root cause of the issue.
Posted Sep 13, 2018 - 20:23 UTC
This incident affected: Notification Delivery.