On June 2nd, 2020, between 17:17 UTC (13:17 EDT / 10:17 PDT) and 20:20 UTC (16:20 EDT / 13:20 PDT), PagerDuty experienced an incident that impacted event processing, notification processing, and status updates.
During this time, some events sent via the Events API were accepted but were not processed immediately, which resulted in delays for incident creation and subsequent notifications. 23.8% of notifications took more than 5 minutes to send, and 36.3% of incoming events were throttled. Web and mobile traffic also saw elevated error rates, peaking at 6% and averaging 0.08%.
Behind the scenes at PagerDuty we’ve been working diligently to upgrade our MySQL database infrastructure in order to provide greater scalability and throughput to our customers. Since February, we’ve successfully upgraded dozens of our database clusters without issue.
On June 2nd, a change was introduced to upgrade the final primary database from the previous version of MySQL to an updated version. Once this change was completed, our monitoring systems indicated elevated response times and error rates for our event processing systems. While some response time differences were expected as the new database warmed up, the trend that we were seeing did not match any of the heuristics that we had gathered in our testing.
The engineering teams involved in the incident worked to stabilize the system with various remediation actions, such as rolling restarts of client applications/services and adjusting the capacity of the applications/services which depend on these databases. These actions produced short-lived improvements in these services, which led us to prematurely update our status page to say that the issue had been resolved. However, system performance had not returned to expected levels, and the degraded performance persisted.
After performing these resets and gathering enough diagnostic information, we made the decision to roll back the database change and return to a known good state using the previous version.
The rollback of the change immediately stabilized our event processing systems. Events and Notifications that were backlogged were processed within 30 minutes.
After troubleshooting the possible contributing factors for the degraded performance, we decided to roll back and performed a failover to the standby database, which was running the previous version of MySQL. Once this failover was completed, traffic to the database returned to normal. Events and notifications that were backlogged during the incident were re-enqueued while we monitored the overall health of the systems.
After investigating the incident further, we determined that there was a performance regression that was triggered under very specific conditions. To resolve this regression, we linked the new version of MySQL to a version of OpenSSL (1.1.1) that did not have the same performance regression. By load testing this new setup while mimicking the conditions encountered during the incident, we gained confidence that this solution is the correct one.
We’ve identified areas of improvement which will help us ensure that incidents of this nature are less likely, and can be more quickly resolved in the future, such as:
Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to firstname.lastname@example.org.