Events API degraded performance
Incident Report for PagerDuty
Postmortem

Summary

On June 2nd, 2020, between 17:17 UTC (13:17 EDT / 10:17 PDT) and 20:20 UTC (16:20 EDT / 13:20 PDT), PagerDuty experienced an incident that impacted event processing, notification processing, and status updates.

During this time, some events sent via the Events API were accepted but not processed immediately, which resulted in delays in incident creation and subsequent notifications. 23.8% of notifications took more than 5 minutes to send, and 36.3% of incoming events were throttled. Web and mobile traffic also saw elevated error rates, peaking at 6% with an average of 0.08%.

What Happened

Behind the scenes at PagerDuty, we’ve been working diligently to upgrade our MySQL database infrastructure to provide greater scalability and throughput to our customers. Since February, we’ve successfully upgraded dozens of our database clusters without issue.

On June 2nd, a change was introduced to upgrade the final primary database from the previous version of MySQL to an updated version. Once this change was completed, our monitoring systems indicated elevated response times and error rates for our event processing systems. While some response time differences were expected as the new database warmed up, the trend we observed did not match any of the heuristics we had gathered in our testing.

The engineering teams involved in the incident worked to stabilize the system with various remediation actions, such as rolling restarts of the client applications and services that depend on these databases and adjustments to their capacity. These actions produced short-lived improvements, which caused us to prematurely update our status page to say the issue had been resolved; however, system performance did not return to expected levels and the degraded performance persisted.

After performing these resets and gathering enough diagnostic information, we decided to roll back the database change and return to a known good state on the previous version.

Rolling back the change immediately stabilized our event processing systems. Backlogged events and notifications were processed within 30 minutes.

Contributing Factors

  • The new version of MySQL introduced a performance regression that becomes most visible when a large number (>~1,000) of clients are connected to an individual database host. We determined that this regression was caused by OpenSSL 1.0, which the new version of the database engine was linked against. The SSL library (YASSL) used by the previous version of the database did not have this performance issue.
  • As part of our preparation for this upgrade, we performed several rounds of load and stress testing in various pre-production and production environments. The upgrade process itself consisted of deploying canary hosts in all environments and gradually ramping up the amount of traffic they received over a period of 5 months. However, our testing did not reproduce the specific conditions that triggered this performance regression; a short illustration of those conditions follows this list.
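
As an illustration of the conditions described above, the sketch below checks whether a given MySQL host is in the regression-prone regime: clients connecting over TLS, with a connection count approaching ~1,000. This is not PagerDuty tooling; the hostname, credentials, and the choice of Python with the mysql-connector-python client are assumptions made for the example only.

    # Sketch only: placeholder host and credentials, assumes mysql-connector-python.
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(
        host="mysql-primary.example.internal",  # hypothetical database host
        user="monitor",                         # hypothetical read-only account
        password="secret",
    )
    cur = conn.cursor()

    # TLS protocol negotiated for this session (an empty value means no TLS).
    cur.execute("SHOW SESSION STATUS LIKE 'Ssl_version'")
    print(cur.fetchone())  # e.g. ('Ssl_version', 'TLSv1.2')

    # Number of client connections currently open on this host; the regression
    # described above was most visible as this approached and exceeded ~1,000.
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    print(cur.fetchone())  # e.g. ('Threads_connected', '1142')

    cur.close()
    conn.close()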

Resolution

After troubleshooting the possible contributing factors for the degraded performance, we decided to roll back and performed a failover to the standby database, which was running the previous version of MySQL. Once this failover was completed, traffic to the database returned to normal. Events and notifications that were backlogged during the incident were re-enqueued while we monitored the overall health of the systems.

After investigating the incident further, we determined that the performance regression was triggered only under very specific conditions. To resolve it, we linked the new version of MySQL against a version of OpenSSL (1.1.1) that does not have the same performance regression. By load testing this new setup while mimicking the conditions encountered during the incident, we gained confidence that this solution is the correct one.
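
For context, a heavily simplified sketch of this kind of load test is shown below: hold many TLS-encrypted connections open against a single database host and measure the latency of a trivial query. This is not PagerDuty’s actual load-testing tooling; the hostname, credentials, client counts, and the use of Python with mysql-connector-python are illustrative assumptions, and a real test would use a dedicated load-generation framework spread across many machines.

    # Toy sketch only; real load tests would be distributed across many hosts.
    import statistics
    import threading
    import time

    import mysql.connector  # pip install mysql-connector-python

    HOST = "mysql-canary.example.internal"  # hypothetical canary host
    USER, PASSWORD = "loadtest", "secret"   # hypothetical credentials
    CLIENTS = 1200                          # mirror the ">~1,000 clients" condition
    QUERIES_PER_CLIENT = 50

    latencies = []
    lock = threading.Lock()

    def client_worker():
        # Each worker holds one TLS connection open, like a long-lived application client.
        conn = mysql.connector.connect(host=HOST, user=USER, password=PASSWORD,
                                       ssl_disabled=False)
        cur = conn.cursor()
        samples = []
        for _ in range(QUERIES_PER_CLIENT):
            start = time.perf_counter()
            cur.execute("SELECT 1")  # trivial query isolates connection/TLS overhead
            cur.fetchall()
            samples.append(time.perf_counter() - start)
        cur.close()
        conn.close()
        with lock:
            latencies.extend(samples)

    threads = [threading.Thread(target=client_worker) for _ in range(CLIENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    cuts = statistics.quantiles(latencies, n=100)
    print(f"queries={len(latencies)} "
          f"p50={cuts[49]*1000:.1f}ms p95={cuts[94]*1000:.1f}ms p99={cuts[98]*1000:.1f}ms")

Comparing latency percentiles like these between the OpenSSL 1.0-linked and OpenSSL 1.1.1-linked builds at the same connection count is the kind of signal that surfaces the regression described above.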

What We Are Doing About This

We’ve identified areas of improvement that will help make incidents of this nature less likely and quicker to resolve in the future, such as:

  • Generating more representative load in the load test environment to identify these types of regressions more readily.
  • Reducing the time it takes to decide to roll back by defaulting to a rollback after a pre-agreed amount of time.
  • Improving our communications process for major incidents on our status page.

Finally, we’d like to apologize for the impact this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Jun 24, 2020 - 21:32 UTC

Resolved
We are fully recovered. All systems are functioning normally.
Posted Jun 02, 2020 - 20:24 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 02, 2020 - 19:37 UTC
Identified
After several additional remediation steps, we have rolled back an earlier upgrade; internal metrics appear to be returning to original performance.
Posted Jun 02, 2020 - 19:31 UTC
Update
We are continuing to investigate this issue.
Posted Jun 02, 2020 - 19:28 UTC
Investigating
We are continuing to investigate this issue.
Posted Jun 02, 2020 - 18:34 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 02, 2020 - 17:44 UTC
Investigating
We performed a database failover and are seeing delays in internal service communications. All events are being processed, but with degraded performance.
Posted Jun 02, 2020 - 17:41 UTC
This incident affected: Events API and REST API.