Events API Delays
Incident Report for PagerDuty
Postmortem

Summary

On August 2nd, 2022, between 20:18 UTC and 20:49 UTC, PagerDuty experienced an incident in the US service region which resulted in delays to event processing and prevented users from performing certain actions on Incidents/Alerts (trigger, resolve, resume, merge). As a result of the delays to event processing, customers would also have experienced delayed notifications. At 20:31 UTC, the cause of the incident was mitigated and systems began returning to normal: Incident/Alert actions were once again functional and we began working through the backlog of events. By 20:49 UTC, all delayed events had been processed and all systems had returned to normal.

What Happened

As part of ensuring PagerDuty continues to operate on well-supported versions of its software dependencies, we completed a major version upgrade of the MySQL database that powers the Incident/Alert lifecycle. This upgrade completed at approximately 19:30 UTC on August 2nd. At 20:18 UTC, we began to observe an increase in HTTP request timeouts for the application that uses this database, and internal teams were paged to investigate. The investigation uncovered a high amount of lock contention on a database table that is essential to the Incident/Alert lifecycle. This lock contention caused database requests that interact with this table to hang, which resulted in failed HTTP responses to users and halted our ability to process events. At 20:31 UTC, a limit within the database was reached, which caused it to abort the hung requests. Once those requests were aborted, database requests were once again able to complete successfully, and database metrics and HTTP error rates began to return to normal levels. Users were once again able to perform actions against Incidents/Alerts, and we began working through the backlog of events. By 20:49 UTC, all delayed events had been processed and all systems had returned to normal.
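
For illustration only, the sketch below reproduces the general failure mode described above; it is not PagerDuty's schema, code, or configuration. One transaction holds a row lock on a hot table while a second transaction blocks on the same row until a server-side limit (here, MySQL's innodb_lock_wait_timeout, 50 seconds by default) aborts the waiting statement. The connection parameters and the "alerts" table are hypothetical.

```python
# Illustrative only: a hypothetical "alerts" table and connection settings.
# Demonstrates a hung statement being aborted by MySQL's own lock wait
# limit (innodb_lock_wait_timeout, surfaced to the client as error 1205).
import pymysql


def connect():
    return pymysql.connect(host="127.0.0.1", user="app", password="secret",
                           database="example", autocommit=False)


holder = connect()
waiter = connect()

with holder.cursor() as cur:
    # Transaction 1 takes an exclusive row lock and never commits,
    # simulating the contention seen during the incident.
    cur.execute("SELECT id FROM alerts WHERE id = 42 FOR UPDATE")

try:
    with waiter.cursor() as cur:
        # Transaction 2 needs the same row, so this statement hangs until
        # the server's innodb_lock_wait_timeout (50s by default) expires.
        cur.execute("UPDATE alerts SET status = 'resolved' WHERE id = 42")
    waiter.commit()
except pymysql.err.OperationalError as exc:
    # MySQL aborts the waiting statement with error 1205 ("Lock wait
    # timeout exceeded"), the kind of server-side limit that ultimately
    # cleared the hung requests during the incident.
    print("aborted:", exc)
finally:
    holder.rollback()
    waiter.rollback()
```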

What We Are Doing About This

Following the incident, the team identified write queries against one particular table that did not complete during the impacted time period (20:18 to 20:31 UTC). This behavior had not been seen in the prior version of the data store, nor did it appear in our testing of the new version. As a result:

  • We’ve implemented a query monitoring and killing solution which will prevent a repeat of these long-running queries/transactions (a rough sketch of this approach follows this list).
  • We’ve reviewed the timeout settings configured on the data store to ensure they’re correctly tuned.
  • We’ll be reviewing the ordering of queries against the problematic table to minimize/eliminate the known deadlocking query patterns.
  • We’ll be adding more logging from our database and application, which will help us troubleshoot similar issues more quickly in the future.
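
As a rough, hypothetical sketch of the query-killing approach mentioned in the first bullet (the specific tooling we use is not described here), a watchdog can poll MySQL's process list and abort statements that exceed a time threshold. The connection details and the 30-second threshold below are assumptions for illustration, not our production values.

```python
# Hypothetical watchdog sketch (not PagerDuty's actual tooling): poll
# information_schema.processlist and abort statements running longer than
# a threshold. Credentials and the threshold are illustrative assumptions.
import time

import pymysql

MAX_QUERY_SECONDS = 30  # illustrative ceiling for a single statement


def kill_long_running_queries(conn):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, time, info FROM information_schema.processlist "
            "WHERE command = 'Query' AND time > %s",
            (MAX_QUERY_SECONDS,),
        )
        for query_id, elapsed, sql_text in cur.fetchall():
            # KILL QUERY aborts the running statement but leaves the
            # client connection open.
            cur.execute("KILL QUERY %d" % int(query_id))
            print(f"killed query {query_id} after {elapsed}s: {sql_text!r}")


if __name__ == "__main__":
    conn = pymysql.connect(host="127.0.0.1", user="watchdog",
                           password="secret", autocommit=True)
    while True:
        kill_long_running_queries(conn)
        time.sleep(5)
```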

We apologize for the inconvenience this has caused. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Aug 19, 2022 - 18:40 UTC

Resolved
We have resolved an incident where PagerDuty customers in the US service region experienced issues with delays in processing events on the Events API. The incident is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Aug 02, 2022 - 21:00 UTC
Monitoring
We are continuing to monitor the incident and we are starting to notice signs of recovery. Events are currently being processed without delay. We will provide an update within 20 minutes or as soon as there is a change in the status of the incident.
Posted Aug 02, 2022 - 20:54 UTC
Investigating
We are investigating an incident where PagerDuty customers in the US service region are experiencing processing delays with the Events API. We will provide further updates within 20 minutes.
Posted Aug 02, 2022 - 20:36 UTC
This incident affected: Events API (Events API (US)).