On August 2nd, 2022, between 20:18 UTC and 20:49 UTC, PagerDuty experienced an incident in the US service region which resulted in delays to event processing and prevented users from performing certain actions on Incidents/Alerts (trigger, resolve, resume, merge). As a result of the delays to event processing, customers would also have experienced a delay in notifications. At 20:31 UTC, the cause of the incident was mitigated and systems began returning to normal. Incident/Alert actions were once again functional and we began making progress on the backlog of events. By 20:49 UTC, all delayed events were processed and all systems returned to normal.
As part of ensuring PagerDuty continues to operate on well-supported versions of software dependencies, we completed a major version upgrade of the MySQL database used to power the Incident/Alert lifecycle. This upgrade completed at approximately 19:30 UTC on August 2nd. At 20:18 UTC, we began to observe an increase in HTTP request timeouts for the application that uses this database, and internal teams were paged to investigate. The investigation uncovered a high level of lock contention on a database table which is essential for the Incident/Alert lifecycle. This lock contention caused database requests that interact with this table to hang, which resulted in failed HTTP responses to users and halted our ability to process events. At 20:31 UTC, a limit within the database was reached, which allowed it to abort the hung requests. Once these requests were aborted, database requests were again able to complete successfully, and all database metrics and HTTP error rates began to return to normal levels. Users were once again able to perform actions against Incidents/Alerts and we began making progress on the backlog of events. By 20:49 UTC, all delayed events were processed and all systems returned to normal.
Following the incident, the team identified write queries against one particular table which did not complete during the impacted period, between 20:18 and 20:31 UTC. We had not seen this behavior in the prior version of the data store, nor did it appear in our testing of the new version. As a result:
We apologize for the inconvenience that this has caused. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org