On October 13th, 2021, between 19:24 UTC and 01:37 UTC the following day, PagerDuty’s North American service region experienced an issue that caused Incident Timelines and Alert Logs to fail to load for newly created alerts and incidents. This data was missing from the website, the mobile application, and any REST API calls requesting an incident created during this window. Notification delivery, event ingestion, acknowledgment and resolution functionality, and all other web, mobile, and API functionality were unaffected.
Due to PagerDuty’s growth and continued success, we’ve been working to increase system capacity and our ability to serve more customers. One of these improvements is the expansion of the keyspace for Incident and Alert IDs, which gives us much greater capacity to serve our customers in the long run. Alert and Incident IDs are widely used and referenced across the systems and microservices that make up the PagerDuty platform, so a change to something this fundamental requires many coordinated changes across systems and services in order to avoid adverse impact to the PagerDuty platform.
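To make the stakes concrete, here is a minimal sketch of the keyspace problem. It assumes, hypothetically (the post does not name the actual column types), that some stores held these IDs as 32-bit signed integers: once an ID crosses that ceiling, any service that was not widened in step starts failing on writes.

```python
# Hypothetical illustration of the keyspace problem: a 32-bit signed
# integer column overflows once IDs cross 2,147,483,647, so every
# service that stores or parses these IDs must be widened together.
INT32_MAX = 2**31 - 1   # ceiling of a signed 32-bit column (e.g. MySQL INT)
INT64_MAX = 2**63 - 1   # ceiling after widening (e.g. MySQL BIGINT)

def fits(column_max: int, incident_id: int) -> bool:
    """Return True if the ID can be stored without overflow."""
    return incident_id <= column_max

old_id = 2_000_000_000          # still fits in 32 bits
new_id = 3_000_000_000          # past the old ceiling

assert fits(INT32_MAX, old_id)
assert not fits(INT32_MAX, new_id)   # writes like this one start failing
assert fits(INT64_MAX, new_id)       # after widening, writes succeed
```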
On October 13th, we undertook the final phase of this improvement by pushing all Alert and Incident IDs past the new threshold as part of a controlled change, so that we could respond quickly if the need arose. Immediately after the change was committed, the service responsible for Incident Timeline and Alert Log functionality began to suffer elevated error rates during write operations.
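As an illustration of what such a controlled cutover might look like, here is a hedged sketch written against a Python DB-API connection (e.g. pymysql) and assuming MySQL-style auto-increment tables. The table list, the `NEW_FLOOR` value, and the verification step are all illustrative, not PagerDuty’s actual change procedure.

```python
# A minimal sketch of a controlled ID cutover, assuming MySQL-style
# auto-increment tables. The change is applied one table at a time so
# it can be halted quickly if error rates rise.
NEW_FLOOR = 3_000_000_000  # hypothetical first ID in the expanded keyspace

def cut_over(conn, tables):
    for table in tables:
        with conn.cursor() as cur:
            # Bump the ID generator past the old 32-bit ceiling.
            cur.execute(f"ALTER TABLE {table} AUTO_INCREMENT = {NEW_FLOOR}")
            # Verify the next ID lands in the new range before moving on,
            # so a misconfigured table is caught immediately.
            cur.execute(
                "SELECT AUTO_INCREMENT FROM information_schema.TABLES "
                "WHERE TABLE_NAME = %s",
                (table,),
            )
            (next_id,) = cur.fetchone()
            assert next_id >= NEW_FLOOR, f"{table} was not cut over"
```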
The engineering teams involved identified the source of the issue: a single database column for this service had not received the appropriate data type change. We quickly initiated the fix, an in-place schema change to bring the column in line with expectations. Given the nature of this particular change, we deemed it safer to fix the issue forward than to roll back the wide set of changes that had been made across systems and datastores. We estimated that the migration itself would take approximately 3 hours to complete, and that the backfill operation would take an additional 2 hours to catch up with the data that had been queued for writing during the schema change.
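For concreteness, here is a sketch of what fixing forward can look like: an in-place column widening followed by a batched backfill of the queued writes. Every name here (`widen_column`, `drain_backlog`, the `log_entries` table, the queue interface) is hypothetical; the post does not describe PagerDuty’s actual code.

```python
# A hypothetical fix-forward path, assuming a MySQL-style store and a
# queue of writes that accumulated during the migration.
import time

def widen_column(conn, table, column):
    # In-place schema change bringing the lagging column in line with the
    # expanded ID keyspace. This blocks writes while it runs, which is
    # why new entries were queued rather than dropped.
    with conn.cursor() as cur:
        cur.execute(f"ALTER TABLE {table} MODIFY {column} BIGINT NOT NULL")

def drain_backlog(conn, queue, batch_size=500):
    # Replay queued timeline/log entries in small batches so the catch-up
    # load does not overwhelm the freshly migrated table.
    while True:
        batch = queue.pop(batch_size)   # hypothetical queue API
        if not batch:
            break
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO log_entries (incident_id, payload) "
                "VALUES (%s, %s)",
                [(e.incident_id, e.payload) for e in batch],
            )
        conn.commit()
        time.sleep(0.1)  # pace the backfill
```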
Once the migration and backfill were complete, Incident Timelines and Alert Logs were fully restored, and users regained full access to incident timeline data with no loss of data.
We’ve identified areas of improvement that will make incidents of this nature less likely in the future.
Finally, we’d like to apologize for the impact this had on our customers. If you have any further questions, please reach out to firstname.lastname@example.org.