Incident Timeline and Alert Log issues
Incident Report for PagerDuty
Postmortem

Summary

On October 13th, 2021, between 19:24 UTC and 01:37 UTC the following day, PagerDuty’s North American service region experienced an issue that caused Incident Timelines and Alert Logs to not load correctly for new alerts and incidents. This data was missing from the website, the mobile application, and any REST API calls that requested an incident created during this time. Notification delivery, event ingestion, acknowledgment and resolution functionality, and all other web, mobile, and API functionality were not impacted.

What Happened

Due to the growth and continued success of PagerDuty, we’ve been working to increase system capacity and our ability to serve more customers. One of these improvements was the expansion of the keyspace for Incident and Alert IDs, which gives us much greater capacity to serve our customers in the long run. Alert and Incident IDs are widely used and referenced across the systems and micro-services that make up the PagerDuty platform. A change to something this fundamental requires many coordinated changes across systems and services to occur without adverse impact to the PagerDuty platform.

On October 13th, we undertook the final phase of this improvement process by pushing all Alert and Incident IDs over to the new threshold. We did this as a controlled change so that we could respond quickly if the need arose. Immediately after the change was committed, the service responsible for Incident Timeline and Alert Log functionality began to suffer from increased error rates during write operations.

The engineering teams involved traced the issue to a single column in this service’s database that had not received the corresponding data type change. We quickly initiated a fix, which involved an in-place schema change on the column to bring it in line with expectations. Due to the nature of this particular change, we deemed it safer to fix the issue forward than to roll back the wide set of changes that had been made across systems and datastores. We estimated that the migration itself would take approximately 3 hours to complete, and that the data backfill operation would take an additional 2 hours to catch up with data that had been queued waiting to be written during the schema change.
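
As an illustration only, the failure mode described above can be sketched in a few lines of Python. The sketch assumes the un-migrated column was a 32-bit signed integer; the report does not state the original type, only that the target was Bigint. The names `fits` and `write_log_entry` are hypothetical and do not come from PagerDuty's systems.

```python
# Assumption: the old column was a signed 32-bit INT and the new one a
# signed 64-bit BIGINT. The report only confirms the BIGINT target.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def fits(value: int, column_max: int) -> bool:
    """Return True if a non-negative ID fits in a column with this maximum."""
    return 0 <= value <= column_max


def write_log_entry(incident_id: int, column_max: int) -> dict:
    """Hypothetical write path: reject IDs the column cannot store."""
    if not fits(incident_id, column_max):
        raise OverflowError(
            f"incident_id {incident_id} exceeds column maximum {column_max}"
        )
    return {"incident_id": incident_id}


# An ID just past the old range is accepted by a widened column...
new_id = INT32_MAX + 1
print(write_log_entry(new_id, INT64_MAX))

# ...but is rejected by a column that was never widened.
try:
    write_log_entry(new_id, INT32_MAX)
except OverflowError as exc:
    print("write failed:", exc)
```

Under this assumption, every write carrying an ID above the old range fails against the one un-widened column, which is consistent with the increased error rates on write operations described above.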

Once the migration and backfill were complete, the Incident Timelines and Alert Logs functionality was completely restored and users had full access to incident timeline data with no loss of data.

What We Are Doing About This

We’ve identified areas of improvement that will help make incidents of this nature less likely in the future, such as:

  • Comparative analysis of schemas across environments and service regions to ensure consistent outcomes.
  • Improvement of schema change tooling to ensure consistent outcomes regardless of database size.
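
A comparative schema analysis of the kind listed above could look roughly like the following Python sketch. It is not PagerDuty's tooling: the `Schema` shape, the function `find_type_mismatches`, and the table and column names are all hypothetical, and in practice the column types would be read from each database's catalog rather than written by hand.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (table, column) -> declared data type.
Schema = Dict[Tuple[str, str], str]


def find_type_mismatches(reference: Schema, candidate: Schema) -> List[str]:
    """List columns whose type in `candidate` differs from `reference`."""
    problems = []
    for (table, column), ref_type in sorted(reference.items()):
        cand_type = candidate.get((table, column))
        if cand_type is None:
            problems.append(f"{table}.{column}: missing in candidate")
        elif cand_type.lower() != ref_type.lower():
            problems.append(
                f"{table}.{column}: expected {ref_type}, found {cand_type}"
            )
    return problems


# Illustrative data only; these names are not from the incident report.
expected = {
    ("log_entries", "incident_id"): "bigint",
    ("log_entries", "alert_id"): "bigint",
}
production = {
    ("log_entries", "incident_id"): "int",
    ("log_entries", "alert_id"): "bigint",
}

for problem in find_type_mismatches(expected, production):
    print(problem)
```

Running a check like this against every environment and service region before the final cutover would flag a column that was left on the narrower type.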

Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Oct 27, 2021 - 22:36 UTC

Resolved
This incident has been resolved.
Posted Oct 14, 2021 - 02:51 UTC
Update
We have confirmed that the affected systems are now functioning normally, and are continuing to monitor to verify full recovery.
Posted Oct 14, 2021 - 01:43 UTC
Update
We are observing full recovery of incident log entries functionality. We will continue to monitor the state of our systems and will share additional updates within an hour.
Posted Oct 14, 2021 - 01:19 UTC
Update
We are observing progress in processing of the delayed incident log entries, but will continue to actively monitor the outcome of the fix. We will be following up with hourly updates until we confirm the full recovery.
Posted Oct 14, 2021 - 00:16 UTC
Update
We have completed our fix for the log entry issues. We are now focused on processing delayed log entries. Customers may still experience delayed entries in their timeline. We will continue to post hourly updates until we have fully recovered.
Posted Oct 13, 2021 - 23:27 UTC
Update
We continue to monitor our fix underway. We anticipate the resolution for incident timeline and alert logs will complete within the next hour. However, it may still take several hours after that before we process all delayed log entries. There will be no data loss. We will continue to provide updates on progress hourly until this has been fully remediated.
Posted Oct 13, 2021 - 22:39 UTC
Update
We are continuing to monitor for any further issues.
Posted Oct 13, 2021 - 22:38 UTC
Monitoring
We are monitoring a fix that is underway. Log entries for new incidents are still impacted. We will continue to provide updates on progress hourly until this has been fully remediated.
Posted Oct 13, 2021 - 21:30 UTC
Update
A fix is currently being applied. We are still estimating that this may take several hours. We will continue to provide updates on progress hourly until this has been fully remediated.
Posted Oct 13, 2021 - 20:22 UTC
Identified
PagerDuty conducted a migration of our Incidents field to Bigint and experienced an issue related to the incidents timeline. The team is working to restore this service. The planned migration/fix will take up to two hours. During that time, any new incidents that are created may not show up in the incidents timeline and alert log. No data will be lost, but it may be delayed in appearing on the page or in search results. We will provide an update in 30 minutes on the progress of the fix. Notifications and incident creation are not affected.
Posted Oct 13, 2021 - 19:53 UTC
This incident affected: Mobile Application (Mobile Application (US)), Web Application (Web Application (US)), and REST API (REST API (US)).