Issue Viewing Alerts in Web UI
Incident Report for PagerDuty
Postmortem

Summary

On January 20th at 1:04pm UTC, we began to experience an issue with displaying incoming alerts in the Web UI. We implemented a solution for new incoming alerts at 3:25am UTC on January 21st, and new alerts were being processed normally as of 8:19am UTC. After this point, we determined that a backlog of old alerts still needed to be processed for display in the Web Application. We then spun up a new Incident to determine the best way to process the backlog that had accumulated.

What Happened?

Our engineers initially discovered that the cluster responsible for storing alerts had stopped accepting writes as a result of low disk space, and had subsequently entered read-only mode. They were able to confirm that there was no data loss and no impact on notifications, our mobile app, or our APIs. Our engineers provisioned a new cluster with significantly more resources, restored from our latest backup, and routed all new alerts to it. This mitigated the problem for new alerts, but did not address the alerts that had failed to write to the previous cluster after it entered read-only mode.

Once this secondary issue was confirmed, a new Incident Call was spun up to determine a plan of action that would address the backlog of alerts that had not yet been processed for display purposes in our Web UI. To backfill the alerts that weren’t yet displaying in the Web UI, we started a separate background process that worked through the backlog of all alerts, starting with the oldest we had on record. This way all alerts were eventually restored and made accessible in the PagerDuty application. We decided to leave the Incident on our Status Page in a “Monitoring” state until the backfill processing was complete, to maintain full transparency with our customers.
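The oldest-first backfill described above can be illustrated with a minimal sketch. This is not our actual backfill process: the `Alert` type, the `write_batch` callback, and the batch size are all hypothetical names chosen for illustration. It only shows the ordering idea, which is to sort the pending alerts by creation time and write them out in batches, oldest first.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Alert:
    # Hypothetical minimal alert record; real alerts carry far more data.
    alert_id: str
    created_at: datetime


def backfill_oldest_first(
    pending: Iterable[Alert],
    write_batch: Callable[[List[Alert]], None],
    batch_size: int = 100,  # assumed value, not taken from the report
) -> int:
    """Write pending alerts in batches, oldest first; return the count written."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Sort by creation time so the oldest alerts are restored first.
    ordered = sorted(pending, key=lambda alert: alert.created_at)

    written = 0
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        write_batch(batch)  # hypothetical hook that stores one batch
        written += len(batch)
    return written
```

Processing in batches keeps the background job from competing too heavily with live traffic, although the right batch size would depend on the storage system in use.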

What Are We Doing About This?

Our engineers will be revamping our database monitors to ensure that they are operating as expected and that they alert us earlier, before a failure occurs. We are also improving our internal guidelines on how to mitigate an issue like this faster.
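As a rough illustration of alerting before a disk fills up, the sketch below classifies disk usage against a warning and a critical threshold. The function names and the 75% and 90% thresholds are assumptions made for this example and do not describe our actual monitors; the only library call used is `shutil.disk_usage` from the Python standard library.

```python
import shutil


def classify_disk_usage(
    used_bytes: int,
    total_bytes: int,
    warn_at: float = 0.75,      # assumed warning threshold
    critical_at: float = 0.90,  # assumed critical threshold
) -> str:
    """Return "ok", "warning", or "critical" for the given disk usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 < warn_at < critical_at <= 1:
        raise ValueError("thresholds must satisfy 0 < warn_at < critical_at <= 1")

    fraction_used = used_bytes / total_bytes
    if fraction_used >= critical_at:
        return "critical"
    if fraction_used >= warn_at:
        return "warning"
    return "ok"


def check_path(path: str = ".") -> str:
    """Classify the usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_disk_usage(usage.used, usage.total)
```

The point of a separate warning level is to leave time to add capacity before writes start failing, rather than learning about the problem only once the cluster is already read-only.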

We would like to apologize for this service interruption. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Mar 01, 2019 - 21:02 UTC

Resolved
We have finished processing backlogged alerts and this incident is now resolved. All alerts should now be visible in the Web UI under their associated incidents and in the Alerts tab.
Posted Jan 23, 2019 - 07:15 UTC
Monitoring
From 1:04pm UTC on Jan 20 to 8:19am UTC today, Jan 21, we experienced an issue with individual alerts not displaying properly in the Web UI. This issue has since been resolved and new alerts are not affected. We are now processing a backlog of alerts, which will appear under their corresponding incidents over the course of the next few hours. The Web UI was the only affected component; this did not impact our APIs, the mobile app, or notifications. This incident will be updated once we’ve finished processing these alerts.
Posted Jan 21, 2019 - 19:00 UTC
This incident affected: Web Application.