On January 20th at 1:04pm UTC, we began to experience an issue with displaying incoming alerts in the Web UI. We implemented a solution for new incoming alerts at 3:25am UTC on January 21st, and new alerts were being processed normally as of 8:19am UTC. After this point, we determined that there was a backlog of old alerts that still needed to be processed for display in the Web UI. We then spun up a new Incident to determine the best way to process the backlog that had accumulated.
Our engineers initially discovered that the cluster responsible for storing alerts had stopped accepting writes as a result of low disk space and had subsequently entered read-only mode. They were able to confirm that there was no data loss and no impact on notifications, our mobile app, or our APIs. Our engineers provisioned a new cluster from our latest backup, with significantly more resources, and routed all new alerts to it. This mitigated the problem for new alerts, but did not address the alerts that had failed to write to the previous cluster after it entered read-only mode.
Once this secondary issue was confirmed, a new Incident Call was spun up to determine a plan of action for the backlog of alerts that had not yet been processed for display in our Web UI. To backfill these alerts, we started a separate background process that worked through the entire backlog, starting with the oldest alerts we had on record. This way, all alerts were eventually restored and made accessible in the PagerDuty application. We decided to leave the Incident on our Status Page in a “Monitoring” state until the backfill processing was complete, to maintain full transparency with our customers.
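For readers curious about what an oldest-first backfill can look like, here is a minimal sketch in Python. It is an illustration only and not the process our engineers ran: the `Alert` record, the `batches_oldest_first` and `backfill` functions, and the `write_batch` callback are all hypothetical names, and the sketch assumes the whole backlog fits in memory.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Alert:
    # Hypothetical record shape; the real alert schema is not described here.
    alert_id: str
    created_at: datetime


def batches_oldest_first(alerts: Iterable[Alert], batch_size: int) -> Iterator[List[Alert]]:
    """Yield the backlog in fixed-size batches, oldest alert first."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ordered = sorted(alerts, key=lambda alert: alert.created_at)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


def backfill(
    alerts: Iterable[Alert],
    write_batch: Callable[[List[Alert]], None],
    batch_size: int = 100,
) -> int:
    """Pass every batch to write_batch (a hypothetical storage writer).

    Returns the number of alerts handed to the writer.
    """
    written = 0
    for batch in batches_oldest_first(alerts, batch_size):
        write_batch(batch)
        written += len(batch)
    return written
```

Working in batches keeps each write small, so a background job of this shape can run alongside live traffic, and processing in timestamp order means the oldest missing alerts become visible first.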
Our engineers will be revamping our database monitors to ensure that they are operating as expected and alert us earlier, before a failure occurs. We are also improving our internal guidelines on how to mitigate an issue like this faster.
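As a rough illustration of the kind of early-warning check a disk-space monitor can perform, here is a minimal Python sketch. It is not our actual monitoring configuration: the function names and the 75% and 90% thresholds are assumptions chosen for the example, and a real monitor would report to an alerting system instead of returning a string.

```python
import shutil


def classify_disk_usage(
    used_bytes: int,
    total_bytes: int,
    warn_at: float = 0.75,      # hypothetical warning threshold
    critical_at: float = 0.90,  # hypothetical critical threshold
) -> str:
    """Return "ok", "warning", or "critical" for a used/total ratio."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 < warn_at < critical_at <= 1:
        raise ValueError("thresholds must satisfy 0 < warn_at < critical_at <= 1")
    ratio = used_bytes / total_bytes
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"


def check_path(path: str, warn_at: float = 0.75, critical_at: float = 0.90) -> str:
    """Classify the filesystem that holds `path` using shutil.disk_usage."""
    usage = shutil.disk_usage(path)
    return classify_disk_usage(usage.used, usage.total, warn_at, critical_at)
```

The point of having two thresholds is that the warning level fires while there is still time to add capacity, well before a datastore reaches the point of refusing writes.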
We would like to apologize for this service interruption. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org.