From 14:40 UTC to 17:17 UTC on September 11, 2019, we experienced a disruption in internal network communication. An internal system used for service discovery came under high load and responded slowly. As a result, multiple services involved in alert grouping were unable to communicate with each other. The disruption occurred during a planned infrastructure migration from older, manually provisioned and managed AWS EC2 hosts to a newer containerized orchestration environment.
During the incident, a percentage of our notifications were significantly delayed because event processing was delayed. We did not drop any events during this time. PagerDuty’s web and mobile apps and REST APIs remained functional and available for the entire duration of the incident.
Due to a few key differences in traffic profiles between our test and production environments, the hosts in the new environment were unable to process traffic quickly enough in certain situations. These hosts subsequently failed health checks, were marked as unhealthy, and became ineligible to accept traffic. As the pool of healthy hosts shrank, more traffic shifted to nodes that were already struggling, pushing them over as well, while new hosts could not be spun up quickly enough without causing additional impact or resource exhaustion.
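The cascading behavior described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the incident itself: traffic is split evenly across healthy hosts, and a host fails its health check as soon as its share exceeds its capacity. The function name, host names, and all numbers are hypothetical.

```python
def simulate_cascade(total_load, capacities):
    """Simulate health-check-driven load shedding across a pool of hosts.

    Assumptions (illustrative only): load is balanced evenly across healthy
    hosts, and a host is marked unhealthy when its share exceeds its capacity.
    Returns (hosts marked unhealthy in order, hosts still healthy).
    """
    healthy = dict(capacities)  # host name -> capacity, in arbitrary units
    failed = []
    while healthy:
        share = total_load / len(healthy)
        overloaded = [host for host, cap in healthy.items() if share > cap]
        if not overloaded:
            break  # every remaining host can carry its share
        for host in overloaded:
            # An unhealthy host leaves the pool, so its traffic is
            # redistributed to the remaining hosts on the next pass.
            del healthy[host]
            failed.append(host)
    return failed, sorted(healthy)


# Hypothetical pool: one slightly undersized host is enough to start a cascade.
pool = {"a": 100, "b": 120, "c": 150, "d": 150}
failed, remaining = simulate_cascade(440, pool)
```

With these made-up numbers, the pool absorbs a load of 380 without any failures, but at 440 the weakest host fails first and each removal raises the per-host share enough to take out the next one.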
A rollback was initiated, but the older fleet had been scaled down over time to free up resources. The rollback therefore took longer than expected, because it required re-provisioning enough machines to handle the full load. Once the old environment had been scaled back up, traffic was redirected and event processing resumed as normal. By 16:45 UTC, we had signs of recovery.
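The rollback delay comes down to a capacity gap: how many machines the old fleet would need to carry the full load versus how many were still running. A minimal sketch of that arithmetic follows; the function, its parameters, and the example figures are hypothetical and do not reflect our actual fleet sizes or tooling.

```python
def hosts_to_provision(peak_load, per_host_capacity, hosts_running, headroom_pct=20):
    """Return how many extra hosts a rollback target needs for full load.

    All inputs are illustrative: peak_load and per_host_capacity share the
    same arbitrary unit (for example, events per second), and headroom_pct
    is extra capacity reserved on top of peak load.
    """
    numerator = peak_load * (100 + headroom_pct)
    denominator = 100 * per_host_capacity
    required = -(-numerator // denominator)  # integer ceiling division
    return max(0, required - hosts_running)


# Hypothetical example: a fleet scaled down to 4 hosts, each handling 100 units,
# facing a peak load of 1000 units with 20% headroom.
gap = hosts_to_provision(peak_load=1000, per_host_capacity=100, hosts_running=4)
```

A check of this kind, run before each scale-down step, would indicate whether the old environment can still absorb a rollback immediately or must be re-provisioned first.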
The migration is still planned, but we have revised the plan to add more logging and monitoring and to better support rollback in failure scenarios. We have also tagged key members of partner teams to increase overall visibility and reduce unexpected impact.
We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.