On December 10, 2019 at 15:33 UTC, PagerDuty experienced a major incident that caused all of our Public V2 API endpoints to return inconsistent error messages, the most common of which was a timeout error.
As a result of this issue, all clients interfacing with our Public V2 API would have experienced inconsistent behaviour on all actions.
Note that this incident did not affect our Events API or notification delivery, and no events or notifications were dropped during the course of this incident.
A version upgrade on our Public API frontend load balancers introduced a memory allocation bug, which caused the inconsistent behaviour.
The version upgrade was made on November 7th, but the bug did not surface until November 17th. The load balancers appeared to work properly after startup, and the bug was only triggered after they had been running for some time. Due to this time-based triggering behaviour, the bug was not fully identified and confirmed until November 22nd. At that time it was triaged and designated as low impact; however, we implemented a short-term plan to restart the load balancers periodically to work around it.
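To illustrate what such a periodic-restart workaround can look like, here is a hypothetical cron-based sketch. The service name, hosts, and schedule below are illustrative assumptions, not our actual configuration:

```shell
# Hypothetical workaround: a system crontab entry on each load balancer
# host that restarts the proxy service nightly, before the memory
# allocation bug has had time to manifest. Start times are staggered per
# host so that only one instance is out of rotation at any moment.
# "lb-frontend" is an illustrative service name.

# On host lb-1:
0 4 * * *   root  systemctl restart lb-frontend
# On host lb-2, 15 minutes later:
15 4 * * *  root  systemctl restart lb-frontend
# On host lb-3, 15 minutes after that:
30 4 * * *  root  systemctl restart lb-frontend
```

The staggering matters: restarting all instances at once would briefly take the whole API offline, whereas a rolling schedule keeps capacity loss to a single instance at a time.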
A ticket was filed with the vendor that supplies our frontend load balancer software, and a new version incorporating the bug fix was released on December 3rd. Due to an internal code freeze, the work to upgrade to the latest version was pushed to the following week and tentatively scheduled for December 11th.
On December 10th at 15:33 UTC a major incident response was initiated as a result of customer reports of inconsistent API behaviour. During the incident response the customer reports were correlated with the bug that we had previously observed and documented on November 22nd. A manual rolling restart of our frontend load balancers was initiated at 16:12 UTC to reset the services. The rolling restart resolved the memory allocation issue once again, and the Public API was returned to a fully functional state.
In retrospect, it is clear that the periodic scheduled restart of the load balancers was insufficient to address the issue. The incident response team decided to accelerate the deployment of the new frontend load balancer version containing the bug fix, and the new version was successfully deployed and tested that same day, December 10th, at 19:54 UTC.
As a takeaway from this incident, we identified a gap in the monitoring of our API traffic and have filled it by adding new monitors, ensuring that issues of this type are identified faster and proactively by our engineering teams.
Finally, we would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.