On November 10th at 01:01 UTC, notification delivery was delayed after an automated security process inadvertently revoked all AWS permissions for our internal services, which impaired our ability to process events in a timely manner. Notifications were delayed by an average of 31 minutes until 02:47 UTC.
We understand that customers expect and require that our notifications are delivered promptly, and that any delay in delivery is unacceptable. We sincerely apologize for any inconvenience caused by this incident. Rest assured that we have already taken steps to improve our tooling and processes to ensure that such incidents cannot happen again.
If you have any questions about this incident, please do not hesitate to contact our support team at email@example.com.
PagerDuty uses automated tooling to disable unused users on our AWS accounts. This tool identifies any users who have not logged in or used their permissions within a certain timeframe, and then disables their access, notifying the user that their account has been deactivated for inactivity.
Many PagerDuty services use IAM Roles to authenticate with AWS; however, some services still require us to use IAM Users instead. These users have no AWS console login and have API access only.
A change to our automated tooling caused it to incorrectly treat users without a console login as eligible for deactivation. Because those users had never logged in to the console, the tool classified them as unused accounts and revoked their access. As a result, every API key belonging to those users was unable to perform any action.
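To illustrate the kind of check involved, here is a minimal sketch in Python. It is not the actual PagerDuty tool: the `UserActivity` record, its field names, and the `is_inactive` function are all hypothetical. The point it shows is that an API-only user should be judged by its API key usage, and that a missing console login timestamp should not count as inactivity on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class UserActivity:
    # Hypothetical record; the real tool's data model is not described in this report.
    name: str
    has_console_login: bool
    last_console_login: Optional[datetime]
    last_api_key_use: Optional[datetime]


def is_inactive(user: UserActivity, now: datetime, max_idle: timedelta) -> bool:
    """Return True only if every credential the user holds has been idle."""
    # API key usage always counts as activity.
    timestamps = [user.last_api_key_use]
    # Console logins are only considered for users that can log in at all.
    if user.has_console_login:
        timestamps.append(user.last_console_login)
    recent = [t for t in timestamps if t is not None and now - t <= max_idle]
    return not recent


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
service_user = UserActivity("svc-events", False, None, now - timedelta(hours=1))
print(is_inactive(service_user, now, timedelta(days=90)))
```

In this sketch the service user has never logged in to the console but used its API key an hour ago, so it is not flagged as inactive.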
Once the issue was identified, permissions were restored, and we began processing the backlog of events at 01:34 UTC.
We continued to process through the backlog of events until 02:47 UTC, at which point the backlog was cleared and PagerDuty systems were restored to normal operations.
We have taken several steps to ensure a repeat incident does not occur, and we have additional projects planned to reduce the likelihood of any similar incidents occurring in future.
The bug in our automated tool has been fixed, and extra checks are now in place to ensure that it does not incorrectly identify users.
The testing process for the tooling did not surface the issue because the tool's testing output made no distinction between an ineligible user and an already deactivated user. This has also been fixed.
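As a rough illustration of what distinct testing output could look like, the Python sketch below reports a separate outcome for each case. The `Outcome` names, the `classify` function, and the input tuple layout are assumptions made for this example and do not reflect the real tool's output format.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class Outcome(Enum):
    # Hypothetical labels; each case gets its own value so that none can be
    # mistaken for another in a dry run.
    WOULD_DEACTIVATE = "would deactivate"
    INELIGIBLE = "ineligible (still in use)"
    ALREADY_DEACTIVATED = "already deactivated"


def classify(enabled: bool, idle: bool) -> Outcome:
    """Map a user's state to exactly one dry-run outcome."""
    if not enabled:
        return Outcome.ALREADY_DEACTIVATED
    if idle:
        return Outcome.WOULD_DEACTIVATE
    return Outcome.INELIGIBLE


def dry_run_report(users: Iterable[Tuple[str, bool, bool]]) -> List[str]:
    """Build one report line per (name, enabled, idle) tuple, changing nothing."""
    return [f"{name}: {classify(enabled, idle).value}" for name, enabled, idle in users]


for line in dry_run_report([
    ("svc-events", True, False),
    ("old-contractor", False, True),
    ("unused-bot", True, True),
]):
    print(line)
```

With separate labels, a reviewer reading the dry run can see at a glance when an account that is still in use shows up as "would deactivate".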
We have work planned to add more automation around AWS permissions, which would give us better visibility into permission changes, and provide us with faster methods of changing access.
Again, please do not hesitate to contact our support team at firstname.lastname@example.org if you have any questions.