Issues Creating Events
Incident Report for PagerDuty
Postmortem

Summary

On November 10th, at 01:01 UTC, an automated security process inadvertently revoked all AWS permissions for our internal services. This affected our ability to process events in a timely manner and delayed notification delivery. Notifications were delayed by an average of 31 minutes until the backlog was cleared at 02:47 UTC.

We understand that customers expect and require our notifications to be delivered promptly, and that any delay in delivery is unacceptable. We sincerely apologize for any inconvenience caused by this incident. We have already taken steps to improve our tooling and processes to prevent a recurrence.

If you have any questions about this incident, please do not hesitate to contact our support team at support@pagerduty.com.

What Happened?

PagerDuty uses automated tooling to disable unused users on our AWS accounts. This tool identifies any users who have not logged in or used their permissions within a certain timeframe, and then disables their access, notifying the user that their account has been deactivated for inactivity.

Many PagerDuty services use IAM Roles to authenticate with AWS; however, some services still require us to use IAM Users instead. These users cannot log in to the AWS console and have API access only.

A change to our automated tooling caused it to incorrectly treat users who lack a console login as eligible for deactivation. Because those users had never logged in, the tool classified them as unused accounts and revoked their access. As a result, every API key belonging to those users could no longer perform any action.
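
As an illustration only, the failure mode described above can be sketched as follows. This is not the code of the tool itself: the IamUser record, its field names, the function names, and the 90-day threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IamUser:
    # Hypothetical record; the field names are illustrative only.
    name: str
    last_console_login: Optional[datetime]  # None if the user never logged in
    last_api_key_use: Optional[datetime]    # None if no API key was ever used


def is_unused_buggy(user: IamUser, now: datetime, max_idle: timedelta) -> bool:
    # Only console logins count as activity here, so an API-only user
    # that has never logged in always looks unused.
    if user.last_console_login is None:
        return True
    return now - user.last_console_login > max_idle


def is_unused(user: IamUser, now: datetime, max_idle: timedelta) -> bool:
    # Consider every kind of activity the user can have, and judge the
    # user by the most recent one.
    activity = [t for t in (user.last_console_login, user.last_api_key_use)
                if t is not None]
    if not activity:
        return True
    return now - max(activity) > max_idle


now = datetime(2017, 11, 10, 1, 0)
max_idle = timedelta(days=90)  # assumed threshold
service_user = IamUser("event-processor", None, now - timedelta(minutes=5))

print(is_unused_buggy(service_user, now, max_idle))
print(is_unused(service_user, now, max_idle))
```

In this sketch the API-only service user is flagged by the first check even though its key was used minutes earlier, while the second check treats recent API key use as activity.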

Once the issue was identified, permissions were restored, and we began processing the backlog of events at 01:34 UTC.

We continued to process through the backlog of events until 02:47 UTC, at which point the backlog was cleared and PagerDuty systems were restored to normal operations.

What Are We Doing About This?

We have taken several steps to ensure a repeat incident does not occur, and we have additional projects planned to reduce the likelihood of any similar incidents occurring in future.

  • The bug in our automated tool has been fixed, and extra checks are now in place to ensure that it does not incorrectly identify users.

  • The testing process for the tooling did not surface the issue because the tool's test output made no distinction between an ineligible user and an already deactivated user. This has also been fixed.

  • We have work planned to add more automation around AWS permissions, which would give us better visibility into permission changes, and provide us with faster methods of changing access.
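
To illustrate the second item above, a dry-run report that keeps the three outcomes apart might look like the following sketch. The Outcome names, the dry_run function, and the example users are hypothetical and do not reflect the actual output of our tooling.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class Outcome(Enum):
    WOULD_DEACTIVATE = "would deactivate"
    INELIGIBLE = "ineligible (skipped)"
    ALREADY_DEACTIVATED = "already deactivated"


def dry_run(users: Iterable[Tuple[str, bool, bool]]) -> Dict[str, Outcome]:
    """Classify users without changing anything.

    Each user is (name, is_active, is_unused); both flags are
    hypothetical inputs computed elsewhere.
    """
    report = {}
    for name, is_active, is_unused in users:
        if not is_active:
            report[name] = Outcome.ALREADY_DEACTIVATED
        elif is_unused:
            report[name] = Outcome.WOULD_DEACTIVATE
        else:
            report[name] = Outcome.INELIGIBLE
    return report


report = dry_run([
    ("event-processor", True, False),
    ("former-employee", False, True),
    ("stale-account", True, True),
])
for name, outcome in report.items():
    print(f"{name}: {outcome.value}")
```

With separate labels, a service user that unexpectedly moves from "ineligible" to "would deactivate" is visible in the test output before any access is revoked.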

Again, please do not hesitate to contact our support team at support@pagerduty.com if you have any questions.

Posted Nov 13, 2017 - 23:11 UTC

Resolved
We have recovered and events are flowing normally. We are continuing to monitor the situation.
Posted Nov 10, 2017 - 02:45 UTC
Update
We're seeing recovery and are continuing to monitor the issue.
Posted Nov 10, 2017 - 02:24 UTC
Monitoring
We've taken actions to correct the issue and the delay in event processing is decreasing.
Posted Nov 10, 2017 - 02:06 UTC
Identified
We’ve identified the issue causing delays in event processing and are actively working to mitigate it.
Posted Nov 10, 2017 - 01:46 UTC
Investigating
We are currently investigating delays in incoming event processing.
Posted Nov 10, 2017 - 01:34 UTC