Issues with Slack and Jira Server/Cloud integrations
Incident Report for PagerDuty
Postmortem

Summary

On September 20, 2019, from 09:30 UTC to 13:27 UTC, PagerDuty experienced an issue that caused actions initiated from within our Slack and Jira Server/Cloud integrations to fail.

What Happened

At the start of the incident, we provisioned a new web host and added it to the fleet. One of our legacy services, which does not use our normal service discovery process, started sending requests to the new host before the host was ready to serve them. As a result, callback requests failed. Integrations send these callbacks in response to webhooks from PagerDuty in order to execute actions within the PagerDuty application (typically acknowledging or resolving an incident). Because the callbacks failed, these actions did not occur.
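
The failure mode described above can be illustrated with a minimal sketch. It is not PagerDuty's actual code: the Host type, the code_deployed flag, and the is_ready and routable_hosts functions are all hypothetical names chosen for the example. The idea is that a caller which filters its target list through a readiness check would skip a freshly provisioned host until code has been deployed to it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Host:
    name: str            # hypothetical host identifier
    code_deployed: bool  # hypothetical flag: has application code been deployed?


def is_ready(host: Host) -> bool:
    """Hypothetical readiness check: a host is ready once code is deployed."""
    return host.code_deployed


def routable_hosts(
    hosts: Iterable[Host],
    ready: Callable[[Host], bool] = is_ready,
) -> List[Host]:
    """Return only the hosts that pass the readiness check."""
    return [h for h in hosts if ready(h)]


# Illustrative fleet: two established hosts and one newly provisioned host.
fleet = [
    Host("web-1", code_deployed=True),
    Host("web-2", code_deployed=True),
    Host("web-new", code_deployed=False),  # provisioned, but not yet ready
]

targets = routable_hosts(fleet)
print([h.name for h in targets])
```

In this sketch the new host only becomes a request target after its readiness flag flips, which is the kind of guarantee a service discovery process typically provides and a hard-coded or separately maintained host list does not.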

Our engineering team initially underestimated the customer impact, which delayed our first post to the status page. Once we understood the customer impact, our response team posted to the status page. Our engineering team then identified the problem and manually deployed the code to the new host, which enabled it to start processing requests from the legacy service.

What Are We Doing About This

We have updated our provisioning process to include manually deploying code to new hosts while we work to replace our legacy service with a new service that uses our service discovery process. We are also updating our internal documentation to identify the dependencies that rely on this legacy service so we can better communicate customer impact in the future.

We would like to express our sincere regret for the failures that resulted from this incident. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Sep 24, 2019 - 22:39 UTC

Resolved
We are fully recovered.
Posted Sep 20, 2019 - 13:27 UTC
Monitoring
We have redeployed a service and are seeing signs of recovery. We will continue monitoring the results.
Posted Sep 20, 2019 - 13:05 UTC
Investigating
We are having issues processing requests from our Slack and Jira Server/Cloud integrations and are currently investigating.
Posted Sep 20, 2019 - 12:46 UTC
This incident affected: Webhooks.