On September 20, 2019, from 9:30 UTC to 13:27 UTC, PagerDuty experienced an issue that prevented actions initiated from within our Slack and Jira Cloud/Server integrations from succeeding.
At the start of the incident, we added a new web host to the fleet. One of our legacy services, which does not use our normal service discovery process, started sending requests to the new host before the host was ready. As a result, callbacks that the integrations sent in response to PagerDuty webhooks failed. These callbacks were intended to execute actions within the PagerDuty application (typically acknowledging or resolving an incident), but because of the failure, these actions did not occur.
Our engineering team initially underestimated the customer impact, which delayed our first post to the status page. Once the customer impact was understood, our response team posted to the status page. Our engineering team then identified the problem and manually deployed the code to the new host, which enabled it to start processing requests from the legacy service.
We have updated our provisioning process to include manually deploying code to new hosts while we work to migrate our legacy service to a new service that uses our service discovery process. We are also updating our internal documentation to identify the dependencies that rely on this legacy service so we can better communicate customer impact in the future.
We would like to express our sincere regret for the failures that resulted from this incident. For any questions, comments, or concerns, please contact us at email@example.com.