On August 16th at 22:30 UTC, we suffered a 45-minute degradation of our Slack integration. All trigger messages were still delivered to Slack during this window, but updates to incident status were lost, and the incident acknowledgement and resolution buttons in Slack were non-functional.
The cause of this service degradation was a deploy to the infrastructure backing our Slack integration. The deploy prevented our systems from tracking the message identifiers of messages we had already sent to Slack, so we could no longer update those messages.
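To illustrate why losing those identifiers breaks updates: Slack's `chat.postMessage` API returns a timestamp (`ts`) that, together with the channel, uniquely identifies the posted message, and `chat.update` requires that same channel/`ts` pair. The sketch below shows the general pattern with an in-memory store and hypothetical names (`MessageTracker`, `INC-123`); it is not our actual implementation.

```python
from typing import Dict, Tuple


class MessageTracker:
    """Maps an incident id to the (channel, ts) of its Slack message.

    Hypothetical sketch: the real service persists this mapping so that
    later status changes can be applied to the original message via
    Slack's chat.update method.
    """

    def __init__(self) -> None:
        self._index: Dict[str, Tuple[str, str]] = {}

    def record(self, incident_id: str, channel: str, ts: str) -> None:
        # Called after chat.postMessage succeeds; ts comes from its response.
        self._index[incident_id] = (channel, ts)

    def lookup(self, incident_id: str) -> Tuple[str, str]:
        # Called before chat.update; if this mapping is lost, the update
        # has no target message and is silently dropped.
        return self._index[incident_id]


# Example: record the message posted for an incident, then retrieve it
# when the incident's status changes.
tracker = MessageTracker()
tracker.record("INC-123", "C024BE91L", "1629152400.000200")
channel, ts = tracker.lookup("INC-123")
```

During the incident, the step corresponding to `record` was effectively broken, so later `lookup` calls had nothing to return and status updates could not be applied.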
Our monitoring detected the issue shortly after the code rolled out, and we began our investigation. Rolling back the deploy resolved the issue.
The code was tested in our staging environment before deploying to production. However, a configuration difference between the staging and production environments meant that a successful test in staging did not guarantee a successful production deploy. We have since aligned the staging configuration with production to prevent this situation from recurring.
We will also be improving our deploy procedure, both proactively and reactively. On the proactive side, we will add a canary process to catch errors before they reach all customers. On the reactive side, a delay in our rollback procedure caused the outage to last longer than necessary; we have already addressed this by making the rollback faster.
We would like to apologize again for any inconvenience this issue caused. If you have any questions, please do not hesitate to contact us at firstname.lastname@example.org.