Notification delivery issues for EU customers
Incident Report for PagerDuty
Postmortem

Summary

On June 9th, 2022, from 22:22 UTC to 23:40 UTC, an infrastructure failure delayed a quarter of Incident Notifications across all channels (Voice, SMS, Email) for customers sending Notifications via our EU region.

Other notifications in the EU region, such as stakeholder notifications, on-call handoffs, and responder requests, were not impacted. Notifications and other communications in all other regions were not impacted.

What Happened

On June 9th, 2022, at 22:22 UTC, the service responsible for grouping and scheduling outbound Voice, SMS, and Email notifications was still configured to use an outdated load-balancer sidecar container that was no longer available. Consequently, a quarter of Incident Notifications for customers using our EU region were delayed.

PagerDuty’s internal continuous deployment system alerted the engineering team, who updated the service to reference an active container and redeployed the affected portion of the infrastructure.

After the redeployment at 23:40 UTC, the service was fully online, and all the backlogged notifications were successfully processed and delivered.

What Are We Doing About This

Following this incident, our teams completed a review of our infrastructure. We will be updating our monitoring and tooling to detect references to outdated containers in all our services, to help prevent this kind of issue from recurring.
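
As an illustration only, a check of this kind could compare the container images each service references against the images that are still available. The sketch below is not PagerDuty's tooling; the function name, service names, and image tags are all hypothetical, and it assumes both lists can be exported as plain data.

```python
from typing import Dict, List, Set


def find_stale_references(
    service_images: Dict[str, List[str]],
    available_images: Set[str],
) -> Dict[str, List[str]]:
    """Return, per service, the referenced images that are no longer available."""
    stale: Dict[str, List[str]] = {}
    for service, images in service_images.items():
        missing = [image for image in images if image not in available_images]
        if missing:
            stale[service] = missing
    return stale


# Hypothetical inventory: service name -> container images it references.
service_images = {
    "notification-scheduler": ["scheduler:4.2", "lb-sidecar:1.0"],
    "email-sender": ["email-sender:2.7", "lb-sidecar:1.3"],
}
# Hypothetical set of images that are still available.
available_images = {"scheduler:4.2", "email-sender:2.7", "lb-sidecar:1.3"}

stale = find_stale_references(service_images, available_images)
for service, images in stale.items():
    print(f"{service}: references unavailable image(s) {', '.join(images)}")
```

Run on a schedule or as a pre-deployment step, a check like this would flag a stale sidecar reference before a redeployment depends on it.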

We apologize for the inconvenience this has caused. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Jun 30, 2022 - 22:46 UTC

Resolved
We have resolved an incident where PagerDuty customers in the EU service region experienced issues with degraded incident notification delivery rates. The issue is now resolved, and there is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Jun 09, 2022 - 23:45 UTC
Update
We have identified that only one quarter of incident notifications in the EU Service Region have been affected by this issue. We are still actively working on mitigation and will continue to provide further updates as they become available.
Posted Jun 09, 2022 - 23:32 UTC
Update
We are continuing to investigate an incident where all PagerDuty customers in the EU Service Region are experiencing issues with receiving incident notifications. Other notifications continue to be unaffected. We have identified a potential fix for the issue and will provide further updates within 20 minutes.
Posted Jun 09, 2022 - 23:23 UTC
Identified
We are investigating an incident where all PagerDuty customers in the EU Service Region are experiencing issues with incident notification delivery. Impacted customers may not receive incident notifications at this time; however, other notifications such as stakeholder notifications, on-call handoffs, and responder requests are functioning as expected. We are working to remediate the issue and will provide further updates within 20 minutes.
Posted Jun 09, 2022 - 23:04 UTC
Investigating
We are investigating issues with PagerDuty notification delivery for EU customers. We will update with further impact and severity within 15 minutes.
Posted Jun 09, 2022 - 22:49 UTC
This incident affected: Notification Delivery (Notification Delivery (EU)).