Problems with Notification Delivery and Response
Incident Report for PagerDuty
Postmortem

Summary

On April 9, 2021, between 19:45 and 20:15 UTC (30 minutes), PagerDuty experienced an incident that resulted in duplicate notifications and an inability to reply directly from phone and SMS notifications. All other components (API, event ingestion, web and mobile applications) were unaffected.

What Happened

A configuration change was made to the load balancer cluster that routes requests from third-party notification providers. The change required a restart of the load balancer processes, rather than a hot reload, to take effect, and that restart had not yet taken place. Around the time of the incident, a service downstream from the load balancers was deployed. As a result, the load balancers were running without the configuration needed to route traffic to the new downstream instances.

Upon detection of service degradation, our automated response process was triggered and our on-call engineers began rectifying the issue. Service was restored by restarting the load balancer processes across the fleet so that they loaded the correct configuration.
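The failure mode described above, a restart-only configuration change that is still pending when a downstream deployment goes out, can be illustrated with a small pre-deploy check. This is a minimal sketch and not PagerDuty's tooling: the `LoadBalancerState` record, its fields, and the node names are hypothetical, and it assumes the time of the last configuration change and the process start time are available for each node.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LoadBalancerState:
    """Hypothetical snapshot of one load balancer node (illustrative only)."""
    name: str
    config_changed_at: datetime   # when the on-disk config was last changed
    process_started_at: datetime  # when the running process last (re)started
    restart_required: bool        # True if the change cannot be hot reloaded


def pending_restart(state: LoadBalancerState) -> bool:
    """True when the running process predates a restart-only config change."""
    return (
        state.restart_required
        and state.process_started_at < state.config_changed_at
    )


def nodes_blocking_deploy(fleet):
    """Names of nodes that still need a restart before a downstream deploy."""
    return [node.name for node in fleet if pending_restart(node)]


if __name__ == "__main__":
    changed = datetime(2021, 4, 9, 18, 0, tzinfo=timezone.utc)
    fleet = [
        # Started before the change: still running the old configuration.
        LoadBalancerState("lb-1", changed, datetime(2021, 4, 1, tzinfo=timezone.utc), True),
        # Restarted after the change: running the new configuration.
        LoadBalancerState("lb-2", changed, datetime(2021, 4, 9, 18, 30, tzinfo=timezone.utc), True),
    ]
    print(nodes_blocking_deploy(fleet))
```

A deployment pipeline could run a check of this kind and refuse to proceed while the returned list is non-empty. Whether such a gate matches the preventative measures actually adopted is not stated in this report.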

What We Are Doing About This

We understand how critical our platform is to you and your team(s), and we are taking the appropriate next steps. We have identified areas for improvement and have prioritized preventative measures to ensure that this particular issue does not recur. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Jul 14, 2021 - 16:21 UTC

Resolved
We have fully recovered. Notification delivery and response are functioning as expected.
Posted Apr 09, 2021 - 20:36 UTC
Update
After implementing a fix, we are seeing signs of recovery. It should now be possible to respond to notifications delivered via phone and SMS.
Posted Apr 09, 2021 - 20:21 UTC
Monitoring
We are currently experiencing an issue with notification delivery. This may cause duplicate notifications and interfere with the ability to respond to notifications.
Posted Apr 09, 2021 - 20:17 UTC
This incident affected: Notification Delivery (Notification Delivery (US)).