On April 9, 2021, between 19:45 and 20:15 UTC, PagerDuty experienced a 30-minute incident that resulted in duplicate notifications and an inability to reply directly from phone and SMS notifications. All other components (API, event ingestion, web and mobile applications) were unaffected.
A configuration change had been made to the load balancer cluster responsible for routing requests from 3rd-party notification providers. This change required a restart of the load balancer processes, rather than a hot reload, to take effect, and that restart had not yet taken place. Around the time of the incident, a service downstream from the load balancers was deployed. Because the restart was still pending, the load balancers had not loaded the configuration needed to route traffic to the new downstream instances. Upon detection of service degradation, our automated response process was triggered and our on-call engineers began rectifying the issue. Service was restored by restarting the load balancer processes across the fleet so that they loaded the correct configuration.
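The failure mode described above, a restart-requiring configuration change that has not yet been picked up by the running processes, can be illustrated with a minimal sketch of a pre-deployment check. This is not PagerDuty's actual tooling: the `LoadBalancerState` record, its fields, and both function names are hypothetical, and the sketch assumes that the time of the last restart-requiring change and the process start time are both available for each host.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadBalancerState:
    """Hypothetical per-host record; field names are illustrative assumptions."""
    host: str
    config_changed_at: float   # epoch seconds of the last restart-requiring config change
    process_started_at: float  # epoch seconds when the load balancer process last started


def hosts_pending_restart(fleet: List[LoadBalancerState]) -> List[str]:
    """Return hosts whose process started before the latest config change,
    meaning the running process has not loaded that change."""
    return [s.host for s in fleet if s.config_changed_at > s.process_started_at]


def safe_to_deploy_downstream(fleet: List[LoadBalancerState]) -> bool:
    """A downstream deployment is considered safe only if no host is pending a restart."""
    return not hosts_pending_restart(fleet)


if __name__ == "__main__":
    fleet = [
        LoadBalancerState("lb-1", config_changed_at=100.0, process_started_at=200.0),
        LoadBalancerState("lb-2", config_changed_at=300.0, process_started_at=200.0),
    ]
    print(hosts_pending_restart(fleet))
    print(safe_to_deploy_downstream(fleet))
```

A check along these lines could gate downstream deployments or raise an alert whenever any host in the fleet is running with a stale configuration.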
We understand how critical our platform is to you and your team(s), and we are taking the appropriate next steps. We have identified areas for improvement and have prioritized preventative measures to ensure this particular issue does not recur. For any questions, comments, or concerns, please reach out to email@example.com.