From 23:30 UTC on February 19 to 01:00 UTC on February 20, PagerDuty experienced a major incident that degraded event ingestion via the Events API. Then, on February 20 from 16:58 UTC to 17:47 UTC, we had a degradation in the processing of email events.
During these times, customers sending events through our Events API or by inbound email to email integrations would have experienced delayed notifications.
PagerDuty uses HAProxy both as a load balancer and as a local proxy. Our HAProxy configurations are managed via an infrastructure codebase.
On February 19 at 11:50 UTC, a change intended to upgrade a single set of load balancers was applied to our infrastructure codebase. The intended scope of this change was limited to the load balancers for our databases, which are a subset of our load balancers. Due to a previously unknown dependency between modules in our infrastructure codebase, this config change was instead applied to all servers in our infrastructure. The change upgraded the load balancer software to a newer version, which was incompatible with many of our load balancers. The resulting failures showed up as internal services being unable to communicate with each other via their respective APIs, which led to a delay in event processing. Some load balancer failures were immediate, while others only exhibited signs of trouble when their respective processes were reloaded.
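To illustrate how a hidden module dependency can widen the scope of a change, here is a minimal sketch of a pre-apply scope check. It is not our actual tooling: the module names, server group names, and both mappings are hypothetical, and it assumes the infrastructure codebase can export which modules depend on which and which server groups each module configures.

```python
from collections import deque

def affected_groups(changed_module, dependents, applies_to):
    """Return every server group reached by a change to changed_module.

    dependents: module -> modules that include or depend on it.
    applies_to: module -> server groups that the module configures.
    """
    seen = {changed_module}
    queue = deque([changed_module])
    # Breadth-first walk over the "is depended on by" edges.
    while queue:
        module = queue.popleft()
        for dep in dependents.get(module, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    groups = set()
    for module in seen:
        groups.update(applies_to.get(module, ()))
    return groups

def unintended_scope(changed_module, intended, dependents, applies_to):
    """Server groups the change would touch beyond the intended ones."""
    return affected_groups(changed_module, dependents, applies_to) - set(intended)

# Hypothetical layout: the version pin is shared through a base module.
DEPENDENTS = {
    "haproxy_version": ["db_lb", "haproxy_base"],
    "haproxy_base": ["api_lb", "local_proxy"],
}
APPLIES_TO = {
    "db_lb": ["database-lbs"],
    "api_lb": ["api-lbs"],
    "local_proxy": ["all-servers"],
}

extra = unintended_scope("haproxy_version", ["database-lbs"], DEPENDENTS, APPLIES_TO)
print(sorted(extra))  # ['all-servers', 'api-lbs']
```

In this made-up layout, a change meant only for the database load balancers also reaches the API load balancers and the local proxies, because the version setting is shared through a base module. A check like this could block a rollout when the unintended set is not empty.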
The incidents were resolved by a combination of downgrading the unintentionally upgraded software on some load balancers and, on others, upgrading the load balancer software so that it was compatible with the newer version.
As part of our post-mortem analysis, we’ve identified areas of improvement which will help us ensure that incidents of this nature are less likely, and can be more quickly identified in the future:
We apologize for any impact on our customers. If you have any further questions, please reach out to firstname.lastname@example.org