Issue affecting Events API
Incident Report for PagerDuty
Postmortem

Summary

On February 19, 2020, from 23:30 UTC to 01:00 UTC on February 20, PagerDuty experienced a major incident that degraded event ingestion via the Events API. Later, on February 20 from 16:58 UTC to 17:47 UTC, processing of email events was also degraded.

During these times, customers sending events through our Events API or through inbound emails to email integrations would have experienced delayed notifications.

What Happened

PagerDuty uses HAProxy as both a load balancer and a local proxy. Our HAProxy configurations are managed via an infrastructure codebase.

On February 19 at 11:50 UTC, a change intended to upgrade a single set of load balancers was applied to our infrastructure codebase. The intended scope was limited to the load balancers for our databases, which are a subset of our load balancers. Due to a previously unknown dependency between modules in our infrastructure codebase, the configuration change was instead applied to all servers in our infrastructure. As a result, the load balancer software was upgraded to a newer version that was incompatible with many of our load balancers. Internal services were then unable to communicate with each other via their respective APIs, which led to a delay in event processing. Some load balancer failures were immediate, while others showed signs of trouble only when their respective processes were reloaded.
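To illustrate how a single module dependency can widen the scope of a change, here is a minimal sketch in Python. It is not taken from PagerDuty's infrastructure codebase: the module names, role names, and the helpers `modules_affected_by` and `roles_affected_by` are all hypothetical. It walks a made-up dependency graph in reverse to list every server role that would pick up a change to a shared package.

```python
from collections import deque

# Hypothetical module graph: each module lists the modules it depends on.
# None of these names come from the incident report.
DEPENDS_ON = {
    "db_load_balancer": ["haproxy_package"],
    "base_server": ["local_proxy"],
    "local_proxy": ["haproxy_package"],  # the easy-to-miss link
}

# Hypothetical mapping of server roles to the top-level module each one applies.
ROLE_MODULES = {
    "database-lb": "db_load_balancer",
    "api-server": "base_server",
    "worker": "base_server",
}


def modules_affected_by(changed, depends_on):
    """Return every module that directly or transitively depends on `changed`."""
    # Invert the graph so we can walk from a module to its dependents.
    dependents = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


def roles_affected_by(changed, depends_on, role_modules):
    """Return the sorted list of roles whose top-level module is affected."""
    affected = modules_affected_by(changed, depends_on)
    return sorted(role for role, module in role_modules.items() if module in affected)
```

In this toy graph, a change to the shared `haproxy_package` module reaches every role through `local_proxy`, while a change made only in `db_load_balancer` stays confined to the `database-lb` role. Running a check like this before applying a change is one way to compare the intended scope against the actual one.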

The incidents were resolved through a combination of downgrading the unintentionally upgraded software on some load balancers and upgrading the load balancer software so that it was compatible with the newer version.

What We Are Doing About This

As part of our post-mortem analysis, we’ve identified areas for improvement that will make incidents of this nature less likely and quicker to identify in the future:

  • We’re correcting the dependency chain in our infrastructure codebase to prevent this specific situation from recurring
  • We’re changing the way our configuration deployments perform validation to ensure they explicitly fail when they encounter these types of incompatibilities
  • We’re adding more explicit software version pinning to our configuration management to avoid unintentional upgrades or version changes
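The validation and version-pinning items above could look roughly like the following Python sketch. It is an assumption-laden illustration, not PagerDuty's tooling: the role names, the version strings, the `PinViolation` exception, and the `check_plan` helper are all hypothetical. The idea is that a deployment plan is rejected outright whenever it would install a version other than the one explicitly pinned for a role, or touch a role that has no pin at all.

```python
class PinViolation(Exception):
    """Raised when a planned deployment would install an unpinned or wrong version."""


# Hypothetical pins: role name -> the exact load balancer version allowed on it.
# The version strings are placeholders, not real release numbers.
PINNED_VERSIONS = {
    "database-lb": "2.0",
    "api-server": "1.0",
}


def check_plan(planned, pins):
    """Validate a plan of {role: version}.

    Returns True if every role matches its pin; otherwise raises
    PinViolation with one message per offending role.
    """
    problems = []
    for role, version in sorted(planned.items()):
        pinned = pins.get(role)
        if pinned is None:
            problems.append(f"{role}: no pinned version, refusing to install {version}")
        elif version != pinned:
            problems.append(f"{role}: planned {version} but pinned to {pinned}")
    if problems:
        # Fail explicitly instead of silently applying a partial or unintended upgrade.
        raise PinViolation("; ".join(problems))
    return True
```

For example, `check_plan({"database-lb": "2.0"}, PINNED_VERSIONS)` passes, whereas a plan that also sets `"api-server"` to `"2.0"` raises `PinViolation` before anything is deployed. Collecting every problem before raising is a design choice that lets an operator see the full extent of an out-of-scope change in one report.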

We apologize for any impact on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Feb 28, 2020 - 19:51 UTC

Resolved
Our Events API has been successfully recovered.
Posted Feb 20, 2020 - 00:50 UTC
Identified
We have identified the issue behind our Events API and are rolling out a solution.
Posted Feb 20, 2020 - 00:22 UTC
Investigating
We are experiencing an issue where our Events API is down and are currently working to recover.
Posted Feb 20, 2020 - 00:11 UTC
This incident affected: Events API.