On July 16 from 14:55 to 17:48 UTC, PagerDuty experienced an outage in event ingestion and processing, and in notification delivery. During this time, incidents could not be triggered through the Events API, and notifications were not sent.
During a network outage impacting one of our cloud infrastructure providers, a network partition became unreachable. When this happened, hosts in PagerDuty's fleet that run the datastore software used by the event and notification processing pipelines lost contact with a leader node in the affected partition.
Normally, the hosts would fail over to an in-sync replica of the leader. Instead, for reasons that are not yet explained, they treated all of the in-sync replicas of the unreachable node as also missing. As a result, the event processing and notification services halted, and the Events API soon stopped accepting new events, returning HTTP 500 responses.
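To illustrate the expected failover behavior in general terms, here is a minimal sketch in Python. It is not the datastore's actual logic, which this report does not describe. The names `ReplicaGroup` and `elect_leader` and the node names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ReplicaGroup:
    """Hypothetical group of nodes: one leader plus its in-sync replicas."""
    leader: str
    in_sync_replicas: List[str] = field(default_factory=list)


def elect_leader(group: ReplicaGroup, reachable: Set[str]) -> Optional[str]:
    """Return the node that should serve as leader, or None if none is available."""
    # Keep the current leader while it can still be reached.
    if group.leader in reachable:
        return group.leader
    # Otherwise promote the first in-sync replica that is reachable.
    for replica in group.in_sync_replicas:
        if replica in reachable:
            return replica
    # No candidate is visible, so work for this group cannot proceed.
    return None


group = ReplicaGroup(leader="node-a", in_sync_replicas=["node-b", "node-c"])

# Expected behavior: the leader is unreachable, so a replica takes over.
healthy_view = {"node-b", "node-c"}
promoted = elect_leader(group, healthy_view)

# Behavior resembling the incident: the replicas are also treated as missing,
# so no leader can be chosen and processing halts.
faulty_view: Set[str] = set()
stalled = elect_leader(group, faulty_view)
```

In this sketch, `promoted` is `"node-b"` and `stalled` is `None`. The second case corresponds to the halt described above, where no in-sync replica was considered available even though the replicas themselves were outside the affected partition.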
Once the issue was identified, the nodes in the problematic network partition were decommissioned, and the rest of the cluster was reconfigured to take over and resume work. This allowed event and notification processing to resume.
We have updated the configuration of our datastore software to enable prompt failover during future major network disruptions. Investigation into the unexpected behavior of the datastore software is ongoing. We are also continuing to make architectural changes that will improve availability and reduce the impact of network issues on our services.
We would like to express our regret for the service interruption. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org