On May 21 from 09:30 to 12:30 UTC, PagerDuty experienced a degradation in our ability to process events. As a result, incident creation and incident notification delivery were delayed for some customers.
An instance of the datastore software that serves as part of our event processing pipeline experienced localized network issues. Once our engineers identified this, they removed the affected instance from service. Our systems then recovered and we declared the incident resolved at 10:50 UTC. This initial impact was recorded as a separate incident at https://status.pagerduty.com/incidents/6sq06fbfxffg
However, in the process of replacing the removed instance, a bug in the datastore software resulted in additional complications, which led to further service impact beginning at around 12:03 UTC. Once this was identified, we reverted the deployment of the replacement instance and applied a manual workaround.
In each case, once the impact to the service ended, our event ingestion pipeline worked through the backlog of pending events that had accumulated behind the affected part of the pipeline. After the second period of impact, our systems fully recovered at 12:31 UTC.
We have upgraded our datastore software to a version that is not vulnerable to the bug that initiated this degradation. We have also been making architectural changes that improve availability during network issues, and we test configuration changes extensively as part of our deployment pipeline.
We would like to express our regret for this service interruption. For any questions, comments, or concerns, please reach out to email@example.com