On November 2nd, 2017, from 14:59 to 16:21 UTC, PagerDuty experienced a degradation in our ability to process events. As a result, incident creation and incident notification delivery were delayed for some customers.
PagerDuty engineers were performing preventative maintenance on one of our internal datastores to add capacity to it. Due to a bug in the datastore software, the process of adding capacity left the datastore unavailable to internal PagerDuty services for about 7 minutes. Due to a separate problem with our tooling, the same datastore was then only partially available for approximately 22 more minutes. After these events, our event ingestion pipeline had to process the backlog of pending events that had accumulated. A small subset of these events took longer than expected to process.
We will soon be upgrading our datastore software version to ensure that we are no longer vulnerable to the bug that initiated this degradation. We have also improved our tooling and documentation to address the tooling problem.
We have also improved the parallelism of event processing in our event ingestion pipeline. This will allow us to recover from a backlog much more quickly in the future.
We would like to express our regret for the service degradation. For any questions, comments, or concerns, please reach out to firstname.lastname@example.org.