On October 20, 2022, from approximately 10:40 UTC until 10:42 UTC, some customers could not connect and received 500 errors from the web and mobile sites as well as the REST and Events APIs. Webhooks and notifications were also delayed, and event ingestion was throttled for some customers. We continued to process inbound emails without interruption throughout the incident.
The outage was precipitated by network connectivity issues between our service region and one of the disaster recovery regions, starting at 10:28 UTC and lasting until 10:44 UTC. Leading indicators of degradation in our system did not appear until 10:33 UTC, with the effects limited to internal systems. Customer impact began at 10:40 UTC. At that time, critical path services with hard dependencies on the service region restarted and could not start up correctly because the network was inaccessible. These services, as well as all impacted internal services, began self-healing as network connectivity was restored. We observed recovery starting at 10:42 UTC, which coincided with the end of failed customer events. Alerts notified responders, and a major incident was automatically triggered at 10:44 UTC. Some services took several minutes to recover fully, and clean-up actions and throttle removals continued until as late as 11:30 UTC.
We have identified several latent issues that negatively impacted reliability in the face of network connectivity problems, and we are working to remove them. We also plan to improve our network connectivity detection. Finally, we will reproduce this scenario in a non-production environment to verify future fault tolerance. We apologize for the inconvenience that this has caused. For any questions, comments, or concerns, please contact us at support@pagerduty.com.