Between 15:24 UTC and 18:55 UTC on Wednesday, December 15, PagerDuty experienced an incident that made the web and mobile applications intermittently inaccessible and delayed incident notifications and status updates. The public APIs (REST APIs and Events APIs) also experienced periods of degraded performance and inaccessibility during this time.
The incident was triggered by an issue with our network infrastructure that caused connections coming into PagerDuty to fail or experience high latency. These issues were resolved around 16:10 UTC. The remaining time was spent assessing our systems to ensure full stability.
During the incident, customer impact was as follows:
For notifications, the period of notable impact ran from 15:11 UTC until 16:48 UTC on December 15, 2021. During this time, about 76.8% of notifications were out of SLA, affecting 36.44% of accounts and 1.35% of users.
During the incident, the team monitored the recovery of the network infrastructure as well as the performance of our systems, and corrected areas where the network instability had caused backups of work queues.
To improve reliability in the future, we will:
For any questions, comments, or concerns, please reach out to support@pagerduty.com.