Issues with accessing some PD accounts
Incident Report for PagerDuty
Postmortem

Summary

Between 15:24 UTC and 18:55 UTC on Wednesday, December 15, PagerDuty experienced an incident that caused the web and mobile applications to be intermittently inaccessible and delayed incident notifications and status updates. The public APIs (REST APIs and Events APIs) also experienced periods of degraded performance and inaccessibility during this time.

What Happened

The incident was triggered by an issue with our network infrastructure that caused connections coming into PagerDuty to fail or experience high latency. These issues were resolved around 16:10 UTC. The remaining time was spent assessing our systems to ensure full stability.

During the incident, customer impact was as follows:

  • Notifications were notably impacted from 15:11 UTC until 16:48 UTC (2021-Dec-15). During this time, about 76.8% of notifications were out of SLA, with 36.44% of accounts and 1.35% of users affected.
  • Event ingestion via the API was impacted starting at 15:12 UTC and recovered at approximately 15:56 UTC. During this time, many requests failed to reach PagerDuty servers. However, none of the messages that were received were dropped.
  • The rest of the public API did not experience major drops in availability or increased error rates during the incident.
  • The PagerDuty web application was slow or inaccessible primarily from 16:00 UTC to 16:20 UTC, during which approximately 84% of traffic was successfully served.
  • The PagerDuty mobile application experienced two periods of slowness or failure to load data: from 16:00 UTC to 16:20 UTC, during which approximately 82% of requests were served successfully, and again from 16:32 UTC to 16:52 UTC, during which approximately 60.2% of requests were served successfully.
  • Status update notifications were delayed from 15:42 UTC until 16:34 UTC.

During the incident, the team monitored the recovery of the network infrastructure as well as the performance of our systems, and corrected areas where the network instability had caused backups of work queues.

What We’re Doing

To improve reliability in the future, we will:

  • Revise our disaster recovery procedure based on lessons learned from this incident.
  • Revise our incident response procedure to better account for network infrastructure issues.
  • Set up additional monitors to alert us of issues making connections into our systems.
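
As an illustration of the last item, the following is a minimal sketch of an inbound-connection monitor, not a description of PagerDuty's actual tooling. The names `ProbeResult`, `tcp_probe`, and `should_alert`, as well as the threshold values, are hypothetical and chosen only for this example. The idea is to time TCP connection attempts and alert when too many of them fail or are slow.

```python
import socket
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    ok: bool                     # True if the TCP connection was established
    latency_s: Optional[float]   # connect time in seconds, None on failure


def tcp_probe(host: str, port: int, timeout_s: float = 3.0) -> ProbeResult:
    """Attempt one TCP connection to host:port and time it."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return ProbeResult(True, time.monotonic() - start)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return ProbeResult(False, None)


def should_alert(
    results: List[ProbeResult],
    max_bad_ratio: float = 0.2,   # hypothetical threshold
    max_latency_s: float = 1.0,   # hypothetical threshold
) -> bool:
    """Alert when the share of failed or slow probes exceeds max_bad_ratio."""
    if not results:
        return False
    bad = sum(
        1
        for r in results
        if not r.ok or r.latency_s is None or r.latency_s > max_latency_s
    )
    return bad / len(results) > max_bad_ratio
```

A probe like this would presumably need to run from outside the monitored network, since the failure described above affected connections coming in rather than traffic between internal systems.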

For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Dec 23, 2021 - 20:46 UTC

Resolved
All PagerDuty systems are back to a normal operating state. This incident is now resolved/closed.
Posted Dec 15, 2021 - 17:22 UTC
Update
We are seeing significant signs of recovery and are working to process any events or notifications that may have been delayed during the issue.
Posted Dec 15, 2021 - 16:52 UTC
Monitoring
We are seeing some signs of improvement and are continuing to work with our providers to restore service for affected users.
Posted Dec 15, 2021 - 16:23 UTC
Identified
We have confirmed a network connectivity issue that is affecting customers' ability to connect to the PagerDuty US Service Region. We continue to work with our providers to mitigate the issue and restore service for affected users. Our support portal is also affected, so customers may experience delayed responses from our support team.
Posted Dec 15, 2021 - 15:56 UTC
Update
We are continuing to investigate issues affecting customers' ability to reach the PagerDuty application in the US Service Region.
Posted Dec 15, 2021 - 15:45 UTC
Update
We are continuing to investigate this issue.
Posted Dec 15, 2021 - 15:37 UTC
Investigating
We are currently investigating issues with accessing PagerDuty accounts and will provide an update as we learn more.
Posted Dec 15, 2021 - 15:35 UTC
This incident affected: Notification Delivery (Notification Delivery (US)), Web Application (Web Application (US), Web Application (EU)), and Mobile Application (Mobile Application (US)).