API network connection issues
Incident Report for PagerDuty
Postmortem

Summary 

On August 31st, 2021, between 19:00 UTC and 21:16 UTC, we experienced a degradation in the US Service Region during which connections to the PagerDuty website and APIs failed intermittently and some outbound notifications were delayed. Our EU Service Region was not affected.

What happened?

At around 19:00 UTC on August 31st, Amazon Web Services began to experience networking issues with a single Availability Zone in the us-west-2 region. This led to intermittent availability of our load balancers in the affected Availability Zone. From 19:23 UTC onwards, we began to manually shift traffic away from the affected Availability Zone, and by 21:16 UTC we were confident that the issue had been mitigated.

Because the affected Availability Zone was only partially degraded, the resulting failures were intermittent rather than complete. As a result, it took longer than anticipated to fully understand the customer impact.

What are we doing about this?

We are updating our alerts and monitoring to have better visibility when these kinds of problems occur. This will ensure we are better placed to understand the impact should a similar problem occur in the future.
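
To illustrate why an intermittent failure in a single zone can be hard to see in aggregate metrics, here is a minimal Python sketch that computes connection failure rates per zone instead of overall. It is an illustration only and does not describe PagerDuty's actual monitoring. The `ProbeResult` type, the zone labels, and the 20% threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ProbeResult(NamedTuple):
    """One connection attempt (hypothetical record shape)."""
    zone: str  # hypothetical zone label, e.g. "zone-a"
    ok: bool   # True if the connection attempt succeeded


def failure_rate_by_zone(results: Iterable[ProbeResult]) -> Dict[str, float]:
    """Return the fraction of failed probes for each zone."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for result in results:
        totals[result.zone] += 1
        if not result.ok:
            failures[result.zone] += 1
    return {zone: failures[zone] / totals[zone] for zone in totals}


def degraded_zones(results: Iterable[ProbeResult],
                   threshold: float = 0.2) -> List[str]:
    """Return zones whose failure rate is at or above the threshold.

    The 0.2 default is an arbitrary value chosen for this example.
    """
    rates = failure_rate_by_zone(results)
    return sorted(zone for zone, rate in rates.items() if rate >= threshold)


# Hypothetical data: 100 probes per zone, with 30 failures in "zone-a" only.
results = (
    [ProbeResult("zone-a", i >= 30) for i in range(100)]
    + [ProbeResult("zone-b", True) for _ in range(100)]
    + [ProbeResult("zone-c", True) for _ in range(100)]
)

overall_failure_rate = sum(not r.ok for r in results) / len(results)
```

With this made-up data, the overall failure rate is 10%, which would stay under a 20% global threshold, while the per-zone view flags `zone-a` at 30%. Breaking the signal down by zone is one possible way to surface a partial, single-zone problem sooner.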

We understand how important and critical our platform is for our customers. We apologize for any impact this incident had on you and your teams. As always, we stand by our commitment to provide the most reliable and resilient platform in the industry. If you have any questions, please reach out to support@pagerduty.com.

Posted Sep 30, 2021 - 20:43 UTC

Resolved
This incident has been resolved.
Posted Aug 31, 2021 - 22:13 UTC
Update
The implemented fix succeeded and we have fully recovered.
Posted Aug 31, 2021 - 22:12 UTC
Monitoring
We have implemented a fix and are seeing some signs of recovery. We are continuing to monitor.
Posted Aug 31, 2021 - 21:58 UTC
Identified
Webhook delivery is also impacted, such that some webhooks may be delivered late. Aside from webhook delivery, for all events that we are successfully receiving, notifications are still being delivered within SLA. We are continuing to investigate possible solutions.
Posted Aug 31, 2021 - 21:00 UTC
Investigating
A portion of PagerDuty users are experiencing network connection issues when connecting to PagerDuty’s APIs. We are investigating possible solutions.
Posted Aug 31, 2021 - 20:39 UTC
This incident affected: REST API (REST API (US)), Webhooks (Webhooks (US)), and Events API (Events API (US)).