Investigating Potential Issue
Incident Report for PagerDuty
Postmortem

Summary

On October 9, 2022, from 16:12 UTC to 16:42 UTC, PagerDuty experienced a failure in the event dispatching endpoint and its ability to process event data for one of our US region's inbound integrations.

What Happened

During the impact window, one of the components in the data pipeline experienced a spike in resource usage that forced it to stop processing part of the incoming event data. Events sent to us and destined for a specific global endpoint (“X-ERE”) failed, returning 500 responses. The system is designed to recover automatically from this type of error state and has done so regularly in the past. In this instance, however, the automated recovery did not occur, leaving the endpoint service in an error state. After a manual restart, the service returned to a healthy state as expected, and we resumed processing events fully as of 16:42 UTC.

What are we doing about this?

We are actively working to make our pipeline resilient to this and related issues so that they do not degrade our services. The team continues to investigate why the automated recovery did not trigger in this case, along with other edge cases, to ensure the system recovers automatically in future situations. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Oct 24, 2022 - 21:38 UTC

Resolved
We have resolved an incident where X-ERE Global Events in the US region were non-functional. There is no ongoing impact to customers. The period of impact was approximately 16:12-16:46 GMT on Oct. 9, 2022.
Please reach out to support@pagerduty.com if you have any concerns.
Posted Oct 09, 2022 - 16:53 UTC
Identified
We are continuing to investigate an incident where X-ERE Global Events in the US Region are currently non-functional. (Non-X-ERE Global Events continue to function normally.) We will provide further updates within 20 minutes.
Posted Oct 09, 2022 - 16:42 UTC
Investigating
We are investigating reports of a potential issue within PagerDuty. On confirmation, we will update with impact and severity within 15 minutes.
Posted Oct 09, 2022 - 16:31 UTC
This incident affected: Events API (Events API (US)).