On October 9, 2022, from 16:12 UTC to 16:42 UTC, PagerDuty experienced a failure in the event dispatching endpoint that impaired its ability to process event data for one of the inbound integrations in our US region.
During the impact window, one of the components in the data pipeline experienced a spike in resource usage that forced it to stop processing part of the incoming event data. Events sent to us and destined for a specific global endpoint (“X-ERE”) failed, returning 500 responses. The system is designed to recover automatically from this type of error state and has done so regularly in the past. In this instance, however, the automated recovery did not occur, leaving the endpoint service in an error state. After a manual restart, the service returned to a healthy state as expected, and we resumed processing events fully as of 16:42 UTC.
We are actively working on making our pipeline resilient against similar or related issues so that they do not degrade our services. The team continues to investigate why the automated recovery did not trigger in this case, along with other edge cases, to make sure the system recovers automatically in future situations. For any questions, comments, or concerns, please contact us at support@pagerduty.com.