Event Ingestion Issue
Incident Report for PagerDuty
Postmortem

Summary 

On October 31, 2022, from approximately 22:15 UTC until 22:40 UTC, a few customers in the US service region received 500 errors for events sent to the Events API. Events receiving errors were retried successfully within minutes. Webhooks, notifications, inbound email events, and the REST API were not impacted at all by this incident.
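For clients that send events, transient 500 responses like those described above are usually handled by retrying with backoff. The following is a minimal sketch of that pattern, not PagerDuty's implementation: `send` is a hypothetical callable that submits one event and returns the HTTP status code, and the attempt count and delays are illustrative assumptions.

```python
import random
import time
from typing import Callable


def send_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-5xx status or attempts run out.

    `send` is a hypothetical callable (an assumption of this sketch) that
    submits one event and returns the HTTP status code of the response.
    Returns the last status code observed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status = 0
    for attempt in range(max_attempts):
        status = send()
        if status < 500:
            # Not a server error; this sketch only retries 5xx responses.
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    return status
```

The `sleep` parameter is injectable so the function can be exercised in tests without waiting; in normal use the default `time.sleep` applies.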

What Happened 

The errors were precipitated by the phased rollout of a configuration change, which began at 21:02 UTC and continued until 22:40 UTC, when the rollout was paused. During this rollout, hosts were marked as "healthy" but were repeatedly restarting, causing allocations to fail. Those allocations were transferred to other functioning hosts, so indicators of degradation in our system did not appear until 22:25 UTC, when a monitor for excessive failed allocations was triggered. As the configuration change continued to roll out, additional hosts began to repeatedly restart. Customer impact began at 22:15 UTC, when enough hosts were restarting that there were insufficient healthy hosts to process all Events API requests. Alerts notified responders, and a major incident was triggered at 22:25 UTC after the first few 500 errors. The deployment was paused and failed events were immediately re-queued for processing. At 22:40 UTC, we observed recovery and an end to delayed customer events. The configuration change was rolled forward to a known good configuration and additional hosts were provisioned. Clean-up actions and decommissioning of affected hosts continued until 02:42 UTC on November 1.
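The failure mode above, hosts that pass a point-in-time health check while restarting repeatedly, can be caught by counting restarts over a sliding window and gating the rollout on the result. The sketch below illustrates that idea only; it does not describe PagerDuty's tooling, and the class name, function name, window length, and thresholds are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable


class RestartTracker:
    """Track host restarts in a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float = 600.0, max_restarts: int = 3) -> None:
        self.window_seconds = window_seconds
        self.max_restarts = max_restarts
        self._restarts: Dict[str, Deque[float]] = defaultdict(deque)

    def record_restart(self, host: str, timestamp: float) -> None:
        """Record that `host` restarted at `timestamp` (seconds)."""
        self._restarts[host].append(timestamp)

    def is_flapping(self, host: str, now: float) -> bool:
        """True if `host` restarted at least `max_restarts` times in the window."""
        events = self._restarts[host]
        # Drop restarts that have aged out of the window.
        while events and events[0] < now - self.window_seconds:
            events.popleft()
        return len(events) >= self.max_restarts


def should_pause_rollout(
    tracker: RestartTracker,
    hosts: Iterable[str],
    now: float,
    max_flapping_fraction: float = 0.1,
) -> bool:
    """Pause when more than `max_flapping_fraction` of hosts are flapping."""
    host_list = list(hosts)
    if not host_list:
        return False
    flapping = sum(1 for h in host_list if tracker.is_flapping(h, now))
    return flapping / len(host_list) > max_flapping_fraction
```

A rollout controller could call `should_pause_rollout` between phases, so that a host reporting "healthy" between restarts still counts against the rollout.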

What We Are Doing About This

We have identified the change that caused hosts to restart and have removed it from the host configuration. We also plan to improve our monitoring for failed allocations to catch this issue before it becomes customer-impacting in future rollouts. We apologize for the inconvenience that this has caused. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Nov 08, 2022 - 01:43 UTC

Resolved
We can confirm the resolution of an issue which briefly resulted in a small number of customers receiving 500 HTTP errors in response to events sent to the Events API. There is no ongoing impact to customers. Please reach out to support@pagerduty.com if you have any concerns.
Posted Nov 01, 2022 - 00:21 UTC
Update
We continue to observe no further customer impact. We are continuing to monitor and pursue remediation of the identified cause. We will provide another update within 30 minutes.
Posted Nov 01, 2022 - 00:07 UTC
Monitoring
There continues to be no further impact to the processing of events. We are continuing to monitor and pursue remediation of the identified cause. We will provide another update within 30 minutes.
Posted Oct 31, 2022 - 23:38 UTC
Identified
We are no longer observing any failures of events received through the Events API; we are continuing to investigate the issue. We will provide another update within 30 minutes.
Posted Oct 31, 2022 - 23:11 UTC
Update
We are investigating an incident where a small number of PagerDuty customers in the US service region are experiencing issues with sending events to the Events API. Impacted customers may see an HTTP 500 status code when sending events. We will provide further updates within 20 minutes.
Posted Oct 31, 2022 - 22:50 UTC
Investigating
We are investigating potential issues with event ingestion within the PagerDuty US service region. On confirmation, we will update with further impact and severity within 15 minutes.
Posted Oct 31, 2022 - 22:37 UTC
This incident affected: Events API (Events API (US)).