Inbound event processing delays
Incident Report for PagerDuty
Postmortem

Summary

On June 4, 2018, from 15:42 UTC to 17:48 UTC, PagerDuty experienced a degradation in its ability to process event data for inbound integrations. As a result, incident creation and notifications were delayed for some customers.

What Happened?

During the time of impact, our systems received an increased rate of unusually large inbound messages. These messages were handled correctly, but they caused internal message publishing latencies to increase. One component of our data processing pipeline in particular was unable to tolerate the increased latency, which led to degraded performance and a backlog of events and notifications for some customers.

During the incident, internal event traffic was shifted to a secondary processing component. This allowed the degraded component to clear its backlog of event data.

What Are We Doing About This?

The component that experienced increased latency is actively being replaced with the secondary system. A significant portion of traffic has been moved to this new system since this incident, and we will complete this transition in the coming weeks.

We will also be updating our systems to more strictly enforce a global 512-kilobyte message payload size limit for incoming event data traffic. Requests that exceed the limit will receive status 400 responses from the Events API, and oversized messages sent to email-based integrations will receive email bouncebacks.
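For clients that want to avoid sending oversized events, a pre-send size check might look roughly like the sketch below. It is illustrative only and is not PagerDuty's implementation: the function names are hypothetical, the event is assumed to be sent as UTF-8 JSON, and "512 kilobytes" is read as 512 * 1024 bytes because the report does not say which definition of kilobyte applies.

```python
import json

# Assumption: "512 kilobytes" is read as 512 * 1024 bytes. The report does not
# say whether the limit uses 1000-byte or 1024-byte kilobytes.
MAX_PAYLOAD_BYTES = 512 * 1024


def encoded_size(event: dict) -> int:
    """Return the size in bytes of the event once serialized as UTF-8 JSON."""
    return len(json.dumps(event).encode("utf-8"))


def within_limit(event: dict, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """Return True if the serialized event fits within the assumed limit."""
    return encoded_size(event) <= limit


if __name__ == "__main__":
    small = {"summary": "disk full on host-1"}
    large = {"summary": "x" * (600 * 1024)}
    print(within_limit(small))
    print(within_limit(large))
```

A check like this only helps if the bytes measured are the same bytes that are sent, so the serialization used for the check should match the one used for the request body. Truncating or summarizing long fields before sending is one way to stay under the limit.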

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Jun 08, 2018 - 20:49 UTC

Resolved
We have identified the issue and taken steps to remediate it. Notifications and inbound event data are being processed normally. All systems are now operational. We are continuing to monitor the situation for any recurrence.
Posted Jun 04, 2018 - 18:16 UTC
Investigating
PagerDuty is currently experiencing an issue processing inbound event data, and notifications are delayed. Our engineering team is investigating and taking action.
Posted Jun 04, 2018 - 16:32 UTC
This incident affected: Notification Delivery.