On June 4 from 15:42 UTC to 17:48 UTC, PagerDuty experienced a degradation in its ability to process event data for inbound integrations. As a result, incident creation and notifications were delayed for some customers.
During the time of impact, our systems received an increased rate of unusually large inbound messages. These messages were handled correctly, but they caused internal message publishing latencies to rise. One component of our data processing pipeline in particular could not tolerate the increased latency, which led to degraded performance and a backlog of events and notifications for some customers.
During the incident, internal event traffic was shifted to a secondary processing component. This allowed the degraded component to clear its backlog of event data.
The component that experienced increased latency is actively being replaced with the secondary system. A significant portion of traffic has been moved to this new system since this incident, and we will complete this transition in the coming weeks.
We will also be updating our systems to more strictly enforce a global 512-kilobyte payload size limit for incoming event data traffic. Requests that exceed this limit will be rejected with HTTP 400 responses from the Events API; messages sent to email-based integrations will bounce.
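For integration authors, a simple client-side guard can catch oversized payloads before they are rejected. The sketch below checks the serialized size of an event against the 512-kilobyte limit described above; the `validate_event_size` helper name and the exact check are illustrative, not part of PagerDuty's API.

```python
import json

# 512-kilobyte global payload limit described in this notice.
MAX_PAYLOAD_BYTES = 512 * 1024


def validate_event_size(event: dict) -> bytes:
    """Serialize an event and raise if it would exceed the payload limit.

    Hypothetical helper: checking the size locally avoids sending a
    request that the Events API would reject with an HTTP 400 response.
    """
    body = json.dumps(event).encode("utf-8")
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError(
            f"payload is {len(body)} bytes; limit is {MAX_PAYLOAD_BYTES}"
        )
    return body


# A small event passes the check and is ready to send.
payload = validate_event_size({"summary": "disk full on host-01"})
```

Checking the size of the UTF-8 encoded JSON body (rather than counting characters) matches how a server-side byte limit would actually be enforced.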
We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at firstname.lastname@example.org.