Issue affecting Events Ingestion
Incident Report for PagerDuty
Postmortem

Summary

On the morning of March 7, from approximately 4:45 AM UTC to 6:00 AM UTC, PagerDuty experienced an incident affecting event processing and incident creation. During this time, we experienced issues with our Events API, as well as delays in incident creation and, as a result, in notification processing.

What Happened

Our engineers were alerted to notification processing delays at 4:50 AM UTC. As an initial remediation step, we redeployed the associated service, but we did not see recovery after this action. We then isolated the issue to one of our Kafka clusters, which was operating in a degraded state due to underlying hardware issues affecting our servers. As soon as we discovered this, we began replacing the failing servers. The underlying server issues subsequently recovered, at which point customer impact stopped. We continued the replacement process to completion in order to bring the Kafka cluster back to a healthy state.

Customers would have seen issues for approximately 1 hour from when the hardware degradation began. There was no impact to our systems after this point as we continued to replace the failing servers.

Next Steps

Our engineering team is going to investigate why these underlying instance issues impacted our event-to-notification pipeline. We expect to be resilient to issues of this kind, and we will investigate why we weren't in this case and what we can do to remedy this deficiency going forward.

We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Mar 13, 2020 - 21:27 UTC

Resolved
We have recovered. All systems are operating normally.
Posted Mar 07, 2020 - 08:25 UTC
Monitoring
We have identified the issue and have taken steps to mitigate it. We are currently seeing signs of recovery.
Posted Mar 07, 2020 - 05:55 UTC
Investigating
We're currently experiencing an issue with Event Ingestion, and our Engineering team is investigating.
Posted Mar 07, 2020 - 05:29 UTC
This incident affected: Events API.