Incident Log Entries not Displaying or Processing
Incident Report for PagerDuty
Postmortem

Summary

On December 6, between 14:42 and 15:50 UTC, we experienced an issue where Incident Log Entries were not being displayed or processed. This affected the display of Incident Log Entries in the Web Application and in email notifications, as well as the Log Entries API endpoint. By 15:51 UTC the backlog of Incident Log Entries had been processed successfully.

What Happened

The issue was triggered when a bug fix was deployed to a separate service that reads from a data cluster shared with the service that populates Incident Log Entries. The bug fix allowed a stalled data processing pipeline in that separate service to resume, which caused a large amount of data to be read from the shared cluster. This in turn triggered a quota enforcement mechanism. Due to a bug in the client libraries we use within these services, the quota was shared between the two services rather than enforced independently, so the service responsible for populating Incident Log Entries was throttled as well and could not consume fresh data effectively.

At 15:38 UTC, a solution was deployed, and signs of recovery appeared immediately. We monitored the processing and display of Incident Log Entries continuously until they had fully recovered at 15:50 UTC.

What Are We Doing About This

We have deployed a fix to the third-party library we use that allows us to configure the quotas independently for different services. This gives us greater observability and control over the consumption of data from the shared cluster, and prevents a single service from over-consuming data and degrading other services.
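
To illustrate the difference between a shared quota and independently enforced quotas, here is a minimal sketch in Python. It is a toy model and not our production code or the library's actual API: the `QuotaEnforcer` class, the client identifiers, and the quota size are all hypothetical, chosen only to show why two services that are counted under one identity can starve each other.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaEnforcer:
    """Toy read quota for a single time window, keyed by client identity.

    Hypothetical model: the real cluster and client library are not
    described in this report, so names and numbers here are assumptions.
    """
    bytes_per_window: int
    used: dict = field(default_factory=dict)

    def try_read(self, client_id: str, nbytes: int) -> bool:
        """Return True if the read fits in the quota left for client_id."""
        spent = self.used.get(client_id, 0)
        if spent + nbytes > self.bytes_per_window:
            return False  # this identity is throttled for the window
        self.used[client_id] = spent + nbytes
        return True


def log_entries_can_read(backlog_id: str, log_entries_id: str) -> bool:
    """Simulate one window: a backlog drain, then a fresh-data read."""
    enforcer = QuotaEnforcer(bytes_per_window=1000)
    # The resumed pipeline reads until it hits its quota.
    while enforcer.try_read(backlog_id, 100):
        pass
    # The log-entries service now tries to consume fresh data.
    return enforcer.try_read(log_entries_id, 100)


# Both services counted under one identity: the backlog drain uses up
# the quota that the log-entries service also depends on.
shared = log_entries_can_read("shared-client", "shared-client")

# Each service counted separately: the backlog drain only throttles itself.
independent = log_entries_can_read("backlog-service", "log-entries-service")

print(shared, independent)  # False True
```

In this sketch the only change between the two runs is the identity under which each service's reads are counted, which is the kind of separation that independently configured quotas are meant to provide.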

We would like to express our sincere regret for any inconvenience that resulted from this incident. For any questions, comments, or concerns, please contact us at support@pagerduty.com.

Posted Dec 11, 2019 - 22:41 UTC

Resolved
We've confirmed this issue is resolved. Incident Log Entries that were delayed have now been processed. You should now be able to see the Incident Log Entries within the Incident Timeline and within your email notifications. The Log Entries endpoint has also recovered.
Posted Dec 06, 2019 - 15:54 UTC
Monitoring
After deploying a solution, we are now seeing signs of recovery. We are continuing to monitor.
Posted Dec 06, 2019 - 15:48 UTC
Identified
We believe we have determined the source of the issue and are currently reviewing a solution that should mitigate the issue with displaying and processing Incident Log Entries within the Web Application. This issue affects displaying incident details within the incident timeline, as well as the incident details within emails, and the log entries API endpoint.
Posted Dec 06, 2019 - 15:33 UTC
Investigating
We are currently experiencing an issue displaying and processing incident log entries. Our engineering teams are actively investigating this issue and are working on a solution. The Mobile App, Notifications and our APIs are unaffected.
Posted Dec 06, 2019 - 14:57 UTC
This incident affected: Web Application.