On December 6, between 14:42 and 15:50 UTC, Incident Log Entries were not displayed or processed. The issue affected Incident Log Entries in the Web Application and in email notifications, as well as the Log Entries API endpoint. By 15:51 UTC, the backlog of Incident Log Entries had been processed successfully.
The issue was triggered by a bug fix deployed to a separate service that reads from the same shared data cluster as the service populating Incident Log Entries. The bug fix allowed a stalled data processing pipeline in that service to resume, which caused a large amount of data to be read from the shared cluster. This in turn triggered a quota enforcement mechanism, which prevented the service responsible for populating Incident Log Entries from consuming fresh data effectively. Due to a bug in the client libraries we use within these services, the quota was shared between the two services rather than enforced independently.
At 15:38 UTC, we deployed a solution, and signs of recovery appeared immediately. We monitored the processing and display of Incident Log Entries continuously until they had fully recovered at 15:50 UTC.
We have deployed a fix to the third-party library we use, which allows us to configure quotas independently for different services. This gives us greater observability and control over the consumption of data from the shared cluster, and prevents a single service from over-consuming data or degrading other services.
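To illustrate why independent quotas matter, here is a minimal sketch of a read quota that is tracked per quota key. It is not our production code or the actual client library: the `ReadQuota` class, the `simulate` helper, the key names, and the byte figures are all hypothetical and chosen only to show the difference between two services sharing one quota key and each service having its own.

```python
from collections import defaultdict


class ReadQuota:
    """Hypothetical per-window read budget, tracked separately per quota key."""

    def __init__(self, bytes_per_window: int) -> None:
        self.bytes_per_window = bytes_per_window
        self.used = defaultdict(int)  # bytes consumed so far, per quota key

    def try_read(self, quota_key: str, nbytes: int) -> bool:
        """Return True if the read fits in the key's budget, False if throttled."""
        if self.used[quota_key] + nbytes > self.bytes_per_window:
            return False
        self.used[quota_key] += nbytes
        return True


def simulate(pipeline_key: str, log_entries_key: str) -> bool:
    """Drain the budget with a resumed pipeline, then try one log-entry read."""
    quota = ReadQuota(bytes_per_window=1000)  # illustrative budget only
    # The resumed pipeline reads its backlog until it is throttled.
    while quota.try_read(pipeline_key, 100):
        pass
    # The log-entries service now attempts to consume fresh data.
    return quota.try_read(log_entries_key, 100)


# Both services accidentally share one quota key: the second is starved.
shared = simulate("shared-client", "shared-client")
# Each service has its own quota key: the second is unaffected.
independent = simulate("pipeline-service", "log-entries-service")

print(f"shared key, log entries can read: {shared}")
print(f"independent keys, log entries can read: {independent}")
```

In this sketch, the shared-key case reproduces the starvation described above, while separate keys let the log-entries reader continue even when the other service exhausts its own budget.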
We would like to express our sincere regret for any inconvenience that resulted from this incident. For any questions, comments, or concerns, please contact us at support@pagerduty.com.