On January 6, 2017, starting at 22:50 UTC and lasting approximately 80 minutes, PagerDuty experienced a service degradation related to Incident and Alert details, and to a lesser extent, Infrastructure Health and Event Rules. During this time, customers may have experienced missing information, errors, and application crashes when viewing Incident and Alert detail pages. Customers may also have noticed that data in Infrastructure Health was delayed, and that Event Rules were not applied to events ingested via integrations created during the degradation. Incident notifications were also missing some information.
No Incident notifications were lost or delayed during this degradation.
Starting at 22:50 UTC, a very large and unexpected increase in traffic caused the memory usage of one of our infrastructure services to grow beyond its expected value. Because this service runs in a Linux cgroup with a memory limit, the Linux OOM Killer was automatically invoked to kill the service. The service was automatically restarted, only to be killed immediately by the OOM Killer again, and this cycle repeated. The service in question plays a role in creating Incident and Alert detail data. It is also used in relation to Event Rules.
The failing service runs on a Mesos cluster that is shared with Infrastructure Health services. Unfortunately, the fast cycle of service restarts and OOM kills triggered a Linux kernel bug on all of our Mesos slaves at different but partially overlapping times. The bug caused the slaves to freeze up entirely, meaning the hosted services were effectively not running. For more details on the bug, see Linux kernel commit logs here and here. Because the Mesos cluster was unavailable for a short period of time, Infrastructure Health data processing was slightly delayed.
The failing service was returned to a healthy state by increasing the limit on its memory usage. The frozen Mesos slaves were force-rebooted, bringing the cluster back to its normal state.
This degradation was complex, and we are making changes at multiple levels to ensure that it does not happen again. We are upgrading the Linux kernel used on our Mesos slaves so that the OOM Killer bug is no longer present; this ensures that a failure of one service cannot impact other services. We are also investigating in more detail what the proper memory limit should be for the failing service, and ensuring that it does not use memory in an unbounded fashion. Finally, we are making changes to our various UIs so that they degrade more gracefully in the event that Incident or Alert detail data is not present.
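As an illustration only of what bounding memory usage can look like, a service can cap the amount of pending work it holds and reject new work when full, rather than letting its backlog grow with traffic. The Python sketch below is not the code of the service involved in this degradation; the class name, method names, and capacity value are hypothetical and chosen for the example.

```python
from collections import deque
from typing import Deque, Generic, List, TypeVar

T = TypeVar("T")


class BoundedBuffer(Generic[T]):
    """Hypothetical in-memory backlog that never holds more than `capacity` items."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: Deque[T] = deque()
        self.rejected = 0  # number of items shed because the buffer was full

    def offer(self, item: T) -> bool:
        """Try to enqueue an item; return False instead of growing past the cap."""
        if len(self._items) >= self._capacity:
            self.rejected += 1
            return False
        self._items.append(item)
        return True

    def drain(self, max_items: int) -> List[T]:
        """Remove and return up to `max_items` items, oldest first."""
        batch: List[T] = []
        while self._items and len(batch) < max_items:
            batch.append(self._items.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._items)


if __name__ == "__main__":
    buf: BoundedBuffer[int] = BoundedBuffer(capacity=3)
    accepted = [buf.offer(i) for i in range(5)]  # a burst larger than the cap
    print(accepted, "rejected:", buf.rejected)
    print("drained:", buf.drain(2), "remaining:", len(buf))
```

With a cap like this, a traffic spike shows up as rejected work that callers can retry or drop, and the process's memory footprint stays roughly proportional to the configured capacity instead of to the incoming load.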
We sincerely apologize if this degradation negatively impacted your team's visibility or response. We know that Incident details are an important part of our customers' incident resolution processes. The steps outlined above should prevent this type of issue from escalating to a degradation in the future. If you have questions or concerns, please contact us at email@example.com.