Using Events for Effective Troubleshooting in Managed Kubernetes

Kubernetes events provide essential insight into a cluster's activities and overall health. They are a valuable resource for troubleshooting and can feed alerting mechanisms that enable prompt remediation.

What are events?

Kubernetes events are a core part of the Kubernetes control plane, allowing cluster components and resources to record information about occurrences within the cluster. They are records of incidents, state changes, or observations about the cluster's activities and can be generated by various entities, including controllers, the scheduler, and user operations. In effect, Events are a condensed log format that is native to Kubernetes.

By default, events have a time to live (TTL) of 1 hour to prevent excessive resource consumption. However, the retention period can be adjusted to meet the specific needs and resource availability of your cluster (note that this may not be applicable for managed services).
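
On a self-managed control plane, this retention is controlled by the kube-apiserver --event-ttl flag. The sketch below assumes you can edit the API server's flags, which is not possible on managed offerings such as AKS, EKS, or GKE.

# Self-managed clusters only: extend event retention from the default 1 hour to 6 hours
# (all other kube-apiserver flags omitted for brevity)
kube-apiserver --event-ttl=6h0m0s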

While the default shelf life is sufficient for immediate troubleshooting, a historical record helps with post-mortem analysis, alerting, and various other cases. Events are stored in the API server's etcd database, which serves as the cluster's persistent key-value store.

Anatomy of a Kubernetes Event: A Kubernetes Event consists of several key components, each of which maps to a field on the Event object (see the example after this list):

  • Type: Indicates the nature of the event, such as Normal or Warning.

  • Reason: Describes why the event occurred, providing context for understanding the event's significance.

  • Object: The Kubernetes resource associated with the event, such as a Pod, Deployment, or Node.

  • Message: Provides additional details about the event, offering more specific information.

  • Timestamp: Indicates when the event occurred, enabling chronological tracking and analysis.

  • Source: Identifies the component that generated the event, such as a controller or a user operation.
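
As a quick illustration, the following kubectl invocation (a sketch; the column headings are arbitrary) projects these components using custom columns:

# Show the main fields of each Event side by side
kubectl get events -o custom-columns=TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,SOURCE:.source.component,LAST_SEEN:.lastTimestamp,MESSAGE:.message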

Accessing Events

You can retrieve Kubernetes events using the command-line tool kubectl or by interacting with the Kubernetes API programmatically. These tools let you query and filter events based on your specific requirements, allowing you to effectively monitor and troubleshoot cluster activities. Note that these methods only retrieve events within the default shelf life, i.e., those still stored in the etcd database.

kubectl get events
aritra [ ~ ]$ kubectl get events
LAST SEEN   TYPE      REASON                      OBJECT                                   MESSAGE
45s         Normal    NodeHasSufficientMemory     node/aks-agentpool-40622340-vmss000005   Node aks-agentpool-40622340-vmss000005 status is now: NodeHasSufficientMemory
27m         Warning   EvictionThresholdMet        node/aks-agentpool-40622340-vmss000005   Attempting to reclaim memory
4m20s       Normal    NodeHasSufficientMemory     node/aks-agentpool-40622340-vmss000008   Node aks-agentpool-40622340-vmss000008 status is now: NodeHasSufficientMemory
56m         Normal    NodeHasInsufficientMemory   node/aks-agentpool-40622340-vmss000008   Node aks-agentpool-40622340-vmss000008 status is now: NodeHasInsufficientMemory
16m         Normal    NodeHasSufficientMemory     node/aks-agentpool-40622340-vmss000009   Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory
17s         Warning   EvictionThresholdMet        node/aks-agentpool-40622340-vmss000009   Attempting to reclaim memory
2m57s       Normal    NodeHasSufficientMemory     node/aks-agentpool-40622340-vmss00000t   Node aks-agentpool-40622340-vmss00000t status is now: NodeHasSufficientMemory
60m         Warning   EvictionThresholdMet        node/aks-agentpool-40622340-vmss00000t   Attempting to reclaim memory
30s         Normal    NodeHasSufficientMemory     node/aks-agentpool-40622340-vmss00000u   Node aks-agentpool-40622340-vmss00000u status is now: NodeHasSufficientMemory
26m         Warning   EvictionThresholdMet        node/aks-agentpool-40622340-vmss00000u   Attempting to reclaim memory

To get a list of events useful for troubleshooting, you can filter out the Normal events:

kubectl get events --field-selector type!=Normal

You can also restrict the list to a specific node or pod:

kubectl get events --field-selector=involvedObject.kind=Node,involvedObject.name=<node_name>

Events are also accessible through the Kubernetes API at '/api/v1/events'. Similar filters are available to retrieve only the interesting events.
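
For example, you can reach this endpoint through kubectl's raw API access (a minimal sketch; the field selector syntax is the same one kubectl accepts):

# All events in the cluster, straight from the API server
kubectl get --raw "/api/v1/events"

# Only the non-Normal events, using a field selector in the query string
kubectl get --raw "/api/v1/events?fieldSelector=type!=Normal"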

In the next section, we will explore how to enable and access longer-term events in the major managed Kubernetes services.

Enabling Events in AKS, EKS, GKE

AKS

Events are stored as part of Container Insights. Once you enable Container Insights on your cluster, Events are available to query from Logs.

I will now outline the steps for enabling Container Insights on a pre-existing cluster. Note that Container Insights can also be configured during the initial cluster creation process.

  1. On the Azure Portal, click on Insights under Monitoring on the left pane.

  2. Click on the 'Configure monitoring' button, which will open a frame on the right side. Next, click on 'Configure' at the bottom left of the pane. You can leave 'Managed Prometheus' and 'Managed Grafana' unchecked.

  3. After Container Insights is enabled, click on Logs on the left pane.

  4. Search for "Events" and click on the 'Kubernetes Events' pane.

  5. This will run the default query and list all the events from the last 7 days. Note that events of type "Normal" are not stored by default; this behavior can be changed through a ConfigMap. The events and logs are stored in the configured Log Analytics workspace, which you can also query from the command line, as shown below.
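
The sketch below assumes the default Container Insights schema (a KubeEvents table with KubeEventType, Reason, and Message columns) and that the Azure CLI log-analytics commands are available; replace the workspace GUID with your own.

# Warning events from the last 7 days, pulled from the Log Analytics workspace
az monitor log-analytics query \
  --workspace <log-analytics-workspace-guid> \
  --analytics-query "KubeEvents | where TimeGenerated > ago(7d) | where KubeEventType == 'Warning' | project TimeGenerated, Namespace, Name, Reason, Message"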

GKE

GKE stores Events by default, without any need for configuration. The Kubernetes Events view is available in the Observability tab for the cluster. A number of filtering options are available, in addition to 'Explore in Monitoring', which can be used to create custom queries.
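
Events also land in Cloud Logging, so they can be pulled from the command line as well. The sketch below assumes system logging is enabled and that events are exported under the 'events' log name; replace PROJECT_ID and CLUSTER_NAME with your own values.

# Recent Warning events for a cluster, read from Cloud Logging
gcloud logging read \
  'logName="projects/PROJECT_ID/logs/events" AND resource.labels.cluster_name="CLUSTER_NAME" AND jsonPayload.type="Warning"' \
  --limit 20 --freshness 1d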

EKS

The EKS setup for events is more complex compared to the other two cloud providers. It requires deploying a Container Insights image to the cluster, through which the logs are stored and made accessible in CloudWatch. The detailed blog is available here.

3rd party tools (Datadog, New Relic, Dynatrace, etc.)

Most observability tools provide an interface for Kubernetes events, although these events may not persist for long (unlike the options described above).

Using Events for Troubleshooting

Events are useful for triaging a number of issues. Below are a few scenarios where they are particularly effective:

  • Autoscaler: When autoscaling is not functioning as intended, examining events from the source component 'cluster-autoscaler' can help identify the root cause of the issue (see the command sketch after this list).

  • Upgrade: A cluster upgrade is a complex process that can fail in many ways. During an upgrade, both the control plane and the data plane are upgraded: new nodes are created with the new version while the old nodes are cordoned and drained. Events related to the upgrade can help you understand at which stage it failed so that it can be resumed from that stage.

  • Pod Scheduling: Examining all events generated by the 'default-scheduler' component can provide valuable insights into the scheduler's decision-making process and help troubleshoot why certain pods were not scheduled.

  • Node issues: Reviewing events on a specific node can help identify any node-specific issues, such as memory or disk saturation. Additionally, if the Node Problem Detector (NPD) is installed on the node, it can expose more complex events like 'KubeletDown'.
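
For the autoscaler and scheduler scenarios, a field selector on the event source is usually enough. The sketch below assumes the components report events under their default source names (cluster-autoscaler and default-scheduler); adjust the names if your installation differs.

# Events emitted by the cluster autoscaler, across all namespaces
kubectl get events -A --field-selector source=cluster-autoscaler

# Warning events from the default scheduler, e.g. FailedScheduling
kubectl get events -A --field-selector source=default-scheduler,type=Warning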

When troubleshooting issues with your cluster, it's often helpful to review Kubernetes events before resorting to more advanced tools.