From 0d9e644dff673ed9a8982a1762b281d05df9a08a Mon Sep 17 00:00:00 2001
From: Komh
Date: Sat, 2 May 2026 09:42:08 +0000
Subject: [PATCH 1/2] [configure] Troubleshooting nodes stuck in NotReady

---
 ...Troubleshooting_nodes_stuck_in_NotReady.md | 132 ++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md

diff --git a/docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md b/docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md
new file mode 100644
index 00000000..9b506097
--- /dev/null
+++ b/docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md
@@ -0,0 +1,132 @@
---
kind:
  - Troubleshooting
products:
  - Alauda Container Platform
ProductsVersion:
  - 4.1.0,4.2.x
---
## Issue

A node reports `Status: NotReady` in `kubectl get nodes`. Pods scheduled on the node go into `Unknown` or `Terminating`; new pods cannot be placed on it; existing workloads on the node may keep running for a while but become invisible to the cluster.

The same surface symptom (`NotReady`) covers very different underlying problems. The two most common root causes are:

- The `kubelet` process on the node is **not running** at all (crashed, stopped, or its dependency, the container runtime, is down).
- The `kubelet` process is running but **cannot reach the API server**, so it cannot post status heartbeats, and the node controller marks the node `Unknown` once the node-monitor grace period expires.

This article walks through how to distinguish the two and how to fix each.

## Resolution

### 1. Confirm the node is genuinely NotReady

From a control-plane host, look at the node and its conditions block (replace `<node-name>` with the affected node's name, here and throughout):

```bash
kubectl get node <node-name>
kubectl describe node <node-name> | sed -n '/Conditions:/,/Addresses:/p'
```

The output shows one row per condition; the relevant ones are `Ready`, `MemoryPressure`, `DiskPressure`, `PIDPressure`, and `NetworkUnavailable`. If `Ready=Unknown`, the kubelet is not posting heartbeats, so focus on connectivity (step 4). If `Ready=False` with a specific message, the kubelet is reporting the failure itself, so focus on the kubelet (steps 2 and 3).

### 2. Inspect the kubelet on the node

Open a shell on the affected node, either over SSH or via a privileged debug pod with a host shell:

```bash
kubectl debug node/<node-name> -it --profile=sysadmin --image=<tooling-image>
chroot /host
```

Then check the kubelet unit:

```bash
systemctl status kubelet
journalctl -u kubelet -n 200 --no-pager
```

If the unit is `inactive` or repeatedly restarting:

- Look for `failed to start container manager` or `failed to start kubelet` errors that mention the runtime endpoint: these point at the container runtime (step 3).
- Look for `runtime is unable to ...` or `node not found` messages: these point at apiserver connectivity (step 4).
- Look for `out of memory` / `killed` messages: the node ran out of resources, so clean up before restarting the kubelet.

If the unit is healthy, restart it once to clear transient state:

```bash
systemctl restart kubelet
journalctl -u kubelet -f
```
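The journal checks above lend themselves to a quick scripted pass. Below is a minimal triage sketch, not part of any product tooling: the grep patterns are illustrative examples taken from the messages listed above, and exact log phrasing varies across kubelet versions.

```bash
#!/usr/bin/env bash
# Rough triage of the recent kubelet log: map the failure signatures
# described above to the troubleshooting step that usually follows.
log=$(journalctl -u kubelet -n 500 --no-pager)

if grep -Eqi 'failed to start (container manager|kubelet)' <<<"$log"; then
    echo "Runtime-related startup failure: continue with step 3."
elif grep -Eqi 'node not found|runtime is unable' <<<"$log"; then
    echo "Possible apiserver connectivity problem: continue with step 4."
elif grep -Eqi 'out of memory|killed' <<<"$log"; then
    echo "Resource exhaustion: free memory/disk before restarting the kubelet."
else
    echo "No known signature matched: read the full log manually."
fi
```

Treat the output as a hint about where to look next, not a verdict; the full journal remains the authority.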
### 3. Inspect the container runtime

The kubelet depends on a CRI runtime (typically `cri-o` or `containerd`). If the runtime is down, the kubelet refuses to declare the node Ready:

```bash
systemctl status crio        # or: systemctl status containerd
journalctl -u crio -n 200    # or: journalctl -u containerd -n 200
crictl info | head -40
```

A failing runtime usually comes down to one of the following:

- Disk pressure: `/var/lib/containers/` (cri-o) or `/var/lib/containerd/` is full.
- A leaked container blocking startup: clean it up with `crictl rm -f <container-id>`.
- Configuration drift: confirm `/etc/crio/crio.conf` (or `/etc/containerd/config.toml`) matches the rest of the fleet.

Once the runtime is restored, the kubelet usually recovers on its own; if not, restart it explicitly.

### 4. Confirm kubelet-to-apiserver connectivity

Even when the kubelet is running, the node only reports `Ready` if the kubelet can post heartbeats to the apiserver. From the affected node:

```bash
# From the node, hit the apiserver URL the kubelet uses
curl -k --resolve api.<cluster-domain>:6443:<apiserver-ip> \
  https://api.<cluster-domain>:6443/livez
```

A failure here points at the path between the node and the control plane: the cluster VIP, routing, MTU, firewall rules, or an expired certificate. Resolve the underlying network issue, then watch the kubelet log to confirm that heartbeats resume. If the kubelet client certificate has expired (look for `x509: certificate has expired or is not yet valid` in the kubelet log), the kubelet cannot re-register until the certificate is rotated via the cluster's certificate-rotation mechanism.

### 5. Verify the node returns to Ready

Once the underlying cause is addressed, watch the node from a control-plane host:

```bash
kubectl get node <node-name> -w
```

The node should transition `NotReady` → `Ready` within one or two heartbeat intervals (a few tens of seconds). If it does not, repeat steps 2–4 with the latest logs; multiple failures often stack (for example, a runtime that came back, but with a stale image cache that breaks the next pod sandbox).

## Diagnostic Steps

1. List the nodes, then print the `Ready` condition (including `lastHeartbeatTime`) for the affected node:

   ```bash
   kubectl get nodes -o wide
   kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}{"\n"}'
   ```

2. Capture the apiserver's view of the node's recent events:

   ```bash
   kubectl describe node <node-name> | sed -n '/Events:/,$p'
   ```

3. From the node itself, capture kubelet and runtime state for the symptomatic window:

   ```bash
   journalctl -u kubelet --since "30 minutes ago"
   journalctl -u crio --since "30 minutes ago"    # or containerd
   ```

4. If suspecting connectivity, trace the path from the node to the apiserver:

   ```bash
   ip route get <apiserver-ip>
   ss -tnp | grep <apiserver-ip>
   curl -kv --max-time 5 https://<apiserver-ip>:6443/livez
   ```

5. If the same node keeps flipping `Ready ↔ NotReady`, look at memory and PID pressure on the node: the kubelet takes the node out of `Ready` when it crosses its eviction thresholds, so a node that flaps is sometimes simply oversubscribed (see the sketch below).
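For the flapping case in item 5, read the node's pressure conditions next to the raw usage on the host. Below is a minimal sketch, assuming the kubelet's default eviction signals (`memory.available`, `nodefs.available`, `pid.available`); `<node-name>` is a placeholder as elsewhere in this article.

```bash
# Control-plane side: print every node condition with its status and
# message, so MemoryPressure/DiskPressure/PIDPressure flaps stand out.
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'

# Node side: compare actual headroom against the usual eviction signals.
free -m                          # memory headroom (memory.available)
df -h /var/lib/kubelet           # filesystem backing nodefs.available
ps -e --no-headers | wc -l       # current process count...
cat /proc/sys/kernel/pid_max     # ...versus the PID ceiling (pid.available)
```

If usage sits close to the configured eviction thresholds, the fix is capacity rather than the kubelet: move workloads off the node or give it more resources.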
From f830a178f890b34bf977a39489357d2ce46af85e Mon Sep 17 00:00:00 2001 From: Komh Date: Sat, 2 May 2026 13:12:51 +0000 Subject: [PATCH 2/2] [configure] Troubleshooting nodes stuck in NotReady --- docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md b/docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md index 9b506097..85b948c2 100644 --- a/docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md +++ b/docs/en/solutions/Troubleshooting_nodes_stuck_in_NotReady.md @@ -6,6 +6,8 @@ products: ProductsVersion: - 4.1.0,4.2.x --- + +# Troubleshooting nodes stuck in NotReady ## Issue A node reports `Status: NotReady` in `kubectl get nodes`. Pods scheduled on the node go into `Unknown` or `Terminating`; new pods cannot be placed on it; existing workloads on the node may keep running for a while but become invisible to the cluster.