
[configure] Cleaning the container runtime ephemeral storage on a node#687

Open
jing2uo wants to merge 1 commit into main from kb/2026-05/cleaning-the-container-runtime-ephemeral

Conversation

Collaborator

@jing2uo jing2uo commented May 2, 2026

Adds a new ACP KB article, filed under the configure area.

✅ Automated validation passed — eligible for auto-merge — 2 / 2 validation steps ran successfully on a real Kubernetes cluster using the article's commands (2026-05-02T16:35:29Z).

Suggested reviewers for the configure area

kb/OWNERS.md (source: product owners from the alauda-ai-base operator-list). Candidates for this area are picked automatically; please ignore any incorrect @-mentions.

Contributors without a GitHub handle (please ping manually if relevant to this area):

Contributor

coderabbitai Bot commented May 2, 2026

Warning

Rate limit exceeded

@jing2uo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 46 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization, which allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4afb54d-311b-4fff-b6ba-67ec92a82c61

📥 Commits

Reviewing files that changed from the base of the PR and between 9f27797 and de78a16.

📒 Files selected for processing (1)
  • docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md

Walkthrough

A new troubleshooting guide documents how to recover nodes when container runtime overlay storage becomes corrupted and prevents pod sandbox creation. The guide presents two remediation paths—in-place cleanup and reboot-based recovery—along with diagnostic steps to identify and debug the issue.

Changes

Container Runtime Ephemeral Storage Recovery Guide

| Layer / File(s) | Summary |
|---|---|
| **Problem Definition**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 2–20) | Document metadata, issue scope, and observable runtime symptoms (failed pod sandbox creation, cleanup errors, crashes, image pull failures). |
| **Remediation Overview**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 21–24) | Two remediation variants introduced: in-place cleanup (preferred) with fallback to reboot-based recovery. |
| **In-place Procedure**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 25–65) | Step-by-step instructions to cordon/drain, halt the kubelet, remove pod sandboxes (preserving host-network pods) via `crictl`, wipe storage, and restart the runtime without a node reboot. |
| **Reboot-based Procedure**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 66–114) | Alternative flow: disable the kubelet, reboot, stop runtime services, remove `/var/lib/containers/*`, run `crio wipe -f`, restart services, re-enable the kubelet, verify `Ready` state, and uncordon. |
| **Diagnostic Steps**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 115–140) | Guidance to identify affected nodes via pod events, collect runtime and kubelet logs, verify overlay state consistency, and capture SIGABRT backtraces for upstream debugging. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • chinaran

Poem

📋 A troubleshooting scroll unfurls with care,
Two paths to heal the storage layer rare,
When pods can't sandbox in the overlay deep,
We cordon, drain, and make the runtime sleep,
Then wipe the /var and spring back to the light! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped because CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding documentation about cleaning container runtime ephemeral storage on nodes. |
| Docstring Coverage | ✅ Passed | No functions were found in the changed files, so docstring coverage was not evaluated. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kb/2026-05/cleaning-the-container-runtime-ephemeral

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 16 minutes and 46 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md`:
- Line 95: The sentence advising to "stop it explicitly first" is ambiguous
relative to the earlier "stop the runtime" step; update the text around the line
that mentions "If `rm -rvf` returns `Device or resource busy` on
`/var/lib/containers/storage/overlay`" to explicitly say what to do: explain
that the runtime may have been auto-restarted by systemd or not fully terminated
and instruct the reader to confirm the service state and use an explicit
termination method (e.g., a full stop plus kill/terminate to clear remaining
processes) and to check for systemd restart policies before retrying removal;
reference the earlier stop step and add a short note to verify mounts are
released (e.g., check for lingering processes or mounts) before re-running the
rm command.
- Line 64: The in-place workflow stops after wiping /var/lib/containers but
omits the remaining recovery steps; update the section that references "step 4
of the reboot path" to append explicit instructions to restart the container
runtime (systemctl start crio.service or containerd.service), start the kubelet
(systemctl start kubelet.service), wait for the node to report Ready (kubectl
get node <node>), and uncordon the node (kubectl uncordon <node>), and include a
note to retry the rm -rvf if mounts persist and to run crio wipe -f for CRI-O
where applicable.
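The first comment's advice to confirm the runtime is fully stopped and its mounts released before retrying the removal could look like the following. This is a hedged sketch, assuming CRI-O under systemd and standard util-linux tools:

```shell
# Confirm the service is stopped and will not be auto-restarted by systemd
systemctl stop crio.service                 # or: containerd.service
systemctl show crio.service -p Restart      # Restart=always would explain silent restarts
systemctl is-active crio.service            # expect "inactive"

# Check for mounts still held under the overlay storage directory
findmnt -R /var/lib/containers/storage/overlay

# Recursively unmount any stragglers, then retry the removal
umount -R /var/lib/containers/storage/overlay 2>/dev/null || true
rm -rvf /var/lib/containers/storage/overlay
```

If `findmnt` still reports mounts after the unmount, a lingering container process is likely holding them and should be terminated before retrying.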
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05808100-17cd-4ed0-800a-6069392d490f

📥 Commits

Reviewing files that changed from the base of the PR and between 6f3336c and 9f27797.

📒 Files selected for processing (1)
  • docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md


- On a containerd node, the same logic with `crictl` is portable because `crictl` talks the CRI API regardless of runtime.

5. Continue with step 4 of the reboot path below (wipe `/var/lib/containers/`, restart the runtime).
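The sandbox-removal step that the quoted excerpt belongs to could be sketched as follows. This is an illustrative sketch, assuming `crictl` is configured for the node's runtime and `jq` is available; host-network sandboxes report `NODE` as their network namespace option in the CRI pod sandbox status:

```shell
# Remove all pod sandboxes except those sharing the host network namespace
crictl pods -q | while read -r POD; do
  NETNS=$(crictl inspectp "$POD" | jq -r '.status.linux.namespaces.options.network')
  if [ "$NETNS" != "NODE" ]; then
    crictl stopp "$POD"   # stop the sandbox first
    crictl rmp "$POD"     # then remove it
  fi
done
```

Because `crictl` speaks the CRI API, the same loop works against both CRI-O and containerd endpoints.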


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Complete the in-place workflow with kubelet restart and uncordon steps.

The in-place path ends by referencing step 4 of the reboot path (wipe storage), but it doesn't include the subsequent steps needed to bring the node back into service:

  • Re-enabling and starting the kubelet (reboot path step 5)
  • Waiting for the node to reach Ready state (step 6)
  • Uncordoning the node (step 7)

Without these steps, the operator will have a cordoned node with a clean runtime but no clear instructions to complete the recovery.

📝 Suggested completion of the in-place workflow

After line 64, add explicit steps:

-5. Continue with step 4 of the reboot path below (wipe `/var/lib/containers/`, restart the runtime).
+5. Wipe the runtime's storage and restart it:
+
+   ```bash
+   systemctl stop crio.service          # or:  systemctl stop containerd.service
+   rm -rvf /var/lib/containers/*
+   crio wipe -f                          # CRI-O only; containerd has no equivalent
+   systemctl start crio.service          # or:  systemctl start containerd.service
+   ```
+
+   If `rm -rvf` returns `Device or resource busy`, the runtime may still hold mounts. Ensure it is fully stopped, then retry.
+
+6. Start the kubelet:
+
+   ```bash
+   systemctl start kubelet.service
+   ```
+
+7. Wait a few minutes, then confirm from a control-plane host that the node has returned to `Ready`:
+
+   ```bash
+   kubectl get node <node>
+   ```
+
+8. Uncordon the node so that the scheduler may place pods on it again:
+
+   ```bash
+   kubectl uncordon <node>
+   ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md`
at line 64, The in-place workflow stops after wiping /var/lib/containers but
omits the remaining recovery steps; update the section that references "step 4
of the reboot path" to append explicit instructions to restart the container
runtime (systemctl start crio.service or containerd.service), start the kubelet
(systemctl start kubelet.service), wait for the node to report Ready (kubectl
get node <node>), and uncordon the node (kubectl uncordon <node>), and include a
note to retry the rm -rvf if mounts persist and to run crio wipe -f for CRI-O
where applicable.

@jing2uo jing2uo force-pushed the kb/2026-05/cleaning-the-container-runtime-ephemeral branch from c8c5349 to de78a16 on May 2, 2026 at 16:47
