
[configure] Cleaning the container runtime ephemeral storage on a node#687

Open
jing2uo wants to merge 1 commit into main from kb/2026-05/cleaning-the-container-runtime-ephemeral

Conversation

Collaborator

@jing2uo jing2uo commented May 2, 2026

Adds a new ACP KB article, filed under the configure area.

✅ Automated validation passed — eligible for auto-merge — 2 / 2 validation steps ran successfully on a real Kubernetes cluster using the article's commands (2026-05-02T16:35:29Z).

Suggested reviewers for the configure area

kb/OWNERS.md (source: product owners from the alauda-ai-base operator-list). Candidates for this area are picked automatically; please ignore any incorrect @-mentions.

Contributors without a GitHub handle (please ping manually if relevant to this area):

Contributor

coderabbitai Bot commented May 2, 2026

Warning

Rate limit exceeded

@jing2uo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 46 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization, which allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4afb54d-311b-4fff-b6ba-67ec92a82c61

📥 Commits

Reviewing files that changed from the base of the PR and between 9f27797 and de78a16.

📒 Files selected for processing (1)
  • docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md

Walkthrough

A new troubleshooting guide documents how to recover nodes when container runtime overlay storage becomes corrupted and prevents pod sandbox creation. The guide presents two remediation paths—in-place cleanup and reboot-based recovery—along with diagnostic steps to identify and debug the issue.

Changes

Container Runtime Ephemeral Storage Recovery Guide

| Layer / File(s) | Summary |
|---|---|
| **Problem Definition**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 2–20) | Document metadata, issue scope, and observable runtime symptoms (failed pod sandbox creation, cleanup errors, crashes, image pull failures). |
| **Remediation Overview**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 21–24) | Two remediation variants introduced: in-place cleanup (preferred) with fallback to reboot-based recovery. |
| **In-place Procedure**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 25–65) | Step-by-step instructions to cordon/drain, halt the kubelet, remove pod sandboxes (preserving host-network pods) via `crictl`, wipe storage, and restart the runtime without a node reboot. |
| **Reboot-based Procedure**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 66–114) | Alternative flow: disable the kubelet, reboot, stop runtime services, remove `/var/lib/containers/*`, run `crio wipe -f`, restart services, re-enable the kubelet, verify `Ready` state, and uncordon. |
| **Diagnostic Steps**<br>`docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md` (lines 115–140) | Guidance to identify affected nodes via pod events, collect runtime and kubelet logs, verify overlay state consistency, and capture SIGABRT backtraces for upstream debugging. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • chinaran

Poem

📋 A troubleshooting scroll unfurls with care,
Two paths to heal the storage layer rare,
When pods can't sandbox in the overlay deep,
We cordon, drain, and make the runtime sleep,
Then wipe the /var and spring back to the light! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped because CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding documentation about cleaning container runtime ephemeral storage on nodes. |
| Docstring Coverage | ✅ Passed | No functions were found in the changed files, so docstring coverage was not evaluated. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kb/2026-05/cleaning-the-container-runtime-ephemeral

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 16 minutes and 46 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md`:
- Line 95: The sentence advising to "stop it explicitly first" is ambiguous
relative to the earlier "stop the runtime" step; update the text around the line
that mentions "If `rm -rvf` returns `Device or resource busy` on
`/var/lib/containers/storage/overlay`" to explicitly say what to do: explain
that the runtime may have been auto-restarted by systemd or not fully terminated
and instruct the reader to confirm the service state and use an explicit
termination method (e.g., a full stop plus kill/terminate to clear remaining
processes) and to check for systemd restart policies before retrying removal;
reference the earlier stop step and add a short note to verify mounts are
released (e.g., check for lingering processes or mounts) before re-running the
rm command.
- Line 64: The in-place workflow stops after wiping /var/lib/containers but
omits the remaining recovery steps; update the section that references "step 4
of the reboot path" to append explicit instructions to restart the container
runtime (systemctl start crio.service or containerd.service), start the kubelet
(systemctl start kubelet.service), wait for the node to report Ready (kubectl
get node <node>), and uncordon the node (kubectl uncordon <node>), and include a
note to retry the rm -rvf if mounts persist and to run crio wipe -f for CRI-O
where applicable.
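The first comment's advice to confirm the runtime is fully stopped and its mounts released before retrying the removal could look like the following. This is a hedged sketch, assuming CRI-O under systemd and standard util-linux tools:

```shell
# Confirm the service is stopped and will not be auto-restarted by systemd
systemctl stop crio.service                 # or: containerd.service
systemctl show crio.service -p Restart      # Restart=always would explain silent restarts
systemctl is-active crio.service            # expect "inactive"

# Check for mounts still held under the overlay storage directory
findmnt -R /var/lib/containers/storage/overlay

# Recursively unmount any stragglers, then retry the removal
umount -R /var/lib/containers/storage/overlay 2>/dev/null || true
rm -rvf /var/lib/containers/storage/overlay
```

If `findmnt` still reports mounts after the unmount, a lingering container process is likely holding them and should be terminated before retrying.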
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05808100-17cd-4ed0-800a-6069392d490f

📥 Commits

Reviewing files that changed from the base of the PR and between 6f3336c and 9f27797.

📒 Files selected for processing (1)
  • docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md


- On a containerd node, the same logic with `crictl` is portable because `crictl` talks the CRI API regardless of runtime.

5. Continue with step 4 of the reboot path below (wipe `/var/lib/containers/`, restart the runtime).
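The sandbox-removal step that the quoted excerpt belongs to could be sketched as follows. This is an illustrative sketch, assuming `crictl` is configured for the node's runtime and `jq` is available; host-network sandboxes report `NODE` as their network namespace option in the CRI pod sandbox status:

```shell
# Remove all pod sandboxes except those sharing the host network namespace
crictl pods -q | while read -r POD; do
  NETNS=$(crictl inspectp "$POD" | jq -r '.status.linux.namespaces.options.network')
  if [ "$NETNS" != "NODE" ]; then
    crictl stopp "$POD"   # stop the sandbox first
    crictl rmp "$POD"     # then remove it
  fi
done
```

Because `crictl` speaks the CRI API, the same loop works against both CRI-O and containerd endpoints.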


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Complete the in-place workflow with kubelet restart and uncordon steps.

The in-place path ends by referencing step 4 of the reboot path (wipe storage), but it doesn't include the subsequent steps needed to bring the node back into service:

  • Re-enabling and starting the kubelet (reboot path step 5)
  • Waiting for the node to reach Ready state (step 6)
  • Uncordoning the node (step 7)

Without these steps, the operator will have a cordoned node with a clean runtime but no clear instructions to complete the recovery.

📝 Suggested completion of the in-place workflow

After line 64, add explicit steps:

-5. Continue with step 4 of the reboot path below (wipe `/var/lib/containers/`, restart the runtime).
+5. Wipe the runtime's storage and restart it:
+
+   ```bash
+   systemctl stop crio.service          # or:  systemctl stop containerd.service
+   rm -rvf /var/lib/containers/*
+   crio wipe -f                          # CRI-O only; containerd has no equivalent
+   systemctl start crio.service          # or:  systemctl start containerd.service
+   ```
+
+   If `rm -rvf` returns `Device or resource busy`, the runtime may still hold mounts. Ensure it is fully stopped, then retry.
+
+6. Start the kubelet:
+
+   ```bash
+   systemctl start kubelet.service
+   ```
+
+7. Wait a few minutes, then confirm from a control-plane host that the node has returned to `Ready`:
+
+   ```bash
+   kubectl get node <node>
+   ```
+
+8. Uncordon the node so that the scheduler may place pods on it again:
+
+   ```bash
+   kubectl uncordon <node>
+   ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md`
at line 64, The in-place workflow stops after wiping /var/lib/containers but
omits the remaining recovery steps; update the section that references "step 4
of the reboot path" to append explicit instructions to restart the container
runtime (systemctl start crio.service or containerd.service), start the kubelet
(systemctl start kubelet.service), wait for the node to report Ready (kubectl
get node <node>), and uncordon the node (kubectl uncordon <node>), and include a
note to retry the rm -rvf if mounts persist and to run crio wipe -f for CRI-O
where applicable.

@jing2uo jing2uo force-pushed the kb/2026-05/cleaning-the-container-runtime-ephemeral branch from c8c5349 to de78a16 on May 2, 2026 at 16:47
