[configure] Cleaning the container runtime ephemeral storage on a node#687
Conversation
Warning: Rate limit exceeded

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap; account admins can enable it under billing.

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered again. We recommend spacing out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Paid plans have higher rate limits than the trial, open-source, and free plans. In all cases, further reviews are re-allowed after a brief timeout. Please see our FAQ for further information.
Walkthrough

A new troubleshooting guide documents how to recover nodes when container runtime overlay storage becomes corrupted and prevents pod sandbox creation. The guide presents two remediation paths, in-place cleanup and reboot-based recovery, along with diagnostic steps to identify and debug the issue.

Changes: Container Runtime Ephemeral Storage Recovery Guide
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md`:
- Line 95: The sentence advising to "stop it explicitly first" is ambiguous
relative to the earlier "stop the runtime" step; update the text around the line
that mentions "If `rm -rvf` returns `Device or resource busy` on
`/var/lib/containers/storage/overlay`" to explicitly say what to do: explain
that the runtime may have been auto-restarted by systemd or not fully terminated
and instruct the reader to confirm the service state and use an explicit
termination method (e.g., a full stop plus kill/terminate to clear remaining
processes) and to check for systemd restart policies before retrying removal;
reference the earlier stop step and add a short note to verify mounts are
released (e.g., check for lingering processes or mounts) before re-running the
rm command.
- Line 64: The in-place workflow stops after wiping /var/lib/containers but
omits the remaining recovery steps; update the section that references "step 4
of the reboot path" to append explicit instructions to restart the container
runtime (systemctl start crio.service or containerd.service), start the kubelet
(systemctl start kubelet.service), wait for the node to report Ready (kubectl
get node <node>), and uncordon the node (kubectl uncordon <node>), and include a
note to retry the rm -rvf if mounts persist and to run crio wipe -f for CRI-O
where applicable.
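The first comment's advice to verify that mounts are actually released before re-running `rm -rvf` can be sketched as a small shell helper. This is a hypothetical sketch: the function name is invented, and it reads the mount table from stdin so the check itself can be exercised against a captured `/proc/mounts` rather than a live node.

```shell
# List overlay mounts still rooted under a given storage directory.
# Input is in /proc/mounts format (device mountpoint fstype options
# dump pass); prints each matching mountpoint, one per line.
lingering_overlay_mounts() {
  local root="$1"
  awk -v root="$root" '$3 == "overlay" && index($2, root) == 1 { print $2 }'
}

# On a live node you would run (hypothetical usage):
#   systemctl is-active crio.service   # expect "inactive" before wiping
#   lingering_overlay_mounts /var/lib/containers/storage/overlay < /proc/mounts
# An empty result means the overlay mounts are released and the removal
# can be retried safely.
```

If the helper prints anything, the runtime either was not fully stopped or was auto-restarted by a systemd restart policy, which is exactly the ambiguity the comment asks the guide to spell out.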
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 05808100-17cd-4ed0-800a-6069392d490f
📒 Files selected for processing (1)
docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md
- On a containerd node, the same logic with `crictl` is portable because `crictl` talks the CRI API regardless of runtime.
5. Continue with step 4 of the reboot path below (wipe `/var/lib/containers/`, restart the runtime).
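The portability point in the context line above — `crictl` speaks the CRI API for both CRI-O and containerd — could be sketched as a runtime-agnostic sandbox cleanup loop. A minimal sketch, assuming `crictl` is configured on the node; the `CRICTL` override and the function name are invented here so the loop can be dry-run without a runtime (`crictl stopp` and `crictl rmp` are the standard CRI CLI subcommands).

```shell
# Stop and remove every pod sandbox via the CRI API; the same loop works
# unchanged on CRI-O and containerd nodes. Sandbox IDs are read from
# stdin, one per line.
CRICTL="${CRICTL:-crictl}"

cleanup_sandboxes() {
  while read -r pod; do
    [ -n "$pod" ] || continue
    "$CRICTL" stopp "$pod"   # stop the pod sandbox
    "$CRICTL" rmp "$pod"     # remove the stopped sandbox
  done
}

# Live usage (hypothetical): crictl pods -q | cleanup_sandboxes
```

Setting `CRICTL=echo` turns the loop into a dry run that merely prints the commands it would issue.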
Complete the in-place workflow with kubelet restart and uncordon steps.
The in-place path ends by referencing step 4 of the reboot path (wipe storage), but it doesn't include the subsequent steps needed to bring the node back into service:
- Re-enabling and starting the kubelet (reboot path step 5)
- Waiting for the node to reach `Ready` state (step 6)
- Uncordoning the node (step 7)
Without these steps, the operator will have a cordoned node with a clean runtime but no clear instructions to complete the recovery.
📝 Suggested completion of the in-place workflow
After line 64, add explicit steps:
-5. Continue with step 4 of the reboot path below (wipe `/var/lib/containers/`, restart the runtime).
+5. Wipe the runtime's storage and restart it:
+
+ ```bash
+ systemctl stop crio.service # or: systemctl stop containerd.service
+ rm -rvf /var/lib/containers/*
+ crio wipe -f # CRI-O only; containerd has no equivalent
+ systemctl start crio.service # or: systemctl start containerd.service
+ ```
+
+ If `rm -rvf` returns `Device or resource busy`, the runtime may still hold mounts. Ensure it is fully stopped, then retry.
+
+6. Start the kubelet:
+
+ ```bash
+ systemctl start kubelet.service
+ ```
+
+7. Wait a few minutes, then confirm from a control-plane host that the node has returned to `Ready`:
+
+ ```bash
+ kubectl get node <node>
+ ```
+
+8. Uncordon the node so that the scheduler may place pods on it again:
+
+ ```bash
+ kubectl uncordon <node>
+    ```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@docs/en/solutions/Cleaning_the_container_runtime_ephemeral_storage_on_a_node.md`
at line 64, the in-place workflow stops after wiping /var/lib/containers but
omits the remaining recovery steps; update the section that references "step 4
of the reboot path" to append explicit instructions to restart the container
runtime (systemctl start crio.service or containerd.service), start the kubelet
(systemctl start kubelet.service), wait for the node to report Ready (kubectl
get node <node>), and uncordon the node (kubectl uncordon <node>), and include a
note to retry the rm -rvf if mounts persist and to run crio wipe -f for CRI-O
where applicable.
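The "wait for the node to report Ready" step that both versions of this comment call for can be scripted rather than eyeballed. A minimal sketch with an invented helper name; it parses `kubectl get node <node> --no-headers` output (columns NAME STATUS ROLES AGE VERSION) from stdin, so the status check itself needs no cluster.

```shell
# Succeed when the STATUS column begins with "Ready" (which also covers
# "Ready,SchedulingDisabled" on a still-cordoned node) and fail for
# "NotReady" or empty input.
node_is_ready() {
  awk '$2 ~ /^Ready/ { ok = 1 } END { exit ok ? 0 : 1 }'
}

# Live usage (hypothetical), polling until the node recovers:
#   until kubectl get node <node> --no-headers | node_is_ready; do
#     sleep 10
#   done
#   kubectl uncordon <node>
```

Anchoring on `^Ready` rather than an exact match matters because a cordoned node reports `Ready,SchedulingDisabled` until it is uncordoned.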
Force-pushed c8c5349 to de78a16 (Compare)
Adds a new ACP KB article, filed under the configure area.

✅ Automated verification passed, eligible for auto-merge: 2/2 verification steps ran successfully on a real Kubernetes cluster using the article's commands (2026-05-02T16:35:29Z).

Suggested reviewers for the configure area are auto-selected from candidates per kb/OWNERS.md (source: the product owners in the alauda-ai-base operator-list); if you were @-mentioned by mistake, please ignore. Contributors without a GitHub handle (please ping manually for this area):