Skip to content

feat(vm): cocoon vm net --nics N — runtime NIC resize#42

Merged
CMGS merged 33 commits into
masterfrom
feat/vm-net-resize
May 12, 2026
Merged

feat(vm): cocoon vm net --nics N — runtime NIC resize#42
CMGS merged 33 commits into
masterfrom
feat/vm-net-resize

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 12, 2026

Summary

  • New CLI cocoon vm net --nics N <vm> resizes the NIC count on a running VM (CH only). To add, cocoon allocates fresh TAP/CNI/bridge plumbing and calls vm.add-net; to remove, it pops NICs from the tail (LIFO) via vm.remove-device and tears down host plumbing.
  • network.Network interface refactor: Config(numNICs, existing...) → variadic Add(specs ...AddSpec) / Remove(indices ...int) for symmetric NIC-level ops; Delete and Remove share a tearDown helper.

Design decisions

  • Host plumbing torn down immediately after vm.remove-device returns — CH's HTTP API has no reliable eject signal (the slot/ioeventfd cleanup only happens when the guest writes B0EJ via ACPI/SCI). User accepts in-guest driver hang risk and is responsible for quiescing the NIC before reducing the count.
  • CH only for v1 — Firecracker has no NIC hot-plug/unplug API; the resize call returns netresize.ErrUnsupportedBackend.
  • CH device IDs derived from MAC (cocoon-net-<mac-no-colons>) — not persisted in types.NetworkConfig; remove path looks up live IDs via vm.info + MAC match so boot-time NICs (CH auto _netN) and cocoon-added NICs both work. clone.hotSwapNets uses the same id convention via the shared addCocoonNIC helper.
  • DB persistence with rollbacknetResizeAdd rolls back CH + plumbing on persistence failure; netResizeRemove always truncates the cocoon record once CH accepts the eject (leaked host plumbing is reclaimed by gc / cocoon vm rm) so the next resize can converge.
  • Zero-NIC VMs cannot be resized up — VMConfig has no bridge hint; the original provider choice is lost once all NICs are removed.

Test plan

  • make lint 0 issues on linux/amd64 + darwin/amd64
  • go test ./... all green
  • Manual: Linux cloudimg VM — boot with 1 NIC, resize to 3, verify DHCP on all three, resize back to 1
  • Manual: bridge backend — same flow
  • Manual: Windows VM — resize down after disabling NIC in guest
  • Manual: FC VM — vm net --nics 1 returns ErrUnsupportedBackend
  • Manual: clone smoke — verify hotSwapNets still works post-refactor, no warnings under normal flow

CMGS added 9 commits May 12, 2026 14:13
Add takes variadic AddSpec (per-NIC index + optional Existing for recovery).
Remove tears down individual NIC plumbing by index, preserves netns.
Extracted cniDel helper shared by Add (rollback+recovery), Remove, delNICs.
CNI: tearDownNICs(records, deleteTAP, bestEffort) → ids; both Delete and Remove
go through it. deleteRecords sweeps DB by id in one Update. delNICs stays for
gc.Collect (different lock context).

Bridge: tearDownTAPs(indices, bestEffort) shared by Remove and CleanupTAPs.
NetResize dispatches to per-NIC add or LIFO remove. Add path: plumbing.Add →
vm.add-net with deterministic cocoon-net-<mac> id → DB append; rolls back the
current NIC's host plumbing if vm.add-net fails. Remove path: build MAC→id map
from vm.info → vm.remove-device → optional plumbing.Remove (skipped via
--keep-host-on-remove) → DB truncate. Persists per NIC so a mid-op failure
leaves the record matching CH state.
CLI for resizing a running VM's NIC count. Wires the resolved Resizer
(currently CH only) with the matching host plumbing.

Refactors resolveAttacher to also return conf + hypervisor.Hypervisor so
NetResize can call Inspect to discover the active backend without re-resolving.
Hoists "bridge" / "cni" backend literals to types.BackendBridge /
types.BackendCNI so cmd, network/cni, and network/bridge share one identifier.
After the remove-device loop, re-fetch vm.info and warn for any old ID
still present in info.Config.Nets. The warning lets us trace bad
clone states without failing the flow (the eject pending state is a
CH-side condition the user accepted).

Also gitignore /scratch/ for local test scripts.
Signal isn't useful in the paused-clone window: vm.info reflects device_tree
which only updates when guest writes B0EJ via SCI; during pause B0EJ never
fires, so the check would always warn. The genuine remove-failure case
(unknown id) already errors out of removeDeviceVM.
Flag had no proper lifecycle: host TAP/veth/CNI would leak, then a later
resize-up at the same index would collide. Closing it cleanly needs orphan
tracking (pending-cleanup state on the VM record). Out of scope for v1 — drop
the flag entirely and always tear down host plumbing after vm.remove-device.
Bring it back later with full pending-state tracking if a real user needs it.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds runtime NIC hot-resize support for running Cloud Hypervisor VMs via a new cocoon vm net --nics N VM command, and refactors the network provider interface to support per-NIC add/remove operations needed by hot-plug workflows.

Changes:

  • Introduces vm net CLI + extend/netresize interface and implements CH-side resize logic (vm.add-net / vm.remove-device) with incremental DB persistence.
  • Refactors network.Network from bulk Config(...) to per-NIC Add(...) / Remove(...), updating CNI and bridge backends accordingly.
  • Documents operational limitations/risks (pending eject in CH; guest must quiesce NICs) in README + KNOWN_ISSUES.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
types/network.go Adds typed constants for persisted network backend identifiers.
README.md Documents the new vm net command and its operational caveats.
network/network.go Refactors network provider interface to per-NIC Add/Remove and adds helpers for ranges/recovery.
network/cni/create.go Reworks CNI setup into Add(specs...) and adds Remove(indices...).
network/cni/create_linux.go Adds helper to delete TAPs inside a netns (for per-NIC removal).
network/cni/create_darwin.go Adds darwin stub for TAP deletion helper.
network/cni/cni.go Refactors Delete flow to share teardown helper and adds record deletion helper.
network/bridge/bridge_other.go Updates non-linux bridge implementation to new interface (unsupported stubs).
network/bridge/bridge_linux.go Implements bridge Add/Remove and refactors cleanup helpers.
KNOWN_ISSUES.md Documents CH pending-eject behavior and implications during NIC hot-remove.
hypervisor/cloudhypervisor/netresize.go Implements CH runtime NIC resize orchestration + persistence.
hypervisor/cloudhypervisor/extend.go Registers CH as implementing the new netresize.Resizer interface.
hypervisor/cloudhypervisor/args.go Adds deterministic CH device ID helper for cocoon-managed NICs.
hypervisor/cloudhypervisor/api.go Extends vm.info config struct to include net devices.
extend/netresize/netresize.go Introduces netresize API (Spec/Result/Plumbing/Resizer).
cmd/vm/run.go Updates VM creation networking path to use Network.Add helpers.
cmd/vm/netresize.go Implements cocoon vm net command handler and provider selection.
cmd/vm/lifecycle.go Updates network recovery and provider selection to new interface/constants.
cmd/vm/commands.go Wires new vm net subcommand and --nics flag.
cmd/vm/attach.go Refactors shared resolveAttacher to also return config + hypervisor for vm net.
.gitignore Ignores local /scratch/ directory.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread network/cni/cni.go Outdated
Comment thread network/cni/cni.go
Comment thread cmd/vm/commands.go
Comment thread README.md Outdated
Comment thread hypervisor/cloudhypervisor/netresize.go Outdated
CMGS added 4 commits May 12, 2026 15:35
…vider lookup

- runningVMClientWithRecord exposes (hc, vmID, rec) so NetResize avoids the
  duplicate ResolveRef+LoadRecord that inspectRunning already triggered.
- plumbingForVM delegates to providerForVM (now lazy-inits CNI when nil),
  eliminating the duplicated backend switch / BridgeDev check.
- hotSwapNets sets the deterministic cocoonNetID(MAC) so clone-added NICs
  match netresize's id convention (both still resolved by MAC, but
  consistent in vm.info output).
- Trim netresize package comment; the operational caveats already live in
  README + KNOWN_ISSUES.
- netResizeAdd: if appendNetworkConfig fails after addNetVM succeeds, roll
  back vm.remove-device + host plumbing. Without this, the live NIC with
  cocoon-net-<mac> id stays in CH while cocoon's record drops, and the
  next resize collides on the deterministic id.
- netResizeRemove: once removeDeviceVM succeeds the CH eject is irrevocable,
  so truncate the cocoon record even when plumbing.Remove fails (leaked
  TAP/CNI is gc'd later). Skipping truncate left the next resize unable to
  resolve the stale NIC's MAC against vm.info.
- cni.tearDownNICs: when cniConf is nil and bestEffort, return all record
  ids so deleteVM still sweeps the DB instead of stranding them. In the
  non-bestEffort path (Remove), CNI DEL / TAP delete failures now
  propagate — silently logging would leak IPAM and report a misleading
  success.
- bridge.Add: track successfully-created TAPs and tear them down on any
  later failure. Reachable from boot (AddRange(0, nics)) where a partial
  failure previously leaked the earlier-created TAPs on the bridge.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

Comment thread network/cni/create.go Outdated
Comment thread hypervisor/cloudhypervisor/netresize.go Outdated
CMGS added 4 commits May 12, 2026 15:44
"gc will reclaim" is misleading — running-VM mid-state leaks aren't in
the GC orphan scan; only stop+restart sweeps them. Adjust the warn
message accordingly.
- CNI Add: rollback defer now deletes TAPs left behind by setupTCRedirect
  when the netns was pre-existing (createdNetns=false). Without this, a
  partial failure inside the persistent netns leaked tap<prefix>-<idx>,
  and the next retry hit "file exists" in createTAP.
- NetResize: defer getVMInfo to the remove branch only; the add path
  doesn't need vm.info, so an add resize no longer depends on it.
Result gains Warnings []string. netResizeRemove appends a leak message when
plumbing.Remove fails after CH accepted the eject; the CLI logs each warning
and JSON output carries them. Without this, host plumbing leaks were only
visible in the daemon log.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 1 comment.

Comment thread network/cni/cni.go Outdated
After CNI.Remove was introduced for hot-resize, a VM resized down to 0 NICs
ends up with empty DB records but its netns intact (Remove preserves the
netns by design). vm rm then short-circuited on len(records)==0 and left
/var/run/netns/cocoon-<vmID> behind until gc ran. Drop the early return so
deleteNetns runs unconditionally (idempotent: ignores ENOENT).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Comment thread network/cni/cni.go Outdated
Comment thread hypervisor/cloudhypervisor/netresize.go Outdated
Comment thread hypervisor/cloudhypervisor/clone.go Outdated
CMGS added 2 commits May 12, 2026 16:23
vm stop/start does not run network teardown; only cocoon vm rm (via
Network.Delete) or gc reclaim the host plumbing. Replace the misleading
'stop+restart will reclaim' hint.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

Comment thread hypervisor/cloudhypervisor/netresize.go Outdated
Comment thread network/cni/create.go Outdated
tearDownNICs now attempts every cleanup step per record (CNI DEL + TAP
delete) instead of bailing on the first failure, so the leaked TAP no
longer blocks a later resize-up at the same index. CNI.Remove sweeps DB
records for every picked index regardless of teardown outcome, matching
netResizeRemove's "CH eject is irrevocable, truncate even if plumbing
leaks" invariant. The teardown error still propagates so the caller can
surface the IPAM leak via Result.Warnings.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 3 comments.

Comment thread hypervisor/cloudhypervisor/netresize.go
Comment thread network/cni/cni.go
Comment thread network/cni/create.go Outdated
errors.Join surfaces a deleteRecords failure even when tearDownNICs also
failed — the two indicate different problems (host teardown vs DB
divergence) and shouldn't mask each other.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.

Comment thread network/cni/cni.go Outdated
deleteVM previously relied on tearDownNICs' returned ids, so a record
whose persisted conflist name no longer resolves was logged-and-skipped
inside tearDownNICs and never deleted from the DB. Sweep all records of
this VM after teardown instead. Drops the unused []string return from
tearDownNICs since CNI.Remove also pre-collects pickedIDs.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated no new comments.

CMGS added 3 commits May 12, 2026 17:46
Per project rule "never dereference pointers without nil check". Both call
sites do nc.MAC; a nil entry in rec.NetworkConfigs would panic. Surface a
clear error instead.
After vm.remove-device CH only flags pending eject; the device stays in
device_tree until the guest writes B0EJ via ACPI/SCI. Pause/snapshot
iterate device_tree and error out with "receiving on a closed channel"
(VMM-thread response channel collapse) when called too soon.

Poll vm.info's device_tree for absence of the chID up to 15 s before
teardown. Linux acks in < 1 s; Windows can take a few seconds. The wait
is scoped to netresize remove — clone's hotSwapNets runs paused (guest
can't ACK) and we accept the existing CH limitation there.
@CMGS CMGS force-pushed the feat/vm-net-resize branch from b9479c2 to e642cce Compare May 12, 2026 11:26
CMGS added 2 commits May 12, 2026 19:41
…0s timeout

Two findings from final closure review:

1. netResizeAdd's persist-failure rollback called removeDeviceVM then
   plumbing.Remove with no wait between. Symmetric to the resize-down race
   that a3d41cc fixed: a fast plumbing.Remove can yank the TAP while CH's
   virtio-net ioeventfd is still attached. Mirror the same waitDeviceEjected
   bracket so the rollback can't recreate the channel-closed condition.

2. 5 s ejectWaitTimeout was tight for Windows. Bump to 10 s — Linux still
   acks in < 1 s; Windows NDIS can take a few.
Collapsed 8 multi-liners I added (resolveAttacher, plumbingForVM,
providerForVM, Result, addCocoonNIC, ejectWaitTimeout, the symmetric-
rollback inline, tearDownNICs, Remove) into single lines. Pre-existing
multi-line comments left alone.
@CMGS CMGS merged commit 8bbb518 into master May 12, 2026
4 checks passed
@CMGS CMGS deleted the feat/vm-net-resize branch May 12, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants