From 95d4a27b858df4199262efff9b611f4bd437b97e Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Fri, 8 May 2026 00:11:27 -0700 Subject: [PATCH] feat(vm): boot sandboxes from ext4 root disks --- architecture/compute-runtimes.md | 2 +- crates/openshell-driver-vm/README.md | 17 +- .../runtime/kernel/openshell.kconfig | 7 + .../scripts/openshell-vm-sandbox-init.sh | 14 +- crates/openshell-driver-vm/src/driver.rs | 291 +++++++----------- crates/openshell-driver-vm/src/ffi.rs | 23 +- crates/openshell-driver-vm/src/main.rs | 12 +- crates/openshell-driver-vm/src/rootfs.rs | 276 ++++++++++++++++- crates/openshell-driver-vm/src/runtime.rs | 165 ++++------ docs/reference/sandbox-compute-drivers.mdx | 2 + e2e/rust/e2e-vm.sh | 18 +- 11 files changed, 517 insertions(+), 310 deletions(-) diff --git a/architecture/compute-runtimes.md b/architecture/compute-runtimes.md index 9ab512f98..d640aea5f 100644 --- a/architecture/compute-runtimes.md +++ b/architecture/compute-runtimes.md @@ -23,7 +23,7 @@ Each runtime receives a sandbox spec from the gateway and is responsible for: | Docker | Local development with Docker available. | Container plus nested sandbox namespace. | Uses host networking so loopback gateway endpoints work from the supervisor. | | Podman | Rootless or single-machine deployments. | Container plus nested sandbox namespace. | Uses the Podman REST API, OCI image volumes, and CDI GPU devices when available. | | Kubernetes | Cluster deployment through Helm. | Pod plus nested sandbox namespace. | Uses Kubernetes API objects, service accounts, secrets, PVC-backed workspace storage, and GPU resources. | -| VM | Experimental microVM isolation. | Per-sandbox libkrun VM. | Gateway spawns `openshell-driver-vm` as a subprocess over a private, state-local Unix socket. | +| VM | Experimental microVM isolation. | Per-sandbox libkrun VM. | Gateway spawns `openshell-driver-vm` as a subprocess over a private, state-local Unix socket. The VM driver caches a prepared `rootfs.ext4` per source image and copies it per sandbox, so guest ownership metadata lives inside the ext4 filesystem instead of host directory entries. | VM runtime state paths are derived only from driver-validated sandbox IDs matching `[A-Za-z0-9._-]{1,128}`. The gateway-owned VM driver socket uses a diff --git a/crates/openshell-driver-vm/README.md b/crates/openshell-driver-vm/README.md index 0a11ceb0a..ca5446cf1 100644 --- a/crates/openshell-driver-vm/README.md +++ b/crates/openshell-driver-vm/README.md @@ -2,7 +2,7 @@ > Status: Experimental. The VM compute driver is under active development and the interface still has VM-specific plumbing that will be generalized. -Standalone libkrun-backed [`ComputeDriver`](../../proto/compute_driver.proto) for OpenShell. The gateway spawns this binary as a subprocess, talks to it over a Unix domain socket with the `openshell.compute.v1.ComputeDriver` gRPC surface, and lets it manage per-sandbox microVMs. The runtime (libkrun + libkrunfw + gvproxy) and the sandbox supervisor are embedded directly in the binary; each sandbox guest rootfs is derived from a configured container image at create time. +Standalone libkrun-backed [`ComputeDriver`](../../proto/compute_driver.proto) for OpenShell. The gateway spawns this binary as a subprocess, talks to it over a Unix domain socket with the `openshell.compute.v1.ComputeDriver` gRPC surface, and lets it manage per-sandbox microVMs. 
The runtime (libkrun + libkrunfw + gvproxy) and the sandbox supervisor are embedded directly in the binary; each sandbox boots from a copied ext4 root disk derived from the configured container image.

## How it fits together

@@ -42,7 +42,7 @@ By default `mise run gateway:vm`:

- Listens on plaintext HTTP at `127.0.0.1:18081`.
- Registers the CLI gateway `vm-dev` by writing `~/.config/openshell/gateways/vm-dev/metadata.json`. It does not modify the workspace `.env`.
- Persists the gateway SQLite DB under `.cache/gateway-vm/gateway.db`.
-- Places the VM driver state (per-sandbox rootfs plus `run/compute-driver.sock`) under `/tmp/openshell-vm-driver-$USER-vm-dev/` so the AF_UNIX socket path stays under macOS `SUN_LEN`.
+- Places the VM driver state (per-sandbox `rootfs.ext4`, image cache, and `run/compute-driver.sock`) under `/tmp/openshell-vm-driver-$USER-vm-dev/` so the AF_UNIX socket path stays under macOS `SUN_LEN`.
- Passes `--driver-dir $PWD/target/debug` so the freshly built `openshell-driver-vm` is used instead of an older installed copy from `~/.local/libexec/openshell`, `/usr/libexec/openshell`, or `/usr/local/libexec`.

For GPU passthrough (VFIO), pass `-- --gpu` and run with root privileges:

@@ -124,7 +124,7 @@ The gateway resolves `openshell-driver-vm` in this order: `--driver-dir`, conven

|---|---|---|---|
| `--drivers vm` | `OPENSHELL_DRIVERS` | `kubernetes` | Select the VM compute driver. |
| `--grpc-endpoint URL` | `OPENSHELL_GRPC_ENDPOINT` | — | Required. URL the sandbox guest dials to reach the gateway. Use `http://host.containers.internal:<port>` (or `host.docker.internal` / `host.openshell.internal`) so traffic flows through gvproxy's host-loopback NAT (HostIP `192.168.127.254` → host `127.0.0.1`). Loopback URLs like `http://127.0.0.1:<port>` are rewritten automatically by the driver. The bare gateway IP (`192.168.127.1`) only carries gvproxy's own services and will not reach host-bound ports. |
-| `--vm-driver-state-dir DIR` | `OPENSHELL_VM_DRIVER_STATE_DIR` | `target/openshell-vm-driver` | Per-sandbox rootfs, console logs, image cache, and private `run/compute-driver.sock` UDS. |
+| `--vm-driver-state-dir DIR` | `OPENSHELL_VM_DRIVER_STATE_DIR` | `target/openshell-vm-driver` | Per-sandbox root disk images, console logs, image cache, and private `run/compute-driver.sock` UDS. |
| `--driver-dir DIR` | `OPENSHELL_DRIVER_DIR` | unset | Override the directory searched for `openshell-driver-vm`. |
| `--vm-driver-vcpus N` | `OPENSHELL_VM_DRIVER_VCPUS` | `2` | vCPUs per sandbox. |
| `--vm-driver-mem-mib N` | `OPENSHELL_VM_DRIVER_MEM_MIB` | `2048` | Memory per sandbox, in MiB. |

@@ -145,7 +145,15 @@ The gateway is auto-registered by `mise run gateway:vm`. In another terminal:

./scripts/bin/openshell sandbox connect demo
```

-First sandbox takes 10–30 seconds to boot (image fetch/prepare/cache + libkrun + guest init). If `--from` is omitted, the VM driver uses the gateway's configured default sandbox image. Without either `--from` or `--sandbox-image`, VM sandbox creation fails. Subsequent creates reuse the prepared sandbox rootfs.
+First sandbox takes 10–30 seconds to boot (image fetch/prepare/cache + libkrun + guest init). If `--from` is omitted, the VM driver uses the gateway's configured default sandbox image. Without either `--from` or `--sandbox-image`, VM sandbox creation fails. Subsequent creates reuse the prepared image cache and copy its sparse root disk into the sandbox state directory before boot.
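+
+The per-sandbox copy is hole-preserving rather than byte-for-byte: all-zero
+chunks are skipped with a seek so the copy stays sparse on disk. A minimal
+sketch of the technique (simplified from the driver's `copy_rootfs_image_to`
+below; error mapping and path setup trimmed):
+
+```rust
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom, Write};
+
+fn copy_sparse(src: &std::path::Path, dst: &std::path::Path) -> std::io::Result<()> {
+    let mut input = File::open(src)?;
+    let mut output = File::create(dst)?;
+    let mut buf = vec![0u8; 1024 * 1024];
+    let mut total = 0u64;
+    loop {
+        let len = input.read(&mut buf)?;
+        if len == 0 {
+            break;
+        }
+        total += len as u64;
+        if buf[..len].iter().all(|b| *b == 0) {
+            // Seek past zero chunks instead of writing them; the gap
+            // becomes a filesystem hole in the destination.
+            output.seek(SeekFrom::Current(len as i64))?;
+        } else {
+            output.write_all(&buf[..len])?;
+        }
+    }
+    // Materialize any trailing hole and pin the final size.
+    output.set_len(total)
+}
+```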
+
+During rootfs preparation the VM driver exports or pulls the selected OCI image,
+applies the OpenShell guest mutations, formats a sparse `rootfs.ext4`, and
+caches it under `<state-dir>/images/<image-identity>/rootfs.ext4`. Each sandbox gets
+its own copied `rootfs.ext4` under `<state-dir>/sandboxes/<sandbox-id>/`. The host owns
+only the image file; guest ownership such as `/sandbox` UID/GID metadata lives
+inside the ext4 filesystem and is corrected by guest init before the supervisor
+starts.

## Logs and debugging

@@ -162,6 +170,7 @@ The VM guest's serial console is appended to `<state-dir>/<sandbox-id>/console.l

- macOS on Apple Silicon, or Linux on aarch64/x86_64 with KVM
- Rust toolchain
+- e2fsprogs (`mke2fs` or `mkfs.ext4`, plus `debugfs`) for root disk image creation and per-sandbox file injection
- Guest-supervisor cross-compile toolchain (needed on macOS, and on Linux when host arch ≠ guest arch):
  - Matching rustup target: `rustup target add aarch64-unknown-linux-gnu` (or `x86_64-unknown-linux-gnu` for an amd64 guest)
  - `cargo install --locked cargo-zigbuild` and `brew install zig` (or distro equivalent). `vm:supervisor` uses `cargo zigbuild` to cross-compile the in-VM `openshell-sandbox` supervisor binary.
diff --git a/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig b/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig
index b5f0330af..1248773d9 100644
--- a/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig
+++ b/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig
@@ -8,6 +8,13 @@
#
# See also: check-vm-capabilities.sh for runtime verification.

+# ── Root disk transport and filesystem ─────────────────────────────────
+CONFIG_BLOCK=y
+CONFIG_BLK_DEV=y
+CONFIG_VIRTIO_BLK=y
+CONFIG_EXT4_FS=y
+CONFIG_EXT4_USE_FOR_EXT2=y
+
# ── Network Namespaces (required for pod isolation) ─────────────────────
CONFIG_NET_NS=y
CONFIG_NAMESPACES=y
diff --git a/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh b/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh
index b61fd4900..707ad699d 100644
--- a/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh
+++ b/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh
@@ -239,7 +239,8 @@ setup_gpu() {
        return 1
    fi

-    # Stage GSP firmware from virtiofs to tmpfs to avoid slow FUSE reads
+    # Stage GSP firmware to tmpfs so module loading reads it from a stable
+    # early-boot path.
    if [ -d /lib/firmware/nvidia ]; then
        ts "staging GPU firmware to tmpfs"
        mkdir -p /run/firmware/nvidia
@@ -273,6 +274,15 @@ setup_gpu() {
    fi
}

+setup_sandbox_workdir() {
+    mkdir -p /sandbox
+    if !
chown -R sandbox:sandbox /sandbox 2>/dev/null; then + chown -R 10001:10001 /sandbox + fi + chmod 0755 /sandbox + ts "prepared /sandbox ownership" +} + mount -t proc proc /proc 2>/dev/null & mount -t sysfs sysfs /sys 2>/dev/null & mount -t tmpfs tmpfs /tmp 2>/dev/null & @@ -286,6 +296,8 @@ mount -t tmpfs tmpfs /dev/shm 2>/dev/null & mount -t cgroup2 cgroup2 /sys/fs/cgroup 2>/dev/null & wait +setup_sandbox_workdir + hostname openshell-sandbox-vm 2>/dev/null || true ip link set lo up 2>/dev/null || true diff --git a/crates/openshell-driver-vm/src/driver.rs b/crates/openshell-driver-vm/src/driver.rs index b797f4835..eb43ffa71 100644 --- a/crates/openshell-driver-vm/src/driver.rs +++ b/crates/openshell-driver-vm/src/driver.rs @@ -5,8 +5,9 @@ use crate::gpu::{ GpuInventory, SubnetAllocator, allocate_vsock_cid, mac_from_sandbox_id, tap_device_name, }; use crate::rootfs::{ - create_rootfs_archive_from_dir, extract_rootfs_archive_to, - prepare_sandbox_rootfs_from_image_root, sandbox_guest_init_path, + copy_rootfs_image_to, create_rootfs_image_from_dir, extract_rootfs_archive_to, + prepare_sandbox_rootfs_from_image_root, sandbox_guest_init_path, set_rootfs_image_file_mode, + write_rootfs_image_file, }; use bollard::Docker; use bollard::errors::Error as BollardError; @@ -37,7 +38,6 @@ use std::collections::{HashMap, HashSet}; use std::fs; use std::io::Read; use std::net::Ipv4Addr; -use std::os::unix::fs::PermissionsExt; use std::path::{Component, Path, PathBuf}; use std::pin::Pin; use std::process::Stdio; @@ -84,13 +84,13 @@ const OPENSHELL_HOST_GATEWAY_ALIAS: &str = "host.openshell.internal"; /// `GVPROXY_HOST_LOOPBACK_IP` — they do **not** go through the gateway IP. const GVPROXY_HOST_LOOPBACK_ALIAS: &str = "host.containers.internal"; const GUEST_SSH_SOCKET_PATH: &str = "/run/openshell/ssh.sock"; -const GUEST_TLS_DIR: &str = "/opt/openshell/tls"; const GUEST_TLS_CA_PATH: &str = "/opt/openshell/tls/ca.crt"; const GUEST_TLS_CERT_PATH: &str = "/opt/openshell/tls/tls.crt"; const GUEST_TLS_KEY_PATH: &str = "/opt/openshell/tls/tls.key"; const IMAGE_CACHE_ROOT_DIR: &str = "images"; -const IMAGE_CACHE_ROOTFS_ARCHIVE: &str = "rootfs.tar"; +const IMAGE_CACHE_ROOTFS_IMAGE: &str = "rootfs.ext4"; const IMAGE_EXPORT_ROOTFS_ARCHIVE: &str = "source-rootfs.tar"; +const IMAGE_CACHE_LAYOUT_VERSION: &str = "sandbox-rootfs-ext4-v1"; const IMAGE_IDENTITY_FILE: &str = "image-identity"; const IMAGE_REFERENCE_FILE: &str = "image-reference"; static IMAGE_CACHE_BUILD_COUNTER: AtomicU64 = AtomicU64::new(0); @@ -363,7 +363,7 @@ impl VmDriver { let gpu_device = spec.map_or("", |s| s.gpu_device.as_str()); let state_dir = sandbox_state_dir(&self.config.state_dir, &sandbox.id)?; - let rootfs = state_dir.join("rootfs"); + let root_disk = state_dir.join(IMAGE_CACHE_ROOTFS_IMAGE); let image_ref = self.resolved_sandbox_image(sandbox).ok_or_else(|| { Status::failed_precondition( "vm sandboxes require template.image or a configured default sandbox image", @@ -373,7 +373,7 @@ impl VmDriver { sandbox_id = %sandbox.id, image_ref = %image_ref, state_dir = %state_dir.display(), - "vm driver: resolved image ref, preparing rootfs" + "vm driver: resolved image ref, preparing root disk" ); tokio::fs::create_dir_all(&state_dir) @@ -398,14 +398,14 @@ impl VmDriver { ); let image_identity = match self - .prepare_runtime_rootfs(&sandbox.id, &image_ref, &rootfs) + .prepare_runtime_rootfs(&sandbox.id, &image_ref, &root_disk) .await { Ok(image_identity) => { info!( sandbox_id = %sandbox.id, image_identity = %image_identity, - "vm driver: rootfs 
prepared" + "vm driver: root disk prepared" ); image_identity } @@ -413,14 +413,14 @@ impl VmDriver { warn!( sandbox_id = %sandbox.id, error = %err.message(), - "vm driver: rootfs preparation failed" + "vm driver: root disk preparation failed" ); let _ = tokio::fs::remove_dir_all(&state_dir).await; return Err(err); } }; if let Some(tls_paths) = tls_paths.as_ref() - && let Err(err) = prepare_guest_tls_materials(&rootfs, tls_paths).await + && let Err(err) = prepare_guest_tls_materials(&root_disk, tls_paths).await { let _ = tokio::fs::remove_dir_all(&state_dir).await; return Err(Status::internal(format!( @@ -474,7 +474,7 @@ impl VmDriver { command.stdout(Stdio::inherit()); command.stderr(Stdio::inherit()); command.arg("--internal-run-vm"); - command.arg("--vm-rootfs").arg(&rootfs); + command.arg("--vm-root-disk").arg(&root_disk); command.arg("--vm-exec").arg(sandbox_guest_init_path()); command.arg("--vm-workdir").arg("/"); command.arg("--vm-console-output").arg(&console_output); @@ -733,17 +733,17 @@ impl VmDriver { &self, sandbox_id: &str, image_ref: &str, - rootfs: &Path, + root_disk: &Path, ) -> Result { let image_identity = self - .ensure_cached_image_rootfs_archive(sandbox_id, image_ref) + .ensure_cached_image_rootfs_image(sandbox_id, image_ref) .await?; - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, &image_identity); - let rootfs_dest = rootfs.to_path_buf(); - tokio::task::spawn_blocking(move || extract_rootfs_archive_to(&archive_path, &rootfs_dest)) + let cached_image = image_cache_rootfs_image(&self.config.state_dir, &image_identity); + let root_disk_dest = root_disk.to_path_buf(); + tokio::task::spawn_blocking(move || copy_rootfs_image_to(&cached_image, &root_disk_dest)) .await - .map_err(|err| Status::internal(format!("sandbox rootfs extraction panicked: {err}")))? - .map_err(|err| Status::internal(format!("extract sandbox rootfs failed: {err}")))?; + .map_err(|err| Status::internal(format!("sandbox root disk copy panicked: {err}")))? + .map_err(|err| Status::internal(format!("copy sandbox root disk failed: {err}")))?; Ok(image_identity) } @@ -757,14 +757,14 @@ impl VmDriver { }) } - async fn ensure_cached_image_rootfs_archive( + async fn ensure_cached_image_rootfs_image( &self, sandbox_id: &str, image_ref: &str, ) -> Result { if let Some((docker, image_identity)) = self.resolve_local_docker_image(image_ref).await? 
{ return self - .ensure_cached_local_image_rootfs_archive( + .ensure_cached_local_image_rootfs_image( sandbox_id, image_ref, &docker, @@ -773,7 +773,7 @@ impl VmDriver { .await; } - info!(image_ref = %image_ref, "vm driver: ensuring cached image rootfs archive (registry)"); + info!(image_ref = %image_ref, "vm driver: ensuring cached root disk image (registry)"); let reference = parse_registry_reference(image_ref)?; let client = registry_client(); let auth = registry_auth(image_ref)?; @@ -787,7 +787,7 @@ impl VmDriver { )) })?; info!(image_ref = %image_ref, "vm driver: fetching manifest digest"); - let image_identity = client + let source_image_identity = client .fetch_manifest_digest(&reference, &auth) .await .map_err(|err| { @@ -797,10 +797,11 @@ impl VmDriver { })?; info!( image_ref = %image_ref, - image_identity = %image_identity, + image_identity = %source_image_identity, "vm driver: manifest digest resolved" ); - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, &image_identity); + let image_identity = prepared_image_cache_identity(&source_image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, &image_identity); // Mirror the K8s `Pulling` event so the CLI flips to the // image-pull spinner with the image name as detail. We emit it @@ -816,37 +817,37 @@ impl VmDriver { ), ); - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { info!( image_identity = %image_identity, - archive_path = %archive_path.display(), - "vm driver: image rootfs archive cache hit (no build needed)" + image_path = %image_path.display(), + "vm driver: root disk image cache hit (no build needed)" ); - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; return Ok(image_identity); } info!( image_identity = %image_identity, - "vm driver: image rootfs archive cache miss, acquiring build lock" + "vm driver: root disk image cache miss, acquiring build lock" ); let _cache_guard = self.image_cache_lock.lock().await; info!( image_identity = %image_identity, "vm driver: build lock acquired" ); - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { info!( image_identity = %image_identity, - "vm driver: image rootfs archive cache hit after lock (built by another task)" + "vm driver: root disk image cache hit after lock (built by another task)" ); - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; return Ok(image_identity); } - self.build_cached_registry_image_rootfs_archive( + self.build_cached_registry_image_rootfs_image( sandbox_id, &client, &reference, @@ -855,7 +856,7 @@ impl VmDriver { &image_identity, ) .await?; - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; Ok(image_identity) } @@ -936,14 +937,15 @@ impl VmDriver { } } - async fn ensure_cached_local_image_rootfs_archive( + async fn ensure_cached_local_image_rootfs_image( &self, sandbox_id: &str, image_ref: &str, docker: &Docker, image_identity: &str, ) -> Result { - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, image_identity); + let cache_identity = prepared_image_cache_identity(image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, &cache_identity); self.publish_platform_event( 
sandbox_id.to_string(), @@ -955,38 +957,38 @@ impl VmDriver { ), ); - if tokio::fs::metadata(&archive_path).await.is_ok() { - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + if tokio::fs::metadata(&image_path).await.is_ok() { + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; - return Ok(image_identity.to_string()); + return Ok(cache_identity); } let _cache_guard = self.image_cache_lock.lock().await; - if tokio::fs::metadata(&archive_path).await.is_ok() { - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + if tokio::fs::metadata(&image_path).await.is_ok() { + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; - return Ok(image_identity.to_string()); + return Ok(cache_identity); } - self.build_cached_local_image_rootfs_archive(docker, image_ref, image_identity) + self.build_cached_local_image_rootfs_image(docker, image_ref, &cache_identity) .await?; - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; - Ok(image_identity.to_string()) + Ok(cache_identity) } - async fn build_cached_local_image_rootfs_archive( + async fn build_cached_local_image_rootfs_image( &self, docker: &Docker, image_ref: &str, image_identity: &str, ) -> Result<(), Status> { let cache_dir = image_cache_dir(&self.config.state_dir, image_identity); - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, image_identity); let staging_dir = image_cache_staging_dir(&self.config.state_dir, image_identity); let exported_rootfs = staging_dir.join(IMAGE_EXPORT_ROOTFS_ARCHIVE); let prepared_rootfs = staging_dir.join("rootfs"); - let prepared_archive = staging_dir.join(IMAGE_CACHE_ROOTFS_ARCHIVE); + let prepared_image = staging_dir.join(IMAGE_CACHE_ROOTFS_IMAGE); tokio::fs::create_dir_all(image_cache_root_dir(&self.config.state_dir)) .await @@ -1021,14 +1023,14 @@ impl VmDriver { let image_identity_owned = image_identity.to_string(); let exported_rootfs_for_build = exported_rootfs.clone(); let prepared_rootfs_for_build = prepared_rootfs.clone(); - let prepared_archive_for_build = prepared_archive.clone(); + let prepared_image_for_build = prepared_image.clone(); let build_result = tokio::task::spawn_blocking(move || { - prepare_exported_rootfs_archive( + prepare_exported_rootfs_image( &image_ref_owned, &image_identity_owned, &exported_rootfs_for_build, &prepared_rootfs_for_build, - &prepared_archive_for_build, + &prepared_image_for_build, ) }) .await @@ -1039,19 +1041,19 @@ impl VmDriver { return Err(Status::failed_precondition(err)); } - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { let _ = tokio::fs::remove_dir_all(&staging_dir).await; return Ok(()); } - tokio::fs::rename(&prepared_archive, &archive_path) + tokio::fs::rename(&prepared_image, &image_path) .await - .map_err(|err| Status::internal(format!("store cached image rootfs failed: {err}")))?; + .map_err(|err| Status::internal(format!("store cached rootfs image failed: {err}")))?; let _ = tokio::fs::remove_dir_all(&staging_dir).await; Ok(()) } - async fn build_cached_registry_image_rootfs_archive( + async fn build_cached_registry_image_rootfs_image( &self, sandbox_id: &str, client: &OciClient, @@ -1061,10 +1063,10 @@ impl VmDriver { image_identity: &str, ) -> Result<(), Status> { let cache_dir = image_cache_dir(&self.config.state_dir, image_identity); - let 
archive_path = image_cache_rootfs_archive(&self.config.state_dir, image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, image_identity); let staging_dir = image_cache_staging_dir(&self.config.state_dir, image_identity); let prepared_rootfs = staging_dir.join("rootfs"); - let prepared_archive = staging_dir.join(IMAGE_CACHE_ROOTFS_ARCHIVE); + let prepared_image = staging_dir.join(IMAGE_CACHE_ROOTFS_IMAGE); tokio::fs::create_dir_all(image_cache_root_dir(&self.config.state_dir)) .await @@ -1115,13 +1117,13 @@ impl VmDriver { } info!( image_ref = %image_ref, - "vm driver: image layers pulled, preparing rootfs archive" + "vm driver: image layers pulled, preparing rootfs image" ); let image_ref_owned = image_ref.to_string(); let image_identity_owned = image_identity.to_string(); let prepared_rootfs_for_build = prepared_rootfs.clone(); - let prepared_archive_for_build = prepared_archive.clone(); + let prepared_image_for_build = prepared_image.clone(); let build_result = tokio::task::spawn_blocking(move || { prepare_sandbox_rootfs_from_image_root( &prepared_rootfs_for_build, @@ -1130,7 +1132,7 @@ impl VmDriver { .map_err(|err| { format!("vm sandbox image '{image_ref_owned}' is not base-compatible: {err}") })?; - create_rootfs_archive_from_dir(&prepared_rootfs_for_build, &prepared_archive_for_build) + create_rootfs_image_from_dir(&prepared_rootfs_for_build, &prepared_image_for_build) }) .await .map_err(|err| Status::internal(format!("image rootfs preparation panicked: {err}")))?; @@ -1139,28 +1141,28 @@ impl VmDriver { warn!( image_ref = %image_ref, error = %err, - "vm driver: rootfs archive build failed" + "vm driver: rootfs image build failed" ); let _ = tokio::fs::remove_dir_all(&staging_dir).await; return Err(Status::failed_precondition(err)); } - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { info!( image_identity = %image_identity, - "vm driver: another task wrote archive while we were building, discarding ours" + "vm driver: another task wrote image while we were building, discarding ours" ); let _ = tokio::fs::remove_dir_all(&staging_dir).await; return Ok(()); } - tokio::fs::rename(&prepared_archive, &archive_path) + tokio::fs::rename(&prepared_image, &image_path) .await - .map_err(|err| Status::internal(format!("store cached image rootfs failed: {err}")))?; + .map_err(|err| Status::internal(format!("store cached rootfs image failed: {err}")))?; info!( image_identity = %image_identity, - archive_path = %archive_path.display(), - "vm driver: image rootfs archive committed to cache" + image_path = %image_path.display(), + "vm driver: root disk image committed to cache" ); let _ = tokio::fs::remove_dir_all(&staging_dir).await; Ok(()) @@ -1633,17 +1635,17 @@ async fn export_local_image_rootfs_to_path( } } -fn prepare_exported_rootfs_archive( +fn prepare_exported_rootfs_image( image_ref: &str, image_identity: &str, exported_rootfs: &Path, prepared_rootfs: &Path, - prepared_archive: &Path, + prepared_image: &Path, ) -> Result<(), String> { extract_rootfs_archive_to(exported_rootfs, prepared_rootfs)?; prepare_sandbox_rootfs_from_image_root(prepared_rootfs, image_identity) .map_err(|err| format!("vm sandbox image '{image_ref}' is not base-compatible: {err}"))?; - create_rootfs_archive_from_dir(prepared_rootfs, prepared_archive) + create_rootfs_image_from_dir(prepared_rootfs, prepared_image) } fn registry_client() -> OciClient { @@ -1807,8 +1809,8 @@ impl VmDriver { /// Emit a `Pulled` platform event with a 
message that mirrors the /// kubelet's `Successfully pulled image ... Image size: N bytes.` /// format so the CLI's `extract_image_size` parser works unchanged. - async fn publish_pulled_event(&self, sandbox_id: &str, image_ref: &str, archive_path: &Path) { - let size_suffix = tokio::fs::metadata(archive_path).await.map_or_else( + async fn publish_pulled_event(&self, sandbox_id: &str, image_ref: &str, image_path: &Path) { + let size_suffix = tokio::fs::metadata(image_path).await.map_or_else( |_| String::new(), |meta| format!(" Image size: {} bytes.", meta.len()), ); @@ -2341,8 +2343,8 @@ fn image_cache_dir(root: &Path, image_identity: &str) -> PathBuf { image_cache_root_dir(root).join(sanitize_image_identity(image_identity)) } -fn image_cache_rootfs_archive(root: &Path, image_identity: &str) -> PathBuf { - image_cache_dir(root, image_identity).join(IMAGE_CACHE_ROOTFS_ARCHIVE) +fn image_cache_rootfs_image(root: &Path, image_identity: &str) -> PathBuf { + image_cache_dir(root, image_identity).join(IMAGE_CACHE_ROOTFS_IMAGE) } fn image_cache_staging_dir(root: &Path, image_identity: &str) -> PathBuf { @@ -2353,6 +2355,10 @@ fn image_cache_staging_dir(root: &Path, image_identity: &str) -> PathBuf { )) } +fn prepared_image_cache_identity(image_identity: &str) -> String { + format!("{IMAGE_CACHE_LAYOUT_VERSION}:{image_identity}") +} + fn sanitize_image_identity(image_identity: &str) -> String { image_identity .chars() @@ -2391,26 +2397,28 @@ async fn write_sandbox_image_metadata( } async fn prepare_guest_tls_materials( - rootfs: &Path, + root_disk: &Path, paths: &VmDriverTlsPaths, -) -> Result<(), std::io::Error> { - let guest_tls_dir = rootfs.join(GUEST_TLS_DIR.trim_start_matches('/')); - tokio::fs::create_dir_all(&guest_tls_dir).await?; - - copy_guest_tls_material(&paths.ca, &guest_tls_dir.join("ca.crt"), 0o644).await?; - copy_guest_tls_material(&paths.cert, &guest_tls_dir.join("tls.crt"), 0o644).await?; - copy_guest_tls_material(&paths.key, &guest_tls_dir.join("tls.key"), 0o600).await?; - Ok(()) -} +) -> Result<(), String> { + let ca = tokio::fs::read(&paths.ca) + .await + .map_err(|err| format!("read {}: {err}", paths.ca.display()))?; + let cert = tokio::fs::read(&paths.cert) + .await + .map_err(|err| format!("read {}: {err}", paths.cert.display()))?; + let key = tokio::fs::read(&paths.key) + .await + .map_err(|err| format!("read {}: {err}", paths.key.display()))?; + let root_disk = root_disk.to_path_buf(); -async fn copy_guest_tls_material( - source: &Path, - dest: &Path, - mode: u32, -) -> Result<(), std::io::Error> { - tokio::fs::copy(source, dest).await?; - tokio::fs::set_permissions(dest, fs::Permissions::from_mode(mode)).await?; - Ok(()) + tokio::task::spawn_blocking(move || { + write_rootfs_image_file(&root_disk, GUEST_TLS_CA_PATH, &ca)?; + write_rootfs_image_file(&root_disk, GUEST_TLS_CERT_PATH, &cert)?; + write_rootfs_image_file(&root_disk, GUEST_TLS_KEY_PATH, &key)?; + set_rootfs_image_file_mode(&root_disk, GUEST_TLS_KEY_PATH, 0o100_600) + }) + .await + .map_err(|err| format!("guest TLS material injection panicked: {err}"))? 
} async fn terminate_vm_process(child: &mut Child) -> Result<(), std::io::Error> { @@ -3188,54 +3196,11 @@ mod tests { } #[test] - fn prepare_exported_rootfs_archive_rewrites_docker_exported_rootfs() { - let base = unique_temp_dir(); - let source_rootfs = base.join("source-rootfs"); - let exported_rootfs = base.join("exported-rootfs.tar"); - let prepared_rootfs = base.join("prepared-rootfs"); - let prepared_archive = base.join("prepared-rootfs.tar"); - let extracted = base.join("extracted"); - - for path in [ - "bin/bash", - "bin/mount", - "bin/sed", - "sbin/ip", - "opt/openshell/bin/openshell-sandbox", - ] { - let path = source_rootfs.join(path); - fs::create_dir_all(path.parent().unwrap()).unwrap(); - fs::write(path, "").unwrap(); - } - - create_rootfs_archive_from_dir(&source_rootfs, &exported_rootfs).unwrap(); - prepare_exported_rootfs_archive( - "openshell/sandbox-from:123", - "sha256:local-image", - &exported_rootfs, - &prepared_rootfs, - &prepared_archive, - ) - .unwrap(); - extract_rootfs_archive_to(&prepared_archive, &extracted).unwrap(); - - assert!(extracted.join("srv/openshell-vm-sandbox-init.sh").is_file()); - assert!( - extracted - .join("opt/openshell/bin/openshell-sandbox") - .is_file() - ); + fn prepared_image_cache_identity_includes_rootfs_layout_version() { assert_eq!( - fs::read_to_string(extracted.join("opt/openshell/.rootfs-type")).unwrap(), - "sandbox\n" + prepared_image_cache_identity("sha256:local-image"), + format!("{IMAGE_CACHE_LAYOUT_VERSION}:sha256:local-image") ); - assert!( - fs::read_to_string(extracted.join(".openshell-rootfs-variant")) - .unwrap() - .contains("sha256:local-image") - ); - - let _ = fs::remove_dir_all(base); } #[test] @@ -3247,50 +3212,22 @@ mod tests { } #[tokio::test] - async fn prepare_guest_tls_materials_copies_bundle_into_rootfs() { + async fn prepare_guest_tls_materials_reports_missing_input() { let base = unique_temp_dir(); - let source_dir = base.join("source"); - let rootfs = base.join("rootfs"); - std::fs::create_dir_all(&source_dir).unwrap(); - std::fs::create_dir_all(&rootfs).unwrap(); - - let ca = source_dir.join("ca.crt"); - let cert = source_dir.join("tls.crt"); - let key = source_dir.join("tls.key"); - std::fs::write(&ca, "ca").unwrap(); - std::fs::write(&cert, "cert").unwrap(); - std::fs::write(&key, "key").unwrap(); - - prepare_guest_tls_materials( - &rootfs, + let source_dir = base.join("missing-source"); + + let err = prepare_guest_tls_materials( + &base.join("rootfs.ext4"), &VmDriverTlsPaths { - ca: ca.clone(), - cert: cert.clone(), - key: key.clone(), + ca: source_dir.join("ca.crt"), + cert: source_dir.join("tls.crt"), + key: source_dir.join("tls.key"), }, ) .await - .unwrap(); + .expect_err("missing TLS materials should fail before image injection"); - let guest_dir = rootfs.join(GUEST_TLS_DIR.trim_start_matches('/')); - assert_eq!( - std::fs::read_to_string(guest_dir.join("ca.crt")).unwrap(), - "ca" - ); - assert_eq!( - std::fs::read_to_string(guest_dir.join("tls.crt")).unwrap(), - "cert" - ); - assert_eq!( - std::fs::read_to_string(guest_dir.join("tls.key")).unwrap(), - "key" - ); - let key_mode = std::fs::metadata(guest_dir.join("tls.key")) - .unwrap() - .permissions() - .mode() - & 0o777; - assert_eq!(key_mode, 0o600); + assert!(err.contains("ca.crt")); let _ = std::fs::remove_dir_all(base); } diff --git a/crates/openshell-driver-vm/src/ffi.rs b/crates/openshell-driver-vm/src/ffi.rs index db5d3ec10..423ad6f05 100644 --- a/crates/openshell-driver-vm/src/ffi.rs +++ b/crates/openshell-driver-vm/src/ffi.rs @@ -29,7 
+29,18 @@ type KrunInitLog = type KrunCreateCtx = unsafe extern "C" fn() -> i32; type KrunFreeCtx = unsafe extern "C" fn(ctx_id: u32) -> i32; type KrunSetVmConfig = unsafe extern "C" fn(ctx_id: u32, num_vcpus: u8, ram_mib: u32) -> i32; -type KrunSetRoot = unsafe extern "C" fn(ctx_id: u32, root_path: *const c_char) -> i32; +type KrunAddDisk = unsafe extern "C" fn( + ctx_id: u32, + block_id: *const c_char, + disk_path: *const c_char, + read_only: bool, +) -> i32; +type KrunSetRootDiskRemount = unsafe extern "C" fn( + ctx_id: u32, + device: *const c_char, + fstype: *const c_char, + options: *const c_char, +) -> i32; type KrunSetWorkdir = unsafe extern "C" fn(ctx_id: u32, workdir_path: *const c_char) -> i32; type KrunSetExec = unsafe extern "C" fn( ctx_id: u32, @@ -67,7 +78,8 @@ pub struct LibKrun { pub krun_create_ctx: KrunCreateCtx, pub krun_free_ctx: KrunFreeCtx, pub krun_set_vm_config: KrunSetVmConfig, - pub krun_set_root: KrunSetRoot, + pub krun_add_disk: KrunAddDisk, + pub krun_set_root_disk_remount: KrunSetRootDiskRemount, pub krun_set_workdir: KrunSetWorkdir, pub krun_set_exec: KrunSetExec, pub krun_set_console_output: KrunSetConsoleOutput, @@ -119,7 +131,12 @@ impl LibKrun { krun_create_ctx: load_symbol(library, b"krun_create_ctx\0", &libkrun_path)?, krun_free_ctx: load_symbol(library, b"krun_free_ctx\0", &libkrun_path)?, krun_set_vm_config: load_symbol(library, b"krun_set_vm_config\0", &libkrun_path)?, - krun_set_root: load_symbol(library, b"krun_set_root\0", &libkrun_path)?, + krun_add_disk: load_symbol(library, b"krun_add_disk\0", &libkrun_path)?, + krun_set_root_disk_remount: load_symbol( + library, + b"krun_set_root_disk_remount\0", + &libkrun_path, + )?, krun_set_workdir: load_symbol(library, b"krun_set_workdir\0", &libkrun_path)?, krun_set_exec: load_symbol(library, b"krun_set_exec\0", &libkrun_path)?, krun_set_console_output: load_symbol( diff --git a/crates/openshell-driver-vm/src/main.rs b/crates/openshell-driver-vm/src/main.rs index ed9967f4a..edc878f91 100644 --- a/crates/openshell-driver-vm/src/main.rs +++ b/crates/openshell-driver-vm/src/main.rs @@ -27,8 +27,8 @@ struct Args { #[arg(long, hide = true, default_value_t = false)] internal_run_vm: bool, - #[arg(long, hide = true)] - vm_rootfs: Option, + #[arg(long = "vm-root-disk", hide = true, alias = "vm-rootfs")] + vm_root_disk: Option, #[arg(long, hide = true)] vm_exec: Option, @@ -448,10 +448,10 @@ impl Stream for AuthenticatedUnixIncoming { } fn build_vm_launch_config(args: &Args) -> std::result::Result { - let rootfs = args - .vm_rootfs + let root_disk = args + .vm_root_disk .clone() - .ok_or_else(|| "--vm-rootfs is required in internal VM mode".to_string())?; + .ok_or_else(|| "--vm-root-disk is required in internal VM mode".to_string())?; let exec_path = args .vm_exec .clone() @@ -468,7 +468,7 @@ fn build_vm_launch_config(args: &Args) -> std::result::Result &'static str { SANDBOX_GUEST_INIT_PATH @@ -44,6 +51,7 @@ pub fn extract_rootfs_archive_to(archive_path: &Path, dest: &Path) -> Result<(), .map_err(|e| format!("extract rootfs tarball into {}: {e}", dest.display())) } +#[cfg(test)] pub fn create_rootfs_archive_from_dir(source: &Path, archive_path: &Path) -> Result<(), String> { if let Some(parent) = archive_path.parent() { fs::create_dir_all(parent).map_err(|e| format!("create {}: {e}", parent.display()))?; @@ -65,6 +73,104 @@ pub fn create_rootfs_archive_from_dir(source: &Path, archive_path: &Path) -> Res .map_err(|e| format!("finalize {}: {e}", archive_path.display())) } +pub fn 
create_rootfs_image_from_dir(source: &Path, image_path: &Path) -> Result<(), String> { + if let Some(parent) = image_path.parent() { + fs::create_dir_all(parent).map_err(|e| format!("create {}: {e}", parent.display()))?; + } + if image_path.exists() { + fs::remove_file(image_path) + .map_err(|e| format!("remove old rootfs image {}: {e}", image_path.display()))?; + } + + let image_size = rootfs_image_size_bytes(source)?; + let image = File::create(image_path) + .map_err(|e| format!("create rootfs image {}: {e}", image_path.display()))?; + image + .set_len(image_size) + .map_err(|e| format!("size rootfs image {}: {e}", image_path.display()))?; + drop(image); + + if let Err(err) = format_ext4_image_from_dir(source, image_path) { + let _ = fs::remove_file(image_path); + return Err(err); + } + + Ok(()) +} + +pub fn copy_rootfs_image_to(source: &Path, dest: &Path) -> Result<(), String> { + if let Some(parent) = dest.parent() { + fs::create_dir_all(parent).map_err(|e| format!("create {}: {e}", parent.display()))?; + } + if dest.exists() { + fs::remove_file(dest) + .map_err(|e| format!("remove old rootfs image {}: {e}", dest.display()))?; + } + + let mut input = File::open(source) + .map_err(|e| format!("open cached rootfs image {}: {e}", source.display()))?; + let mut output = + File::create(dest).map_err(|e| format!("create rootfs image {}: {e}", dest.display()))?; + let mut buf = vec![0; 1024 * 1024]; + let mut total = 0_u64; + + loop { + let len = input + .read(&mut buf) + .map_err(|e| format!("read rootfs image {}: {e}", source.display()))?; + if len == 0 { + break; + } + total += len as u64; + if buf[..len].iter().all(|byte| *byte == 0) { + let offset = i64::try_from(len).map_err(|e| { + format!( + "convert sparse rootfs image seek offset for {}: {e}", + dest.display() + ) + })?; + output + .seek(SeekFrom::Current(offset)) + .map_err(|e| format!("seek sparse rootfs image {}: {e}", dest.display()))?; + } else { + output + .write_all(&buf[..len]) + .map_err(|e| format!("write rootfs image {}: {e}", dest.display()))?; + } + } + + output + .set_len(total) + .map_err(|e| format!("finalize rootfs image {}: {e}", dest.display())) +} + +pub fn write_rootfs_image_file( + image_path: &Path, + guest_path: &str, + contents: &[u8], +) -> Result<(), String> { + ensure_rootfs_image_parent_dirs(image_path, guest_path); + + let tmp_path = temporary_injection_path(image_path); + fs::write(&tmp_path, contents).map_err(|e| format!("write {}: {e}", tmp_path.display()))?; + let _ = run_debugfs(image_path, &format!("rm {guest_path}")); + let result = run_debugfs( + image_path, + &format!("write {} {}", tmp_path.display(), guest_path), + ); + let _ = fs::remove_file(&tmp_path); + result +} + +pub fn set_rootfs_image_file_mode( + image_path: &Path, + guest_path: &str, + mode: u32, +) -> Result<(), String> { + run_debugfs(image_path, &format!("sif {guest_path} mode {mode:o}")) +} + +#[cfg(test)] fn append_rootfs_tree_to_archive( builder: &mut tar::Builder>, source: &Path, @@ -119,6 +225,7 @@ fn append_rootfs_tree_to_archive( Ok(()) } +#[cfg(test)] fn append_symlink_to_archive( builder: &mut tar::Builder>, source_path: &Path, @@ -165,6 +272,7 @@ fn prepare_sandbox_rootfs(rootfs: &Path) -> Result<(), String> { fs::write(opt_dir.join(".rootfs-type"), "sandbox\n") .map_err(|e| format!("write sandbox rootfs marker: {e}"))?; ensure_sandbox_guest_user(rootfs)?; + create_sandbox_mountpoint(&rootfs.join("sandbox"))?; Ok(()) } @@ -182,6 +290,159 @@ pub fn validate_sandbox_rootfs(rootfs: &Path) -> Result<(), String> { Ok(()) } 
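+/// Mountpoint for the guest's `/sandbox` workdir. The host only guarantees
+/// the directory exists with mode 0755; ownership is finalized inside the
+/// guest by `setup_sandbox_workdir` in the init script, since host-side
+/// UIDs are not meaningful inside the ext4 image.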
+fn create_sandbox_mountpoint(path: &Path) -> Result<(), String> { + fs::create_dir_all(path).map_err(|e| format!("create {}: {e}", path.display()))?; + #[cfg(unix)] + { + use std::os::unix::fs::PermissionsExt as _; + + fs::set_permissions(path, fs::Permissions::from_mode(0o755)) + .map_err(|e| format!("chmod {}: {e}", path.display()))?; + } + Ok(()) +} + +fn rootfs_image_size_bytes(source: &Path) -> Result { + let used = directory_size_bytes(source)?; + let headroom = (used / 4).max(ROOTFS_IMAGE_MIN_HEADROOM_BYTES); + let size = (used + headroom).max(ROOTFS_IMAGE_MIN_SIZE_BYTES); + Ok(round_up_to_mib(size)) +} + +fn directory_size_bytes(path: &Path) -> Result { + let metadata = + fs::symlink_metadata(path).map_err(|e| format!("stat {}: {e}", path.display()))?; + if metadata.file_type().is_file() || metadata.file_type().is_symlink() { + return Ok(metadata.len()); + } + if !metadata.file_type().is_dir() { + return Ok(0); + } + + let mut size = 4096; + for entry in fs::read_dir(path).map_err(|e| format!("read {}: {e}", path.display()))? { + let entry = entry.map_err(|e| format!("read {}: {e}", path.display()))?; + size += directory_size_bytes(&entry.path())?; + } + Ok(size) +} + +fn round_up_to_mib(bytes: u64) -> u64 { + const MIB: u64 = 1024 * 1024; + bytes.div_ceil(MIB) * MIB +} + +fn format_ext4_image_from_dir(source: &Path, image_path: &Path) -> Result<(), String> { + let mut last_error = None; + for tool in ["mke2fs", "mkfs.ext4"] { + for candidate in e2fs_tool_candidates(tool) { + let label = candidate.display().to_string(); + let output = Command::new(&candidate) + .arg("-q") + .arg("-F") + .arg("-t") + .arg("ext4") + .arg("-E") + .arg("root_owner=0:0") + .arg("-d") + .arg(source) + .arg(image_path) + .output(); + match output { + Ok(output) if output.status.success() => return Ok(()), + Ok(output) => { + last_error = Some(format!( + "{label} failed with status {}\nstdout: {}\nstderr: {}", + output.status, + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr) + )); + } + Err(err) if err.kind() == std::io::ErrorKind::NotFound => { + last_error = Some(format!("{label} not found")); + } + Err(err) => { + last_error = Some(format!("run {label}: {err}")); + } + } + } + } + Err(format!( + "failed to create ext4 rootfs image from {}: {}. 
Install e2fsprogs (mke2fs/mkfs.ext4) and retry", + source.display(), + last_error.unwrap_or_else(|| "no ext4 formatter found".to_string()) + )) +} + +fn ensure_rootfs_image_parent_dirs(image_path: &Path, guest_path: &str) { + let Some(parent) = Path::new(guest_path).parent() else { + return; + }; + let mut current = String::new(); + for component in parent.components() { + let part = component.as_os_str().to_string_lossy(); + if part == "/" || part.is_empty() { + continue; + } + current.push('/'); + current.push_str(&part); + let _ = run_debugfs(image_path, &format!("mkdir {current}")); + } +} + +fn run_debugfs(image_path: &Path, command: &str) -> Result<(), String> { + let mut last_error = None; + for candidate in e2fs_tool_candidates("debugfs") { + let label = candidate.display().to_string(); + let output = Command::new(&candidate) + .arg("-w") + .arg("-R") + .arg(command) + .arg(image_path) + .output(); + match output { + Ok(output) if output.status.success() => return Ok(()), + Ok(output) => { + last_error = Some(format!( + "{label} failed with status {}\nstdout: {}\nstderr: {}", + output.status, + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr) + )); + } + Err(err) if err.kind() == std::io::ErrorKind::NotFound => { + last_error = Some(format!("{label} not found")); + } + Err(err) => { + last_error = Some(format!("run {label}: {err}")); + } + } + } + Err(format!( + "debugfs command '{command}' failed for {}: {}. Install e2fsprogs (debugfs) and retry", + image_path.display(), + last_error.unwrap_or_else(|| "debugfs not found".to_string()) + )) +} + +fn e2fs_tool_candidates(tool: &str) -> Vec { + let mut candidates = vec![PathBuf::from(tool)]; + for root in ["/opt/homebrew/opt/e2fsprogs", "/usr/local/opt/e2fsprogs"] { + candidates.push(Path::new(root).join("sbin").join(tool)); + candidates.push(Path::new(root).join("bin").join(tool)); + } + candidates +} + +fn temporary_injection_path(image_path: &Path) -> PathBuf { + let n = INJECTION_COUNTER.fetch_add(1, Ordering::Relaxed); + let parent = image_path.parent().unwrap_or_else(|| Path::new(".")); + parent.join(format!( + ".openshell-rootfs-inject-{}-{n}", + std::process::id() + )) +} + fn ensure_sandbox_guest_user(rootfs: &Path) -> Result<(), String> { const SANDBOX_UID: u32 = 10001; const SANDBOX_GID: u32 = 10001; @@ -343,7 +604,13 @@ mod tests { validate_sandbox_rootfs(&rootfs).expect("validate sandbox rootfs"); assert!(rootfs.join("srv/openshell-vm-sandbox-init.sh").is_file()); - assert!(!rootfs.join("sandbox").exists()); + assert!(rootfs.join("sandbox").is_dir()); + assert!( + fs::read_dir(rootfs.join("sandbox")) + .expect("read sandbox") + .next() + .is_none() + ); assert!( fs::read_to_string(rootfs.join("etc/passwd")) .expect("read passwd") @@ -363,7 +630,7 @@ mod tests { } #[test] - fn prepare_sandbox_rootfs_preserves_image_workdir_contents() { + fn prepare_sandbox_rootfs_preserves_image_workdir_contents_in_rootfs() { let dir = unique_temp_dir(); let rootfs = dir.join("rootfs"); @@ -378,6 +645,7 @@ mod tests { prepare_sandbox_rootfs(&rootfs).expect("prepare sandbox rootfs"); + assert!(rootfs.join("sandbox").is_dir()); assert_eq!( fs::read_to_string(rootfs.join("sandbox/app.py")).expect("read app"), "print('hello')\n" diff --git a/crates/openshell-driver-vm/src/runtime.rs b/crates/openshell-driver-vm/src/runtime.rs index 758808c8e..a10f7efd2 100644 --- a/crates/openshell-driver-vm/src/runtime.rs +++ b/crates/openshell-driver-vm/src/runtime.rs @@ -10,7 +10,7 @@ use std::ptr; use 
std::sync::atomic::{AtomicI32, Ordering}; use std::time::{Duration, Instant}; -use crate::{embedded_runtime, ffi, procguard}; +use crate::{embedded_runtime, ffi, procguard, rootfs}; pub const VM_RUNTIME_DIR_ENV: &str = "OPENSHELL_VM_RUNTIME_DIR"; @@ -18,7 +18,7 @@ pub const VM_RUNTIME_DIR_ENV: &str = "OPENSHELL_VM_RUNTIME_DIR"; /// Used by the SIGTERM/SIGINT handler to forward signals to the VM. static CHILD_PID: AtomicI32 = AtomicI32::new(0); -/// PID of the helper process (gvproxy for libkrun, virtiofsd for QEMU). +/// PID of the helper process (gvproxy for libkrun; zero for QEMU). /// Zero when not running. Used by the SIGTERM/SIGINT handler and /// procguard cleanup callback to ensure the helper doesn't outlive the /// launcher (especially on macOS where `PR_SET_PDEATHSIG` is absent). @@ -45,7 +45,7 @@ const COMPAT_NET_FEATURES: u32 = NET_FEATURE_CSUM | NET_FEATURE_HOST_UFO; pub struct VmLaunchConfig { - pub rootfs: PathBuf, + pub root_disk: PathBuf, pub vcpus: u8, pub mem_mib: u32, pub exec_path: String, @@ -96,10 +96,10 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { .as_deref() .ok_or("host_ip is required for QEMU backend")?; - if !config.rootfs.is_dir() { + if !config.root_disk.is_file() { return Err(format!( - "rootfs directory not found: {}", - config.rootfs.display() + "root disk image not found: {}", + config.root_disk.display() )); } @@ -111,70 +111,13 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { check_kvm_access()?; let guest_env = qemu_guest_env_vars(config, host_dns_server()); - write_guest_env_file(&config.rootfs, &guest_env)?; - - let rootfs_str = config.rootfs.to_str().ok_or("rootfs path not UTF-8")?; - let sandbox_dir = config.rootfs.parent().unwrap_or(&config.rootfs); - let sock_prefix = tap_device.trim_start_matches("vmtap-"); - let virtiofsd_sock_dir = PathBuf::from(format!("/tmp/ovm-qemu-{sock_prefix}")); - std::fs::create_dir_all(&virtiofsd_sock_dir) - .map_err(|e| format!("create virtiofsd sock dir: {e}"))?; - let virtiofsd_sock = virtiofsd_sock_dir.join("virtiofsd.sock"); - let shm_path = format!("/dev/shm/ovm-qemu-{sock_prefix}"); - - std::fs::create_dir_all(&shm_path).map_err(|e| format!("create shm dir: {e}"))?; + write_guest_env_file(&config.root_disk, &guest_env)?; let runtime_dir = qemu_runtime_dir()?; - let gw_port = config.gateway_port.unwrap_or(0); setup_tap_networking(tap_device, host_ip, gw_port)?; let mut tap_guard = TapGuard::new(tap_device.to_string(), host_ip.to_string(), gw_port); - let virtiofsd_log = sandbox_dir.join("virtiofsd.log"); - let virtiofsd_log_file = - std::fs::File::create(&virtiofsd_log).map_err(|e| format!("create virtiofsd log: {e}"))?; - - let virtiofsd_bin = { - let runtime_virtiofsd = runtime_dir.join("virtiofsd"); - if runtime_virtiofsd.is_file() { - runtime_virtiofsd - } else { - PathBuf::from("virtiofsd") - } - }; - - let mut virtiofsd_cmd = StdCommand::new(&virtiofsd_bin); - virtiofsd_cmd - .arg("--socket-path") - .arg(&virtiofsd_sock) - .arg("--shared-dir") - .arg(rootfs_str) - .arg("--cache=auto") - .stdin(Stdio::null()) - .stdout(Stdio::null()) - .stderr(virtiofsd_log_file); - - #[cfg(target_os = "linux")] - { - use nix::sys::signal::Signal; - use std::os::unix::process::CommandExt as _; - unsafe { - virtiofsd_cmd.pre_exec(|| { - nix::sys::prctl::set_pdeathsig(Signal::SIGKILL) - .map_err(|err| std::io::Error::other(format!("pdeathsig: {err}"))) - }); - } - } - - let virtiofsd_child = virtiofsd_cmd - .spawn() - .map_err(|e| format!("failed to start virtiofsd: {e}"))?; - let 
virtiofsd_pid = virtiofsd_child.id().cast_signed(); - GVPROXY_PID.store(virtiofsd_pid, Ordering::Relaxed); - let mut virtiofsd_guard = GvproxyGuard::new(virtiofsd_child); - - wait_for_path(&virtiofsd_sock, Duration::from_secs(5), "virtiofsd socket")?; - let vmlinux = runtime_dir.join("vmlinux"); if !vmlinux.is_file() { return Err(format!("VM kernel not found: {}", vmlinux.display())); @@ -198,20 +141,13 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { .arg(&vmlinux) .arg("-append") .arg(&kernel_cmdline) - .arg("-chardev") + .arg("-drive") .arg(format!( - "socket,id=virtiofs,path={}", - virtiofsd_sock.display() + "file={},if=none,format=raw,id=rootfs", + config.root_disk.display() )) .arg("-device") - .arg("vhost-user-fs-pci,chardev=virtiofs,tag=rootfs") - .arg("-object") - .arg(format!( - "memory-backend-memfd,id=mem,size={}M,share=on", - config.mem_mib - )) - .arg("-numa") - .arg("node,memdev=mem") + .arg("virtio-blk-pci,drive=rootfs") .arg("-netdev") .arg(format!( "tap,id=net0,ifname={tap_device},script=no,downscript=no" @@ -263,15 +199,8 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { .map_err(|e| format!("failed to wait for QEMU: {e}"))?; CHILD_PID.store(0, Ordering::Relaxed); - unsafe { - libc::kill(virtiofsd_pid, libc::SIGTERM); - } - virtiofsd_guard.disarm(); - GVPROXY_PID.store(0, Ordering::Relaxed); teardown_tap_networking(tap_device, host_ip, gw_port); tap_guard.disarm(); - let _ = std::fs::remove_dir_all(&shm_path); - let _ = std::fs::remove_dir_all(&virtiofsd_sock_dir); if status.success() { Ok(()) @@ -280,12 +209,11 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { } } -/// Write environment variables into the rootfs so the guest init script -/// can source them. virtiofs shares the host rootfs directory into the guest. -fn write_guest_env_file(rootfs: &Path, env_vars: &[String]) -> Result<(), String> { - let srv_dir = rootfs.join("srv"); - std::fs::create_dir_all(&srv_dir).map_err(|e| format!("create /srv in rootfs: {e}"))?; - let env_file = srv_dir.join("openshell-env.sh"); +/// Write environment variables into the root disk so the guest init script can +/// source them. QEMU does not provide a `krun_set_exec` equivalent, so the +/// launcher injects this small per-sandbox file into the copied root image +/// before boot. 
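+/// The file lands at `/srv/openshell-env.sh` via the debugfs-backed
+/// `rootfs::write_rootfs_image_file`, so the host never has to mount the
+/// guest image.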
+fn write_guest_env_file(root_disk: &Path, env_vars: &[String]) -> Result<(), String> { let mut content = String::new(); for var in env_vars { if let Some((key, value)) = var.split_once('=') { @@ -293,8 +221,7 @@ fn write_guest_env_file(rootfs: &Path, env_vars: &[String]) -> Result<(), String let _ = writeln!(content, "export {key}=\"{}\"", shell_escape(value)); } } - std::fs::write(&env_file, &content).map_err(|e| format!("write guest env file: {e}"))?; - Ok(()) + rootfs::write_rootfs_image_file(root_disk, "/srv/openshell-env.sh", content.as_bytes()) } fn qemu_guest_env_vars(config: &VmLaunchConfig, dns_server: Option) -> Vec { @@ -331,8 +258,8 @@ fn shell_escape(s: &str) -> String { fn build_kernel_cmdline(config: &VmLaunchConfig) -> String { let mut parts = vec![ "console=ttyS0".to_string(), - "root=rootfs".to_string(), - "rootfstype=virtiofs".to_string(), + "root=/dev/vda".to_string(), + "rootfstype=ext4".to_string(), "rw".to_string(), "panic=-1".to_string(), format!("init={}", config.exec_path), @@ -674,10 +601,10 @@ fn procguard_kill_children() { } fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> { - if !config.rootfs.is_dir() { + if !config.root_disk.is_file() { return Err(format!( - "rootfs directory not found: {}", - config.rootfs.display() + "root disk image not found: {}", + config.root_disk.display() )); } @@ -702,7 +629,7 @@ fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> { let vm = VmContext::create(&runtime_dir, config.log_level)?; vm.set_vm_config(config.vcpus, config.mem_mib)?; - vm.set_root(&config.rootfs)?; + vm.set_root_disk(&config.root_disk)?; vm.set_workdir(&config.workdir)?; // Run gvproxy strictly as the guest's virtual NIC / DHCP / router. @@ -749,12 +676,12 @@ fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> { )); } - let sock_base = gvproxy_socket_base(&config.rootfs)?; + let sock_base = gvproxy_socket_base(&config.root_disk)?; let net_sock = sock_base.with_extension("v"); let _ = std::fs::remove_file(&net_sock); let _ = std::fs::remove_file(sock_base.with_extension("v-krun.sock")); - let run_dir = config.rootfs.parent().unwrap_or(&config.rootfs); + let run_dir = config.root_disk.parent().unwrap_or(&config.root_disk); let gvproxy_log = run_dir.join("gvproxy.log"); let gvproxy_log_file = std::fs::File::create(&gvproxy_log) .map_err(|e| format!("create gvproxy log {}: {e}", gvproxy_log.display()))?; @@ -1013,11 +940,37 @@ impl VmContext { ) } - fn set_root(&self, rootfs: &Path) -> Result<(), String> { - let rootfs_c = path_to_cstring(rootfs)?; + fn set_root_disk(&self, root_disk: &Path) -> Result<(), String> { + let root_disk_c = path_to_cstring(root_disk)?; + let block_id_c = CString::new("root").map_err(|e| format!("invalid block id: {e}"))?; + check( + unsafe { + (self.krun.krun_add_disk)( + self.ctx_id, + block_id_c.as_ptr(), + root_disk_c.as_ptr(), + false, + ) + }, + "krun_add_disk", + )?; + + let device_c = + CString::new("/dev/vda").map_err(|e| format!("invalid root disk device: {e}"))?; + let fstype_c = + CString::new("ext4").map_err(|e| format!("invalid root disk fstype: {e}"))?; + let options_c = + CString::new("rw").map_err(|e| format!("invalid root disk options: {e}"))?; check( - unsafe { (self.krun.krun_set_root)(self.ctx_id, rootfs_c.as_ptr()) }, - "krun_set_root", + unsafe { + (self.krun.krun_set_root_disk_remount)( + self.ctx_id, + device_c.as_ptr(), + fstype_c.as_ptr(), + options_c.as_ptr(), + ) + }, + "krun_set_root_disk_remount", ) } @@ -1234,8 +1187,8 @@ fn secure_socket_base(subdir: &str) -> 
Result<PathBuf, String> {
    Ok(dir)
}

-fn gvproxy_socket_base(rootfs: &Path) -> Result<PathBuf, String> {
-    Ok(secure_socket_base("osd-gv")?.join(hash_path_id(rootfs)))
+fn gvproxy_socket_base(root_disk: &Path) -> Result<PathBuf, String> {
+    Ok(secure_socket_base("osd-gv")?.join(hash_path_id(root_disk)))
}

fn install_signal_forwarding(pid: i32) {
@@ -1342,7 +1295,7 @@ mod tests {

    fn qemu_config() -> VmLaunchConfig {
        VmLaunchConfig {
-            rootfs: PathBuf::from("/rootfs"),
+            root_disk: PathBuf::from("/rootfs.ext4"),
            vcpus: 2,
            mem_mib: 2048,
            exec_path: "/srv/openshell-vm-sandbox-init.sh".to_string(),
@@ -1377,6 +1330,8 @@
    fn kernel_cmdline_keeps_guest_init_metadata_out_of_proc_cmdline() {
        let cmdline = build_kernel_cmdline(&qemu_config());

+        assert!(cmdline.contains("root=/dev/vda"));
+        assert!(cmdline.contains("rootfstype=ext4"));
        assert!(cmdline.contains("ip=10.0.128.2::10.0.128.1:255.255.255.252:sandbox::off"));
        assert!(cmdline.contains("firmware_class.path=/lib/firmware"));
        assert!(!cmdline.contains("VM_NET_IP="));
diff --git a/docs/reference/sandbox-compute-drivers.mdx b/docs/reference/sandbox-compute-drivers.mdx
index 5fea1b6ae..b557ed164 100644
--- a/docs/reference/sandbox-compute-drivers.mdx
+++ b/docs/reference/sandbox-compute-drivers.mdx
@@ -73,6 +73,8 @@ MicroVM-backed sandboxes run inside VM-backed isolation instead of a container b

The gateway uses the VM compute driver to create VM-backed sandboxes. MicroVM requires host virtualization support. It uses [libkrun](https://github.com/containers/libkrun) with Apple's [Hypervisor framework](https://developer.apple.com/documentation/hypervisor) on macOS, KVM on Linux, and [QEMU](https://www.qemu.org/) for GPU-backed sandboxes on Linux.

+The VM driver prepares an ext4 root disk image from the selected sandbox image and boots each sandbox from its own copy. The host owns only the disk image file; guest UID/GID ownership, including `/sandbox`, lives inside the ext4 filesystem metadata.
+
For maintainer-level implementation details, refer to the [VM driver README](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-vm/README.md).

| Option | Environment variable | Description |
diff --git a/e2e/rust/e2e-vm.sh b/e2e/rust/e2e-vm.sh
index a4afb16d5..149deb29d 100755
--- a/e2e/rust/e2e-vm.sh
+++ b/e2e/rust/e2e-vm.sh
@@ -56,9 +56,9 @@ DRIVER_BIN="${ROOT}/target/debug/openshell-driver-vm"
STATE_DIR_ROOT="/tmp"

# Smoke test timeouts. First boot extracts the embedded libkrun runtime
-# (~60-90MB of zstd per architecture) and prepares a sandbox rootfs from the
-# configured image. The guest then starts the sandbox supervisor directly; a
-# cold microVM is typically ready within ~15s after image preparation.
+# (~60-90MB of zstd per architecture) and prepares an ext4 root disk from the
+# configured image. The guest then starts the sandbox supervisor directly; a cold
+# microVM is typically ready within ~15s after image preparation.
GATEWAY_READY_TIMEOUT=60
SANDBOX_PROVISION_TIMEOUT=180

@@ -104,7 +104,7 @@ s.close()')"

# Per-run state dir so concurrent e2e runs don't collide on the UDS or
# sandbox state. The VM driver creates `<state-dir>/compute-driver.sock`
-# and `<state-dir>/sandboxes/<sandbox-id>/rootfs/` under here. Keep the
+# and `<state-dir>/sandboxes/<sandbox-id>/rootfs.ext4` under here. Keep the
# basename short — see the SUN_LEN comment above.
RUN_STATE_DIR="${STATE_DIR_ROOT}/os-vm-e2e-${HOST_PORT}-$$"
mkdir -p "${RUN_STATE_DIR}"
@@ -147,7 +147,7 @@ cleanup() {
    rm -f "${GATEWAY_LOG}" 2>/dev/null || true

    # Only wipe the per-run state dir on success.
On failure, leave it for - # post-mortem (serial console logs, gvproxy logs, rootfs dumps). + # post-mortem (serial console logs, gvproxy logs, root disk images). if [ "${exit_code}" -eq 0 ]; then rm -rf "${RUN_STATE_DIR}" 2>/dev/null || true else @@ -220,10 +220,10 @@ echo "==> Gateway ready after ${elapsed}s" export OPENSHELL_GATEWAY_ENDPOINT="http://127.0.0.1:${HOST_PORT}" -# The VM driver creates each sandbox VM from scratch — the embedded -# rootfs is extracted per sandbox, and the guest's sandbox supervisor -# then initializes policy, netns, Landlock, and sshd. On a cold host -# this is ~15s; allow 180s for slower CI runners. +# The VM driver creates each sandbox VM from a copied ext4 root disk, and the +# guest's sandbox supervisor then initializes policy, netns, Landlock, and sshd. +# On a cold host this is ~15s after image preparation; allow 180s for slower CI +# runners. export OPENSHELL_PROVISION_TIMEOUT="${SANDBOX_PROVISION_TIMEOUT}" echo "==> Running e2e smoke test (endpoint: ${OPENSHELL_GATEWAY_ENDPOINT})"
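
Note on the debugfs-based injection used above: `write_rootfs_image_file` and
`set_rootfs_image_file_mode` shell out to e2fsprogs rather than mounting the
guest image. A standalone sketch of the same approach (the `rootfs.ext4` path
and `/etc/motd` target are illustrative, not part of this patch; assumes
`debugfs` is on `PATH` and the image already exists):

```rust
use std::process::Command;

/// Stage `contents` to a temp file, then ask debugfs to copy it into the
/// ext4 image at `guest_path` — the host never mounts the filesystem.
fn inject(image: &str, guest_path: &str, contents: &[u8]) -> Result<(), String> {
    let tmp = std::env::temp_dir().join(format!("inject-{}", std::process::id()));
    std::fs::write(&tmp, contents).map_err(|e| e.to_string())?;
    let out = Command::new("debugfs")
        .arg("-w") // open the image read-write
        .arg("-R")
        .arg(format!("write {} {}", tmp.display(), guest_path))
        .arg(image)
        .output()
        .map_err(|e| e.to_string())?;
    let _ = std::fs::remove_file(&tmp);
    if out.status.success() {
        Ok(())
    } else {
        Err(String::from_utf8_lossy(&out.stderr).into_owned())
    }
}

fn main() {
    // debugfs `write` does not replace an existing file, which is why the
    // driver issues a best-effort `rm <guest_path>` before writing.
    inject("rootfs.ext4", "/etc/motd", b"hello from the host\n").expect("inject");
}
```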