Add Nsight GPU metrics collection#307

Open
jordanhubbard wants to merge 1 commit into coroot:main from jordanhubbard:nsight-gpu-metrics

Conversation

@jordanhubbard jordanhubbard commented May 7, 2026

Summary

This adds an Nsight Systems GPU metrics path for node-level NVIDIA GPU telemetry while keeping the existing NVML path as the fallback.

GPU hosts deserve special treatment in Coroot because the accelerator is usually the scarce and expensive resource on the node. CPU and memory can look healthy while an enterprise GPU fleet is either idle, bottlenecked, or saturated. For teams paying for H100/A100/L40-class instances or on-prem accelerator capacity, utilization and occupancy are core signals for capacity planning, workload placement, and return on infrastructure spend.

What changed

  • Detects a usable nsys binary from the container path or host-mounted Nsight Systems installations and validates that GPU metrics are available before enabling the collector.
  • Runs a short periodic Nsight Systems capture aligned with the scrape interval and extracts GPU metrics through a custom nsys stats report.
  • Uses Nsight GPU metrics for node-level average and peak GPU utilization from GR Active [Throughput %].
  • Uses Nsight DRAM read/write bandwidth metrics for node-level average and peak GPU memory utilization.
  • Adds new node-level compute occupancy metrics:
    • node_resources_gpu_compute_occupancy_percent_avg
    • node_resources_gpu_compute_occupancy_percent_peak
  • Keeps NVML for GPU inventory, memory totals/used bytes, temperature, power, process/container utilization samples, and fallback node utilization when Nsight is not installed, unsupported, or stale.
  • Handles host-installed Nsight Systems by chrooting into /proc/1/root when needed so generated reports and paths resolve correctly on the host filesystem.
  • Adds unit coverage for Nsight CSV parsing, UUID normalization, host path translation, and the no-GPU guard; leaves the live Nsight integration test opt-in via COROOT_TEST_NSIGHT=1.
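The steps above hinge on turning an `nsys stats` CSV export into node-level average and peak values. The column name `GR Active [Throughput %]` comes from this PR's description, but the exact CSV shape and the function name below are illustrative assumptions, not the collector's actual code:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// parseGRActive extracts per-sample "GR Active [Throughput %]" values from a
// simplified nsys stats CSV export and aggregates them into the node-level
// average and peak utilization. Hypothetical sketch, not the PR's parser.
func parseGRActive(csvData string) (avg, peak float64, err error) {
	rows, err := csv.NewReader(strings.NewReader(csvData)).ReadAll()
	if err != nil || len(rows) < 2 {
		return 0, 0, fmt.Errorf("no samples in report")
	}
	col := -1
	for i, h := range rows[0] {
		if h == "GR Active [Throughput %]" {
			col = i
		}
	}
	if col < 0 {
		return 0, 0, fmt.Errorf("GR Active column not found")
	}
	var sum float64
	n := 0
	for _, row := range rows[1:] {
		v, perr := strconv.ParseFloat(row[col], 64)
		if perr != nil {
			continue // skip malformed samples rather than failing the scrape
		}
		sum += v
		n++
		if v > peak {
			peak = v
		}
	}
	if n == 0 {
		return 0, 0, fmt.Errorf("no numeric samples")
	}
	return sum / float64(n), peak, nil
}

func main() {
	data := "Timestamp,GR Active [Throughput %]\n0,10.0\n1,90.0\n2,50.0\n"
	avg, peak, _ := parseGRActive(data)
	fmt.Printf("avg=%.1f peak=%.1f\n", avg, peak) // avg=50.0 peak=90.0
}
```

Skipping malformed rows instead of aborting mirrors the fallback-friendly design described above: a partially readable capture still yields metrics, and the NVML path covers the rest.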

Why utilization and occupancy both matter

GPU utilization is the first signal that expensive accelerator capacity is doing work, but it does not fully explain whether kernels are keeping the device effectively occupied. Compute occupancy adds the missing enterprise operations signal: it helps distinguish work that merely touches the GPU from work that keeps streaming multiprocessors busy. Together with memory bandwidth utilization, these metrics make it easier to identify idle GPUs, memory-bound workloads, inefficient kernels, and saturation before the only visible symptom is poor application throughput.
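The interplay of the three signals can be sketched as a simple heuristic. The thresholds and category names here are hypothetical, for illustration only; Coroot's actual audits do not necessarily use these cutoffs:

```go
package main

import "fmt"

// classifyGPU combines utilization, SM occupancy, and memory-bandwidth
// utilization (all in percent) into an operational label. Thresholds are
// made up for illustration, not taken from the PR.
func classifyGPU(utilPct, occupancyPct, memBWPct float64) string {
	switch {
	case utilPct < 5:
		return "idle" // paying for an accelerator that is doing nothing
	case memBWPct > 80 && occupancyPct < 40:
		return "memory-bound" // kernels stall on DRAM, SMs underfilled
	case utilPct > 50 && occupancyPct < 25:
		return "inefficient-kernels" // GPU looks busy but SMs are mostly empty
	case utilPct > 90 && occupancyPct > 70:
		return "saturated" // candidate for scale-out or placement changes
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(classifyGPU(2, 0, 1))    // idle
	fmt.Println(classifyGPU(95, 80, 60)) // saturated
	fmt.Println(classifyGPU(60, 15, 30)) // inefficient-kernels
}
```

The point of the sketch is the ordering: utilization alone would label all three non-idle cases "busy", while occupancy and memory bandwidth separate them into actionable buckets.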

Companion UI work

Companion Coroot UI/backend PR: coroot/coroot#888. That PR consumes the compute occupancy metrics added here and surfaces GPU utilization, memory utilization, and SM occupancy in the Nodes overview and GPU audit views.

Testing

  • docker run --rm -v "$PWD":/src -w /src golang:1.25-bookworm sh -c 'apt-get update >/tmp/apt-update.log && apt-get install -y libsystemd-dev >/tmp/apt-install.log && /usr/local/go/bin/go test ./...'
  • Not run by default: COROOT_TEST_NSIGHT=1 go test ./gpu -run TestNsightCollectorLive, because it requires a GPU host with Nsight Systems installed.

