Add Nsight GPU metrics collection#307

Open
jordanhubbard wants to merge 1 commit into coroot:main from jordanhubbard:nsight-gpu-metrics

Conversation

@jordanhubbard jordanhubbard commented May 7, 2026

Summary

This adds an Nsight Systems GPU metrics path for node-level NVIDIA GPU telemetry while keeping the existing NVML path as the fallback.

GPU hosts deserve special treatment in Coroot because the accelerator is usually the scarce and expensive resource on the node. CPU and memory can look healthy while an enterprise GPU fleet is either idle, bottlenecked, or saturated. For teams paying for H100/A100/L40-class instances or on-prem accelerator capacity, utilization and occupancy are core signals for capacity planning, workload placement, and return on infrastructure spend.

What changed

  • Detects a usable nsys binary from the container path or host-mounted Nsight Systems installations and validates that GPU metrics are available before enabling the collector.
  • Runs a short periodic Nsight Systems capture aligned with the scrape interval and extracts GPU metrics through a custom nsys stats report.
  • Uses Nsight GPU metrics for node-level average and peak GPU utilization from GR Active [Throughput %].
  • Uses Nsight DRAM read/write bandwidth metrics for node-level average and peak GPU memory utilization.
  • Adds new node-level compute occupancy metrics:
    • node_resources_gpu_compute_occupancy_percent_avg
    • node_resources_gpu_compute_occupancy_percent_peak
  • Keeps NVML for GPU inventory, memory totals/used bytes, temperature, power, process/container utilization samples, and fallback node utilization when Nsight is not installed, unsupported, or stale.
  • Handles host-installed Nsight Systems by chrooting into /proc/1/root when needed so generated reports and paths resolve correctly on the host filesystem.
  • Adds unit coverage for Nsight CSV parsing, UUID normalization, host path translation, and the no-GPU guard; leaves the live Nsight integration test opt-in via COROOT_TEST_NSIGHT=1.
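The steps above hinge on turning an `nsys stats` CSV export into node-level average and peak values. The column name `GR Active [Throughput %]` comes from this PR's description, but the exact CSV shape and the function name below are illustrative assumptions, not the collector's actual code:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// parseGRActive extracts per-sample "GR Active [Throughput %]" values from a
// simplified nsys stats CSV export and aggregates them into the node-level
// average and peak utilization. Hypothetical sketch, not the PR's parser.
func parseGRActive(csvData string) (avg, peak float64, err error) {
	rows, err := csv.NewReader(strings.NewReader(csvData)).ReadAll()
	if err != nil || len(rows) < 2 {
		return 0, 0, fmt.Errorf("no samples in report")
	}
	col := -1
	for i, h := range rows[0] {
		if h == "GR Active [Throughput %]" {
			col = i
		}
	}
	if col < 0 {
		return 0, 0, fmt.Errorf("GR Active column not found")
	}
	var sum float64
	n := 0
	for _, row := range rows[1:] {
		v, perr := strconv.ParseFloat(row[col], 64)
		if perr != nil {
			continue // skip malformed samples rather than failing the scrape
		}
		sum += v
		n++
		if v > peak {
			peak = v
		}
	}
	if n == 0 {
		return 0, 0, fmt.Errorf("no numeric samples")
	}
	return sum / float64(n), peak, nil
}

func main() {
	data := "Timestamp,GR Active [Throughput %]\n0,10.0\n1,90.0\n2,50.0\n"
	avg, peak, _ := parseGRActive(data)
	fmt.Printf("avg=%.1f peak=%.1f\n", avg, peak) // avg=50.0 peak=90.0
}
```

Skipping malformed rows instead of aborting mirrors the fallback-friendly design described above: a partially readable capture still yields metrics, and the NVML path covers the rest.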

Why utilization and occupancy both matter

GPU utilization is the first signal that expensive accelerator capacity is doing work, but it does not fully explain whether kernels are keeping the device effectively occupied. Compute occupancy adds the missing enterprise operations signal: it helps distinguish work that merely touches the GPU from work that keeps streaming multiprocessors busy. Together with memory bandwidth utilization, these metrics make it easier to identify idle GPUs, memory-bound workloads, inefficient kernels, and saturation before the only visible symptom is poor application throughput.
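The interplay of the three signals can be sketched as a simple heuristic. The thresholds and category names here are hypothetical, for illustration only; Coroot's actual audits do not necessarily use these cutoffs:

```go
package main

import "fmt"

// classifyGPU combines utilization, SM occupancy, and memory-bandwidth
// utilization (all in percent) into an operational label. Thresholds are
// made up for illustration, not taken from the PR.
func classifyGPU(utilPct, occupancyPct, memBWPct float64) string {
	switch {
	case utilPct < 5:
		return "idle" // paying for an accelerator that is doing nothing
	case memBWPct > 80 && occupancyPct < 40:
		return "memory-bound" // kernels stall on DRAM, SMs underfilled
	case utilPct > 50 && occupancyPct < 25:
		return "inefficient-kernels" // GPU looks busy but SMs are mostly empty
	case utilPct > 90 && occupancyPct > 70:
		return "saturated" // candidate for scale-out or placement changes
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(classifyGPU(2, 0, 1))    // idle
	fmt.Println(classifyGPU(95, 80, 60)) // saturated
	fmt.Println(classifyGPU(60, 15, 30)) // inefficient-kernels
}
```

The point of the sketch is the ordering: utilization alone would label all three non-idle cases "busy", while occupancy and memory bandwidth separate them into actionable buckets.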

Companion UI work

Companion Coroot UI/backend PR: coroot/coroot#888. That PR consumes the compute occupancy metrics added here and surfaces GPU utilization, memory utilization, and SM occupancy in the Nodes overview and GPU audit views.

Testing

  • docker run --rm -v "$PWD":/src -w /src golang:1.25-bookworm sh -c 'apt-get update >/tmp/apt-update.log && apt-get install -y libsystemd-dev >/tmp/apt-install.log && /usr/local/go/bin/go test ./...'
  • Not run by default: COROOT_TEST_NSIGHT=1 go test ./gpu -run TestNsightCollectorLive, because it requires a GPU host with Nsight Systems installed.

