Add Nsight GPU metrics collection #307
Open
jordanhubbard wants to merge 1 commit into coroot:main from
Summary
This PR adds an Nsight Systems GPU metrics path for node-level NVIDIA GPU telemetry while keeping the existing NVML path as the fallback.
GPU hosts deserve special treatment in Coroot because the accelerator is usually the scarce and expensive resource on the node. CPU and memory can look healthy while an enterprise GPU fleet is either idle, bottlenecked, or saturated. For teams paying for H100/A100/L40-class instances or on-prem accelerator capacity, utilization and occupancy are core signals for capacity planning, workload placement, and return on infrastructure spend.
What changed
- Detects the `nsys` binary from the container path or host-mounted Nsight Systems installations and validates that GPU metrics are available before enabling the collector.
- Reads the `GR Active [Throughput %]` column from the `nsys stats` report.
- Exports `node_resources_gpu_compute_occupancy_percent_avg` and `node_resources_gpu_compute_occupancy_percent_peak`.
- Resolves paths through `/proc/1/root` when needed so generated reports and paths resolve correctly on the host filesystem.
- Gates the live test behind `COROOT_TEST_NSIGHT=1`.

Why utilization and occupancy both matter
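As a rough illustration of the report-parsing step, the sketch below extracts the `GR Active [Throughput %]` column from a CSV-formatted `nsys stats` report and reduces it to the average and peak values exported as the two occupancy metrics. The CSV assumption, the function name, and the reduction shape are illustrative, not the PR's actual code:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// parseGRActive is a hypothetical helper: it finds the
// "GR Active [Throughput %]" column in a CSV report and returns the
// average and peak of its numeric samples.
func parseGRActive(report string) (avg, peak float64, err error) {
	rows, err := csv.NewReader(strings.NewReader(report)).ReadAll()
	if err != nil || len(rows) < 2 {
		return 0, 0, fmt.Errorf("empty or malformed report")
	}
	col := -1
	for i, name := range rows[0] {
		if strings.TrimSpace(name) == "GR Active [Throughput %]" {
			col = i
			break
		}
	}
	if col < 0 {
		return 0, 0, fmt.Errorf("GR Active column not found")
	}
	var sum float64
	n := 0
	for _, row := range rows[1:] {
		v, err := strconv.ParseFloat(strings.TrimSpace(row[col]), 64)
		if err != nil {
			continue // skip non-numeric rows
		}
		sum += v
		if v > peak {
			peak = v
		}
		n++
	}
	if n == 0 {
		return 0, 0, fmt.Errorf("no samples")
	}
	return sum / float64(n), peak, nil
}

func main() {
	report := "Timestamp,GR Active [Throughput %]\n0,10.0\n1,90.0\n2,50.0\n"
	avg, peak, _ := parseGRActive(report)
	fmt.Printf("avg=%.1f peak=%.1f\n", avg, peak)
}
```

The avg/peak split mirrors the two exported gauges: the average captures sustained occupancy over the sampling window, while the peak shows whether the workload ever fully engages the device.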
GPU utilization is the first signal that expensive accelerator capacity is doing work, but it does not fully explain whether kernels are keeping the device effectively occupied. Compute occupancy adds the missing enterprise operations signal: it helps distinguish work that merely touches the GPU from work that keeps streaming multiprocessors busy. Together with memory bandwidth utilization, these metrics make it easier to identify idle GPUs, memory-bound workloads, inefficient kernels, and saturation before the only visible symptom is poor application throughput.
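To make the distinction concrete, here is a toy classification of a GPU's state from the two signals. The thresholds and labels are illustrative assumptions, not values from this PR or from Coroot's alerting:

```go
package main

import "fmt"

// classify gives a rough read of a GPU's state from average utilization
// and SM occupancy percentages. Thresholds are illustrative only.
func classify(utilization, occupancy float64) string {
	switch {
	case utilization < 5:
		return "idle"
	case occupancy < 25:
		// The GPU is touched, but most SMs sit unused: small launches,
		// memory-bound kernels, or otherwise inefficient work.
		return "low occupancy: inefficient or memory-bound kernels"
	case utilization > 90 && occupancy > 75:
		return "saturated"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(classify(95, 10)) // busy but poorly occupied
	fmt.Println(classify(2, 0))   // idle despite being allocated
}
```

The point of the example is the second case: utilization alone would report this GPU as busy, while occupancy reveals that the expensive hardware is mostly waiting.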
Companion UI work
Companion Coroot UI/backend PR: coroot/coroot#888. That PR consumes the compute occupancy metrics added here and surfaces GPU utilization, memory utilization, and SM occupancy in the Nodes overview and GPU audit views.
Testing
- Ran the full suite in a container: `docker run --rm -v "$PWD":/src -w /src golang:1.25-bookworm sh -c 'apt-get update >/tmp/apt-update.log && apt-get install -y libsystemd-dev >/tmp/apt-install.log && /usr/local/go/bin/go test ./...'`
- The live test, `COROOT_TEST_NSIGHT=1 go test ./gpu -run TestNsightCollectorLive`, was not run here, because it requires a GPU host with Nsight Systems installed.