GPUMan

GPUMan is a reliability tool designed for shared GPU environments. It prevents a single misbehaving process from triggering a catastrophic, system-wide CUDA Out-of-Memory (OOM) failure.

TL;DR: GPUMan prevents a single CUDA process from crashing all other GPU workloads by enforcing per-process VRAM limits using NVML.

Key Features

  • Per-Process Limits: Enforces strict memory ceilings for individual CUDA workloads.
  • Proactive Enforcement: Monitors usage via NVML and terminates offending processes before they can destabilize the GPU.
  • Non-Invasive: Provides a practical safeguard without requiring specialized hardware features (like MIG) or complex code changes.
  • Operational Safety: Protects shared workstations, CI runners, and ad-hoc services from "noisy neighbor" crashes.

Problem: Lack of Memory Isolation in Shared GPU Environments

Modern GPU infrastructure is increasingly shared — whether across multiple users on a single workstation, concurrent jobs on a cluster node, or automated CI/CD runners. Despite this, CUDA provides no native per-process memory isolation on consumer or mid-range enterprise hardware.

This lack of isolation leads to a critical operational failure mode:

  • Global Instability: When a single misbehaving process exhausts available VRAM, the resulting Out-of-Memory (OOM) error is not isolated. It often causes all active CUDA contexts on that GPU to fail simultaneously.
  • Workload Disruption: Unrelated processes—including long-running training jobs or production inference services—can crash or become unresponsive, leading to significant data loss and downtime.
  • The "Noisy Neighbor" Effect: In a shared environment, a single user’s error can effectively destabilize the entire system, rendering the hardware unusable for others.

GPUMan addresses these risks by providing a practical enforcement layer. By monitoring memory via NVIDIA’s NVML interface and enforcing strict per-process limits, it terminates offending workloads before they can reach the threshold of a catastrophic, GPU-wide failure.

Who GPUMan Is For

GPUMan is intended for:

  • Shared research workstations
  • CI runners with GPU access
  • Multi-user development servers
  • Ad-hoc inference or training environments

It is not a replacement for hardware isolation (MIG) or a full scheduler.

Architecture

GPUMan consists of two executables:

  • A user-facing CLI
  • A long-running daemon

The CLI translates user intent into structured messages and sends them to the daemon over IPC (Unix Domain Sockets). It also uses the same information to update an on-disk manifest of managed processes.
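
As a rough illustration of the CLI-to-daemon hop, the sketch below sends one request across a connected pair of Unix domain sockets. The README does not document the wire format, so the newline-terminated JSON encoding, the field names, and the helper names (encode_request, decode_request, round_trip) are all assumptions made for this example, not GPUMan's actual protocol.

```python
import json
import socket


def encode_request(action: str, **fields) -> bytes:
    """Serialize a request as one newline-terminated JSON object (assumed format)."""
    return (json.dumps({"action": action, **fields}) + "\n").encode("utf-8")


def decode_request(raw: bytes) -> dict:
    """Parse one newline-terminated JSON request."""
    return json.loads(raw.decode("utf-8").strip())


def round_trip(request: bytes) -> dict:
    """Send a request from a 'CLI' socket and read it on a 'daemon' socket."""
    cli_end, daemon_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        cli_end.sendall(request)
        cli_end.shutdown(socket.SHUT_WR)  # tell the reader the request is complete
        chunks = []
        while True:
            chunk = daemon_end.recv(4096)
            if not chunk:  # empty read means the writer shut down
                break
            chunks.append(chunk)
        return decode_request(b"".join(chunks))
    finally:
        cli_end.close()
        daemon_end.close()


# Hypothetical payload mirroring the options of "gpuman run".
message = round_trip(
    encode_request(
        "run",
        tag="test_cuda_app",
        command="/opt/demo/cuda_mem_stable",
        memory=536870912,
    )
)
print(message["action"], message["tag"], message["memory"])
```

A real daemon would instead bind a socket at a filesystem path and accept connections; socketpair is used here only so the example runs in a single process.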

The daemon owns all privileged operations such as process supervision and GPU monitoring. On startup, the daemon restores and supervises all processes defined in the on-disk manifest. It periodically polls NVML for memory usage statistics and compares them against the limits specified by the user. If a process exceeds its configured limit, the daemon terminates it before a global CUDA OOM can occur.
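
The compare-against-limit step can be sketched in a few lines. This is a minimal illustration, not the daemon's actual C++ code: the Sample type and find_offenders function are hypothetical, and the readings are hard-coded. In a real daemon they would come from an NVML per-process query (NVML exposes nvmlDeviceGetComputeRunningProcesses, which reports a PID and used GPU memory per process; whether GPUMan uses that exact call is not stated here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One per-process reading from a single poll (hypothetical type)."""
    pid: int
    used_bytes: int


def find_offenders(samples, limits):
    """Return sorted PIDs of managed processes whose usage exceeds their limit.

    samples: iterable of Sample, one poll's worth of readings
    limits:  dict mapping managed pid -> limit in bytes
    PIDs absent from limits are unmanaged and therefore ignored.
    """
    return sorted(
        s.pid for s in samples
        if s.pid in limits and s.used_bytes > limits[s.pid]
    )


MIB = 1024**2
poll = [Sample(101, 400 * MIB), Sample(202, 600 * MIB), Sample(303, 900 * MIB)]
limits = {101: 512 * MIB, 202: 512 * MIB}  # PID 303 is not managed

offenders = find_offenders(poll, limits)
print(offenders)  # PIDs the daemon would terminate after this poll
```

Keeping the decision a pure function of one poll's readings makes it easy to test without a GPU; the polling and termination steps sit around it.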

The daemon also listens for commands from the CLI to add, remove, restart, or update managed processes, and it can report current GPU memory usage and the list of managed processes. The CLI provides a user-friendly interface to these features.
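
One way to keep an on-disk manifest consistent with such commands is to apply each command to an in-memory copy and then replace the file atomically. The sketch below assumes a JSON manifest keyed by tag; the schema, the file name, and the function names are hypothetical, since the README does not describe the real manifest format. Only run, update, and remove are modeled, because this sketch tracks configuration only.

```python
import json
import os
import tempfile


def load_manifest(path):
    """Return the manifest dict, or an empty one if the file does not exist."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_manifest(path, manifest):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(manifest, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


def apply_command(manifest, command):
    """Return a new manifest with one run/update/remove command applied."""
    updated = dict(manifest)
    action, tag = command["action"], command["tag"]
    if action == "run":
        updated[tag] = {"command": command["command"], "memory": command["memory"]}
    elif action == "update":
        updated[tag] = {**updated[tag], "memory": command["memory"]}
    elif action == "remove":
        updated.pop(tag, None)
    else:
        raise ValueError(f"unsupported action: {action}")
    return updated


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "manifest.json")
    manifest = load_manifest(path)  # empty on first start
    manifest = apply_command(manifest, {"action": "run", "tag": "job",
                                        "command": "/opt/demo/app",
                                        "memory": 536870912})
    manifest = apply_command(manifest, {"action": "update", "tag": "job",
                                        "memory": 268435456})
    save_manifest(path, manifest)
    restored = load_manifest(path)  # what a restarted daemon would see

print(restored)
```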

Building GPUMan

On my development system, I am using Ubuntu 24.04 LTS with an NVIDIA RTX 4050 GPU and the proprietary NVIDIA drivers installed.

The project depends on NVML (NVIDIA Management Library) and CUDA, though CUDA is only needed to build the test binaries for the main program. To install these dependencies on a Debian-based system, run the following command:

sudo apt install libnvidia-ml-dev nvidia-cuda-toolkit

GPUMan is built using CMake. To build the project, ensure you have CMake and a compatible C++ compiler installed. This can be done on a Debian-based system with the following command:

sudo apt install cmake build-essential clang ninja-build

I aim to keep the build compiler- and generator-agnostic, but I test with Clang and Ninja and will do my best to make sure that at least this configuration works well. To clone and build the project, run the following commands:

git clone https://github.com/Nalin-Angrish/GPUMan.git
cd GPUMan
cmake --preset ninja-clang
cd build
ninja
sudo ninja install
cd ..
chmod +x postinstall.sh
sudo ./postinstall.sh

Usage

The GPUMan executable and the daemon are installed in the system directories after running sudo ninja install. To start the GPUMan daemon, use the following commands:

sudo systemctl daemon-reload
sudo systemctl start gpumand.service

And to enable it to start on boot, use:

sudo systemctl enable gpumand.service

To use the GPUMan CLI, you will need to add your user to the gpuman group. This can be done with the following command:

sudo usermod -aG gpuman $USER

You will need to log out and log back in for the group change to take effect.

To run a CUDA application with GPUMan enforcing a memory limit, use the following command:

gpuman run --tag <a_tag_for_the_process> --command <command_to_execute_in_terminal> --memory <max_memory_in_bytes>

Note that an executable that is not on the PATH must be specified by its full path. For example, to run a test binary with a 512 MiB limit (536870912 bytes):

gpuman run --tag test_cuda_app --command $PWD/build/cuda_mem_stable --memory 536870912
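
Because --memory takes a raw byte count, a small helper can save some arithmetic when writing commands by hand. The to_bytes function below is a hypothetical convenience for this README, not part of the gpuman CLI; it accepts binary units only.

```python
import re

# Binary (power-of-two) units only.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def to_bytes(size: str) -> int:
    """Convert a string such as '512MiB' or '2 GiB' to a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*(B|KiB|MiB|GiB)\s*", size)
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


print(to_bytes("512MiB"))  # 536870912, the value used in the example above
```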

Command-Line Options

Usage: gpuman [--help] [--version] {proclist,remove,restart,run,status,update}

GPU memory manager for multi-tenant CUDA workloads

Optional arguments:
  -h, --help     shows help message and exits 
  -v, --version  prints version information and exits 

Subcommands:
  proclist      View all processes using GPU memory
  remove        Remove a running application from GPU memory management
  restart       Restart a running application from GPU memory management
  run           Run a CUDA application with GPU memory management
  status        Show the status of GPU memory usage
  update        Update the GPU memory management configuration for a running application

Detailed help for each subcommand and the arguments that can be passed can be accessed using gpuman <subcommand> --help.

Demo

🔤 This video has no audio — please enable captions for context.

Known Limitations

  • Enforcement is polling-based (not instantaneous)
  • Memory spikes between polls may briefly exceed limits
  • No true isolation (that requires hardware MIG, which is not supported on consumer GPUs)
  • No support for multi-GPU scheduling in v1

These tradeoffs are explicit and documented by design.
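
To get a feel for the polling tradeoff, a back-of-the-envelope model helps: a process that allocates at a steady rate can grow by at most rate × interval between two consecutive polls. The sketch below applies that bound to choose a limit with headroom. The function names and the numbers are illustrative assumptions (the README does not state the daemon's poll interval), and the model ignores the time needed to actually terminate the process.

```python
def worst_case_overshoot(alloc_rate_bytes_per_s: float, poll_interval_s: float) -> float:
    """Upper bound on growth between two polls at a steady allocation rate."""
    return alloc_rate_bytes_per_s * poll_interval_s


def limit_with_headroom(budget_bytes: float, alloc_rate_bytes_per_s: float,
                        poll_interval_s: float) -> float:
    """Largest limit that keeps budget_bytes safe under the simple model above."""
    headroom = worst_case_overshoot(alloc_rate_bytes_per_s, poll_interval_s)
    return max(0, budget_bytes - headroom)


GIB = 1024**3

# Illustrative numbers: 1 GiB/s allocation rate, 0.5 s between polls.
overshoot = worst_case_overshoot(1 * GIB, 0.5)
limit = limit_with_headroom(4 * GIB, 1 * GIB, 0.5)
print(overshoot / GIB, limit / GIB)  # overshoot and limit, in GiB
```

Under these assumed numbers a process could overshoot by up to 0.5 GiB, so a 4 GiB budget would call for a limit of about 3.5 GiB.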

Roadmap

GPUMan is intentionally scoped, but designed to evolve. Planned future features include:

  • Allocation Interception

    • Hook cudaMalloc / cudaMallocAsync
    • Enforce limits at allocation time
    • Reduce reliance on polling
  • Kubernetes Integration

    • GPUMan as a device-plugin companion
    • Per-pod GPU memory quotas
    • Node-level GPU protection
  • Soft Limits & Throttling

    • Warning thresholds
    • Graceful degradation
    • Priority-aware enforcement
  • NVIDIA MIG Awareness

    • Support MIG-partitioned GPUs
    • Enforce limits per MIG slice
    • Stronger isolation guarantees
    • I really want this but don't have hardware to test on, so I'm leaving it out for now.
