GPU plotter for Chia v2 proofs of space (CHIP-48). Produces farmable `.plot2` files byte-identical to the pos2-chip CPU reference.

Status — work in progress. Plots are byte-identical to the pos2-chip CPU reference and deterministic across runs; performance, AMD/Intel support, and the install/CI story are still evolving. Use the `cuda-only` branch for the most-tested path.
Branches:

- `main` — SYCL/AdaptiveCpp port; runs on NVIDIA + AMD + Intel (CUB fast path preserved on NVIDIA).
- `cuda-only` — original pure-CUDA path; pick it if you only target NVIDIA. See Performance for the tradeoff.
# Install — needs CUDA Toolkit 12+ (or AdaptiveCpp for AMD/Intel),
# CMake ≥ 3.24, a C++20 compiler, and Rust. See Build for alternatives.
cargo install --git https://github.com/Jsewill/xchplot2
# Plot — 10 × k=28 files, keys derived internally from your BLS pair.
xchplot2 plot -k 28 -n 10 \
-f <farmer-pk-hex> \
-c <pool-contract-xch1-or-txch1> \
-o /mnt/plots
# Multi-GPU — one worker per GPU, round-robin partition.
# (`--devices all` adds a CPU worker too; `--devices gpu` sticks to GPUs.)
xchplot2 plot ... --devices gpu

See Hardware compatibility for GPU / VRAM / OS requirements, Build for container / native / CMake paths, and Use for every flag.
Windows users: this cargo install line works under WSL2; for
native Windows or a non-WSL setup, jump to Windows.
- GPU:
  - NVIDIA, compute capability ≥ 5.0 (Maxwell / GTX 750-class and newer) via the CUDA fast path. Builds auto-detect the installed GPU's `compute_cap` via `nvidia-smi`; override with `$CUDA_ARCHITECTURES` for fat or cross-target builds (see Build). Pre-sm_53 cards lack native FP16 ALUs, but `cuda_fp16.h` falls back to fp32 emulation for the half-precision intrinsics — kernels work correctly with the emulation cost. On dual-vendor hosts (e.g. AMD primary + secondary NVIDIA), `build.rs` also routes around CUDA 13.x + sm < 75 (the toolkit dropped Maxwell-Volta codegen) so an old NVIDIA card next to a working AMD GPU no longer derails the build.
  - AMD ROCm via the SYCL / AdaptiveCpp path. Validated on RDNA2 (gfx1031, RX 6700 XT, 12 GB) — bit-exact parity with the CUDA backend across the sort / bucket-offsets / g_x kernels, and farmable plots end-to-end. ROCm 6.2 required (newer ROCm versions have LLVM packaging breakage — see the `compose.yaml` rocm-service comments). The build picks `ACPP_TARGETS=hip:gfxXXXX` from `rocminfo` automatically for RDNA2+. Other gfx targets (gfx1030/gfx1100) build cleanly but are untested on real hardware. RDNA1 cards (gfx1010/gfx1011/gfx1012, e.g. Radeon Pro W5700, RX 5700 / 5700 XT) default to `ACPP_TARGETS=generic` (SSCP JIT) — a previous community workaround AOT-spoofed them as gfx1013, but that has been observed to silently produce no-op kernel stubs on at least one W5700 + ROCm 6 + AdaptiveCpp 25.10 setup. Generic SSCP works end-to-end through k=24 parity tests. Two opt-in escape hatches are preserved: `XCHPLOT2_FORCE_GFX_SPOOF=1` to restore the legacy AOT spoof, and `XCHPLOT2_NO_GFX_SPOOF=1` to AOT-target the actual ISA natively (the build fails clearly if AdaptiveCpp doesn't accept it).
  - Intel oneAPI is wired up but untested.
  - CPU (no GPU) via AdaptiveCpp's OpenMP backend. Opt-in with `--cpu` (or `--devices cpu`) — never the default. Plotting is 1-2 orders of magnitude slower than a real GPU; intended for headless CI, GPU-less dev machines, or as an extra worker alongside GPUs (`--devices all` runs every visible GPU plus a CPU worker on the same batch; `--devices gpu` sticks to GPUs). Build the container with `scripts/build-container.sh --gpu cpu` for the standalone CPU image (xchplot2:cpu, ~400 MB; no CUDA / ROCm in the image).
- VRAM: four tiers, picked automatically based on free device VRAM at k=28. All four produce byte-identical plots.
- Pool (~11 GB device + ~4 GB pinned host): fastest steady-state, used on 12 GB+ cards.
- Plain streaming (~7.3 GB peak + 128 MB margin): per-plot allocations, no pinned-host parks, single-pass T2 match. ~400 ms/plot faster than compact. Used on 10-11 GB cards that can't fit the pool but have headroom above compact.
- Compact streaming (~5.2 GB peak + 128 MB margin): full park/rehydrate + N=2 T2 match tiling. Used on 6-8 GB cards where plain won't fit. 6 GB cards (RTX 2060, RX 6600) are on the edge; 8 GB cards (3070, 2070 Super) comfortably fit.
- Minimal streaming (~3.76 GB peak + 128 MB margin): six layered cuts on top of compact — N=8 T2 match staging, tiled gathers in T1/T2 sort, sliced T1 match (per section_l), sliced T3 match (T2 inputs parked on host, slice H2D'd per section pair), per-tile CUB outputs in T1/T2/T3 sort with USM-host merges, and tiled Xs gen+sort+pack with host-pinned accumulation. Bottleneck moves from compact's T1 sort (5200 MB) to T3 match (3754 MB). Targets 5 GiB+ cards (RTX 2060, RX 6600 XT, RX 7600) comfortably; 4 GiB cards (GTX 1050 Ti, RTX 3050 4GB, MX450) are an edge case since real 4 GiB hardware reports ~3.5 GiB free post-CUDA-context. Trade-off: ~6 extra cap-sized PCIe round-trips per plot. k=28 wall on sm_89: ~34 s/plot vs ~13 s for compact. Detailed breakdown in VRAM.
With `--devices`, each worker picks its own tier from its own GPU's free VRAM — heterogeneous rigs (e.g. one 12 GB + one 8 GB card) plot concurrently with each device on its matching tier.
- PCIe: Gen4 x16 or wider recommended. A physically narrower slot (e.g. Gen4 x4) adds ~240 ms per plot to the final fragment D2H copy; check `cat /sys/bus/pci/devices/*/current_link_width` under load if throughput looks off.
- Host RAM: ≥ 16 GB recommended; `batch` mode pins ~4 GB of host memory for D2H double-buffering (pool or streaming).
- CUDA Toolkit: 12+ required for the NVIDIA build path (tested on 13.x). Skipped automatically on AMD/Intel builds where `nvcc` isn't available — `build.rs` runs `nvcc --version` and flips `XCHPLOT2_BUILD_CUDA=OFF` when missing. The toolkit-vs-arch matrix:
  - sm_50–sm_72 (Maxwell / Pascal / Volta): need CUDA 12.9, the last toolkit with codegen for these arches — 13.x dropped them entirely. `build.rs` catches the 13.x + old-arch pairing in a preflight and points at the fix path.
  - sm_75–sm_90 (Turing / Ampere / Hopper): 12.x or 13.x both work.
  - sm_120 (RTX 50-series Blackwell): need 12.8+; earlier toolkits lack Blackwell codegen.
- OS: Linux (tested on modern glibc distributions) is the supported path. Windows users route through either the `cuda-only` branch natively (NVIDIA + MSVC + CUDA) or WSL2 (any vendor WSL2 supports) — see Windows below. macOS is not supported (no CUDA, no modern SYCL runtime).
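As an illustration, the four-tier VRAM auto-pick described above reduces to a threshold ladder. This sketch is ours, not the plotter's code — the cutoffs are rough readings of the documented per-tier peaks plus the 128 MB margin, and the real selection logic may differ:

```shell
# Hypothetical sketch of the streaming-tier auto-pick; thresholds are
# illustrative approximations of the per-tier peaks in the list above.
pick_tier() {
  free_mb=$1  # free device VRAM in MB
  if [ "$free_mb" -ge 11500 ]; then echo pool      # ~11 GB device pool
  elif [ "$free_mb" -ge 7600 ]; then echo plain    # ~7.3 GB peak + 128 MB
  elif [ "$free_mb" -ge 5500 ]; then echo compact  # ~5.2 GB peak + 128 MB
  else echo minimal                                # ~3.76 GB peak + 128 MB
  fi
}
pick_tier 24000   # 24 GB card -> pool
pick_tier 3900    # 4 GiB card post-CUDA-context -> minimal
```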
- "I just want to plot, Linux host" → container (path 1). Smallest host install (just `podman` + `podman-compose` + the GPU passthrough bits — `scripts/install-container-deps.sh` installs all of it). All toolchain lives inside the image. Auto-detects your GPU and pins the right CUDA / ROCm base.
- "NVIDIA only, native binary, no SYCL/AdaptiveCpp" → `cuda-only` branch (path 2). Three host packages — `cmake` + `build-essential` + the CUDA Toolkit. No LLVM/lld/AdaptiveCpp install. Smaller dep surface than main; same end result for NVIDIA users.
- "Full build — AMD / Intel / CPU support, parity tests on the host" → `install-deps.sh` (path 3). Auto-installs cmake, lld, LLVM 18, AdaptiveCpp from source. ~30-45 min first-time setup.
Three ways to get the dependencies in place, easiest first:
Easiest path — scripts/build-container.sh does host-side GPU
probing and feeds the right env vars to compose build. If you're
starting from a fresh host, scripts/install-container-deps.sh
installs the engine + GPU passthrough bits first (podman, plus
`nvidia-container-toolkit` or the video/render groups, as appropriate; no native CUDA / ROCm / LLVM / AdaptiveCpp on the host):
./scripts/install-container-deps.sh # one-time: engine + GPU passthrough
./scripts/build-container.sh # auto: nvidia-smi → cuda, rocminfo → rocm
podman compose run --rm cuda plot -k 28 -n 10 -f <farmer-pk> -c <pool-contract> -o /out

The script handles a handful of host-side decisions that bare `podman compose build` can't:
- Vendor pick (cuda / rocm / intel / cpu) from `nvidia-smi` / `rocminfo`, or `--gpu cpu` to force CPU.
- Multi-GPU fat binary (e.g. `CUDA_ARCH="61;86"` on a 1070+3060 rig) — compose alone defaults to a single arch.
- Pascal/Volta auto-pin to `nvidia/cuda:12.9.1-devel-ubuntu24.04` when min arch < 75. CUDA 13 dropped sub-Turing codegen, so a Pascal user without this pin hits a build-time `Unsupported gpu architecture 'compute_61'` error inside the container.
- AMD `ACPP_GFX` extract from rocminfo + the RDNA1 (gfx1010 → gfx1013) workaround for Radeon Pro W5700.
- `--no-cache` pass-through to force a clean rebuild after a toolchain bump.
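The vendor pick in the first bullet boils down to probe order. A minimal stand-in mirroring the documented behavior (nvidia-smi wins, then rocminfo, else CPU) — not the script's actual code:

```shell
# Probe NVIDIA first, then AMD, else fall back to the CPU image —
# mirroring the documented nvidia-smi -> cuda, rocminfo -> rocm order.
pick_vendor() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif command -v rocminfo >/dev/null 2>&1 && rocminfo >/dev/null 2>&1; then
    echo rocm
  else
    echo cpu
  fi
}
pick_vendor
```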
You CAN run podman compose build directly — it just means setting
those env vars yourself. The compose YAML's defaults are conservative
(CUDA 13.0, sm_89, no AMD target without ACPP_GFX), so plain
podman compose build cuda only "just works" on Turing-or-newer
NVIDIA hosts. Anything else needs the script or the equivalent
manual env:
compose.yaml defines four vendor-specific services
sharing one Containerfile; the script just runs
compose build against whichever matches your hardware. Override
manually if you prefer:
# NVIDIA (default sm_89; override via $CUDA_ARCH=120 etc.)
podman compose build cuda
# AMD ROCm — set $ACPP_GFX from `rocminfo | grep gfx`.
ACPP_GFX=gfx1031 podman compose build rocm # Navi 22
ACPP_GFX=gfx1100 podman compose build rocm # Navi 31 (default)
# Intel oneAPI (experimental, untested).
podman compose build intel
# CPU-only (no GPU; AdaptiveCpp OpenMP backend; ~400 MB image).
# Plotting is 1-2 orders of magnitude slower than GPU — see CPU bullet
# under Hardware compatibility for the use case.
podman compose build cpu

Plot files land in ./plots/ on the host. The container also bundles
the parity tests (sycl_sort_parity, sycl_g_x_parity, etc.) under
/usr/local/bin/ for quick first-port validation on a new GPU:
podman compose run --rm --entrypoint /usr/local/bin/sycl_sort_parity rocm

First build is ~15-30 min (AdaptiveCpp + LLVM 18 compile from source); subsequent rebuilds reuse the cached layers. GPU performance inside the container is identical to native — kernels run on real hardware via the engine's GPU pass-through:
- NVIDIA: requires `nvidia-container-toolkit` on the host. For Docker users, also run once after install:
  sudo apt install nvidia-container-toolkit
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker
  Podman 5.x with CDI works without the runtime-configure step.
- AMD: `/dev/kfd` + `/dev/dri` device files. The compose `rocm` service handles this automatically; for bare `podman`/`docker run` pass `--device /dev/kfd --device /dev/dri --group-add video`.
AMD GPUs involve three points of friction that all need handling. None are optional on most hosts, and getting any one wrong tends to fail silently or in confusing ways:
- `ACPP_GFX` must be set to your GPU's gfx target. The kernels are AOT-compiled for a specific amdgcn ISA at build time. If the wrong arch is baked in, HIP loads the fatbinary without complaint but the kernels execute as silent no-ops at runtime — sort returns input unchanged, AES match finds zero matches, plots look valid but contain non-canonical proofs that won't qualify against real challenges. `compose.yaml` defaults `ACPP_GFX` to a placeholder string that AdaptiveCpp's HIP backend rejects loudly at build time, so an unset value fails fast with the placeholder visible in the error rather than silently using a default like gfx1100. Common values (`rocminfo | grep gfx` to confirm yours):
  - gfx1030 — RDNA2 Navi 21 (RX 6800 / 6800 XT / 6900 XT)
  - gfx1031 — RDNA2 Navi 22 (RX 6700 XT / 6700 / 6800M)
  - gfx1100 — RDNA3 Navi 31 (RX 7900 XTX / XT)
  - gfx1101 — RDNA3 Navi 32 (RX 7800 XT / 7700 XT)
- Rootful + `--privileged` for runs. Rootless podman's default seccomp filter + capability set blocks some of the KFD ioctls `libhsa-runtime64` needs during DMA setup. Without them you get a segfault deep inside the HSA runtime on the very first host→device copy, even though `rocminfo` works fine. Builds don't need GPU access and can stay rootless if you prefer.
- `sudo` strips environment variables by default, including the `ACPP_GFX` you set in your shell. So a bare `sudo podman compose build rocm` loses it. Either invoke the build script (it sets the var inside the sudo'd shell where compose can see it) or pass the var through explicitly.
The recommended invocation pair, shortest first:
# Build (autodetects ACPP_GFX from rocminfo — works under sudo too):
sudo ./scripts/build-container.sh
# Run a single test plot at k=22:
sudo podman run --rm --privileged \
--device /dev/kfd --device /dev/dri \
-v $PWD/plots:/out xchplot2:rocm \
test 22 <plot_id_hex> 2 0 0 -G -o /out
# Run real plotting:
sudo podman run --rm --privileged \
--device /dev/kfd --device /dev/dri \
-v $PWD/plots:/out xchplot2:rocm \
plot -k 28 -n 10 -f <farmer-pk> -c <pool-contract> -o /out

If sudo doesn't carry /opt/rocm/bin on your distro and the build
script can't find rocminfo, fall back to one of:
sudo -E ./scripts/build-container.sh # preserve your shell PATH
sudo ACPP_GFX=gfx1031 ./scripts/build-container.sh # explicit, no rocminfo needed

Or skip the script entirely:
sudo ACPP_GFX=gfx1031 podman compose build rocm

For convenience, drop a wrapper at ~/.local/bin/xchplot2-amd:
#!/bin/bash
exec sudo podman run --rm --privileged \
--device /dev/kfd --device /dev/dri \
-v "$PWD/plots:/out" xchplot2:rocm "$@"

Then xchplot2-amd plot -k 28 -n 10 -f ... -c ... -o /out just works.

./scripts/install-deps.sh # auto-detects distro + GPU vendor

Installs the toolchain via the system package manager (Arch, Ubuntu /
Debian, Fedora) plus AdaptiveCpp from source into /opt/adaptivecpp.
GPU vendor is auto-detected: nvidia-smi / rocminfo first,
/sys/class/drm PCI IDs as fallback (so fresh installs without driver
tools still work). On a no-GPU host (CI / build box) the script
errors out — pass --gpu nvidia to install the toolchain anyway.
--gpu amd forces the AMD path on dual-vendor hosts. Intel detection
currently errors with a hint pointing at --gpu nvidia (the SYCL
toolchain JITs onto Intel via AdaptiveCpp's generic SSCP target) or
the container. Pass --no-acpp to skip the AdaptiveCpp build and
let CMake fall back to FetchContent.
If you'd rather install dependencies yourself, the toolchain is:
| Dep | Notes |
|---|---|
| AdaptiveCpp 25.10+ | SYCL implementation. CMake auto-fetches it via FetchContent if find_package(AdaptiveCpp) fails — first build adds ~15-30 min. Disable with -DXCHPLOT2_FETCH_ADAPTIVECPP=OFF if you want a hard error. |
| CUDA Toolkit 12+ (headers) | Required on every build path because AdaptiveCpp's half.hpp includes cuda_fp16.h. nvcc itself only runs when XCHPLOT2_BUILD_CUDA=ON. Default is vendor-aware — ON for NVIDIA GPUs, OFF for AMD / Intel GPUs (even if nvcc is installed), falling through to nvcc-presence only when no GPU is probed (CI / container). Override with the env var. |
| LLVM / Clang ≥ 18 | clang, lld (AdaptiveCpp's CMake requires ld.lld), plus the libclang dev packages. install-deps.sh installs all of them; manual installs need to add lld-18 (apt) / lld (dnf, pacman) explicitly. |
| C++20 compiler | clang ≥ 18 or gcc ≥ 13. |
| CMake ≥ 3.24, Ninja, Python 3 | build tools. |
| Boost.Context, libnuma, libomp | AdaptiveCpp runtime deps. |
| Rust toolchain (stable) | for keygen-rs and cargo install. |
pos2-chip and FSE are auto-fetched at CMake configure time
(FetchContent); override -DPOS2_CHIP_DIR=/abs/path for a local
checkout.
For non-NVIDIA targets, the build also probes:
- ROCm 6+ (`rocminfo`): if found, sets `ACPP_TARGETS=hip:gfxXXXX`.
- Intel oneAPI (Level Zero / compute-runtime): manual `ACPP_TARGETS`.
cargo install --git https://github.com/Jsewill/xchplot2

build.rs auto-detects the local GPU's compute capability by querying
nvidia-smi --query-gpu=compute_cap and builds for only that
architecture. That keeps the binary small and the build fast when the
install and the target GPU are the same machine.
If auto-detection fails (no nvidia-smi in PATH, or
nvidia-smi can't see a GPU — common when building inside a container
or on a headless build host that lacks the CUDA driver), the build
falls back to sm_89. Note that arch-detect picks which CUDA arch —
whether CUDA TUs build at all is a separate vendor-aware decision
(see XCHPLOT2_BUILD_CUDA in Environment variables).
If you need to target a GPU that isn't the one doing the build — or if
you want a single "fat build" binary that covers multiple
architectures — override with $CUDA_ARCHITECTURES:
# Fat build for Ada (4090) and Blackwell (5090):
CUDA_ARCHITECTURES="89;120" cargo install --git https://github.com/Jsewill/xchplot2
# Single target (e.g. Turing 2080 Ti):
CUDA_ARCHITECTURES=75 cargo install --git https://github.com/Jsewill/xchplot2

Common values: 61 GTX 10-series, 70 Volta, 75 Turing, 80 A100,
86 RTX 30-series, 89 RTX 40-series, 90 H100, 120 RTX 50-series.
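If you want to see what build.rs will pick before installing, the same `nvidia-smi` query can be run by hand. The `cap_to_arch` helper below is ours, not part of the tool:

```shell
# nvidia-smi prints compute_cap as e.g. "8.9"; the arch number is that
# string with the dot stripped.
cap_to_arch() { printf '%s' "$1" | tr -d '. '; }

# Guarded so this also runs cleanly on hosts without an NVIDIA driver.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
  echo "CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
fi
```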
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

pos2-chip is auto-fetched via FetchContent; override with
-DPOS2_CHIP_DIR=/abs/path/to/pos2-chip to point at a local checkout.
Outputs:
build/tools/xchplot2/xchplot2build/tools/parity/{aes,xs,t1,t2,t3}_parity— bit-exact CPU/GPU tests
Two supported paths — native main doesn't work because AdaptiveCpp
has hard Linux-isms (libnuma, pthreads, LLVM SSCP) that fall apart on
Windows. Jump to the relevant subsection below:
- Native Windows build (`cuda-only` branch) — recommended NVIDIA path.
- Native Windows build — SYCL path (adventurous) — AMD/Intel/cross-vendor, untested.
NVIDIA only → use the
cuda-only
branch. Pure MSVC + CUDA Toolkit + Rust, no SYCL runtime involved.
See that branch's README for the VS 2022 / Windows SDK / LIB
troubleshooting (the LNK1181: kernel32.lib and friends).
AMD or Intel, or if you just want the main code path → run
under WSL2. WSL2 is a full Linux environment, so every install
option in this README works there unchanged — cargo install,
scripts/install-deps.sh, or the container (section 1 above).
Enable WSL2 once with wsl --install in an elevated PowerShell.
GPU access in WSL2:
- NVIDIA: install the latest "NVIDIA GPU Driver for Windows", nothing else — CUDA shows up inside WSL2 automatically.
- AMD: ROCm 6.1+ supports a limited card list on WSL2 (RX 7900 XTX, Radeon Pro W7900, specific Instincts). Follow AMD's "Install ROCm on WSL" guide.
- Intel: oneAPI on WSL2 via the Intel Linux graphics driver.
Once the GPU is visible from a WSL2 shell (nvidia-smi, rocminfo,
or sycl-ls), proceed with the native Linux instructions above.
Full walkthrough for the NVIDIA native path, repeated here so you don't have to flip between READMEs. Prerequisites:
- Windows 10 21H2+ or Windows 11, x64
- Visual Studio 2022 Community with the "Desktop development with C++" workload. That workload bundles MSVC + the Windows SDK; the SDK is non-optional because it ships kernel32.lib / user32.lib / etc. that link.exe consumes. If you've trimmed the installer to "C++ build tools" only, open Visual Studio Installer → Modify → Individual components and tick the latest Windows 11 SDK before retrying.
- CUDA Toolkit 12.0+ — install after Visual Studio so the CUDA installer wires up the MSBuild integration. 12.8+ required for RTX 50-series (Blackwell, sm_120).
- Rust using the MSVC toolchain (`rustup default stable-x86_64-pc-windows-msvc`).
- CMake 3.24+ and Git for Windows.
Launch the x64 Native Tools Command Prompt for VS 2022 from the
Start menu — there are several similarly-named prompts (x86 /
x86_64 / 2019 / 2022); the one that matters is the x64 for 2022.
That prompt is the one that sets LIB, INCLUDE, and PATH so
cl.exe, link.exe, nvcc, and cmake all see each other plus
the Windows SDK. A plain cmd / PowerShell / Windows Terminal tab
does not do this — running cargo install from one of those
produces LNK1181: cannot open input file 'kernel32.lib' at the
first link step.
Quick sanity check in the prompt:
where link.exe
echo %LIB%

%LIB% should include a ...\Windows Kits\10\Lib\...\um\x64
entry. If it doesn't, you're in the wrong prompt or the Windows SDK
component isn't installed.
Build:
set CUDA_ARCHITECTURES=89
cargo install --git https://github.com/Jsewill/xchplot2 --branch cuda-only

Or for a local checkout you can iterate on:
git clone -b cuda-only https://github.com/Jsewill/xchplot2
cd xchplot2
set CUDA_ARCHITECTURES=89
cargo install --path .

Set CUDA_ARCHITECTURES to match your card (see the list above).
PowerShell users: use $env:CUDA_ARCHITECTURES = "89" instead of
set. The CMake path (cmake -B build -S . && cmake --build build)
also works inside the same Native Tools prompt if you prefer that
over cargo install.
Strongly recommend WSL2 first (see the top of this section). This subsection exists because the path is in principle buildable on native Windows; in practice it's days of build-system tinkering without hardware the maintainers can iterate on. Not validated by us. File an issue with your findings.
What you're signing up for: AdaptiveCpp, built from source on
Windows, pointed at either AMD HIP SDK for Windows (for AMD) or
the CUDA Toolkit (for NVIDIA through SYCL, if you want the
main branch's cross-vendor code path on NVIDIA instead of
cuda-only's CUB one). xchplot2's CMake then finds that install
via find_package(AdaptiveCpp) and builds normally. AdaptiveCpp's
FetchContent fallback is not viable on native Windows — its own
CMakeLists assumes Linux-isms (libnuma, pthreads) that fall apart.
Pre-install is mandatory.
Prerequisites (on top of the cuda-only prereqs above — MSVC, Windows SDK, Rust, CMake, Git):
- LLVM 16–20 with Clang + LLD + the CMake development package (`LLVMConfig.cmake` / `ClangConfig.cmake`). Version coverage of Windows binary installers is patchy for these components; a self-built LLVM is usually the path of least resistance. See AdaptiveCpp's Windows install guide for the currently-recommended source.
- AMD HIP SDK for Windows (for the AMD target) from AMD's HIP SDK download page. AMD officially flags it as preview: limited card list, different device-library layout vs Linux ROCm, runtime coverage varies per GPU.
- CUDA Toolkit 12+ (for the NVIDIA-via-SYCL target). Same
installer as the
`cuda-only` path above.
Rough build sequence from a clean x64 Native Tools Command Prompt for VS 2022 (paths are indicative — match your installs):
:: 1. Build AdaptiveCpp
git clone --branch v25.10.0 https://github.com/AdaptiveCpp/AdaptiveCpp.git
cd AdaptiveCpp
cmake -B build -S . -G Ninja ^
-DCMAKE_BUILD_TYPE=Release ^
-DCMAKE_INSTALL_PREFIX=C:\opt\adaptivecpp ^
-DLLVM_DIR=C:\path\to\llvm\lib\cmake\llvm ^
-DWITH_CUDA_BACKEND=OFF ^
-DWITH_HIP_BACKEND=ON ^
-DROCM_PATH="C:\Program Files\AMD\ROCm\6.1"
cmake --build build --parallel
cmake --install build
:: 2. Build xchplot2 main against the install
cd \path\to\xchplot2
:: CMAKE_PREFIX_PATH only needed if you installed AdaptiveCpp to a
:: non-default Windows path. The build's auto-discovery only covers
:: Linux's /opt/adaptivecpp — Windows users tell CMake explicitly.
set CMAKE_PREFIX_PATH=C:\opt\adaptivecpp
set ACPP_TARGETS=hip:gfx1101
set XCHPLOT2_BUILD_CUDA=OFF
cargo install --path .

Flip WITH_HIP_BACKEND ↔ WITH_CUDA_BACKEND and set
ACPP_TARGETS=cuda:sm_XX for the NVIDIA-through-SYCL variant.
Failure modes you should expect to triage:
- Missing LLVM CMake modules — source-built LLVM with `LLVM_INSTALL_UTILS=ON` and the clang / clang-tools-extra projects enabled is the reliable recipe.
- Generic SSCP compiler disabled (`DEFAULT_TARGETS` warning during AdaptiveCpp configure) — harmless if you set `ACPP_TARGETS=hip:gfxXXXX` explicitly at xchplot2's configure.
- `ROCM_PATH` mismatch — AMD's Windows installer versions the directory (`C:\Program Files\AMD\ROCm\6.1\`); match it exactly.
- Clean build, runtime kernel failures — the HIP SDK for Windows preview doesn't cover every GPU the Linux ROCm path does. Run `scripts/test-multi-gpu.sh` / `xchplot2 test 22 ...` with a k=22 plot first and `xchplot2 verify` the result before committing a large batch.
Seriously, try WSL2 first.
xchplot2 plot -k 28 -n 10 \
-f <farmer-pk> \
-c <pool-contract-address> \
-o <output-dir>

Pool variants: -p <pool-pk> or --pool-ph <pool-ph>. Other common
flags: -s <strength>, -T testnet, -S <seed> for reproducible runs,
-v verbose. Full help: xchplot2 -h.
For long batches, --skip-existing skips plots whose output file is
already a complete .plot2 (magic bytes + non-trivial size), and
--continue-on-error logs per-plot failures and keeps going instead of
aborting the whole run. Both flags work for plot and batch modes.
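The completeness test behind `--skip-existing` can be approximated like this. This is a stand-in only: the real check inspects the `.plot2` magic bytes (which we don't reproduce here), so this sketch substitutes a size floor:

```shell
# Hypothetical stand-in for the --skip-existing completeness check.
# The real plotter checks magic bytes + non-trivial size; this sketch
# uses an arbitrary 1 MiB size floor only.
looks_complete() {
  f=$1
  [ -f "$f" ] && [ "$(wc -c < "$f")" -ge 1048576 ]
}
```

Usage: `looks_complete /mnt/plots/foo.plot2 && echo skip`.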
Plots are written to <name>.plot2.partial and atomically renamed on
completion, so a crash / SIGINT / ENOSPC mid-write never leaves a
malformed plot at the destination. A first Ctrl-C asks the plotter to
finish the plot in flight and stop; a second hard-kills.
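The crash-safety described above is the classic write-to-temp-then-rename pattern. A minimal sketch (the writer here is a hypothetical stand-in; rename atomicity requires source and destination on the same filesystem):

```shell
# Write into <name>.partial, rename to <name> only on success.
# rename(2) is atomic within one filesystem, so readers never observe
# a half-written file at the final path.
safe_write() {
  dest=$1
  tmp="$dest.partial"
  printf 'fake plot bytes' > "$tmp" || return 1  # stand-in for the real writer
  mv "$tmp" "$dest"
}
```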
Both <plot-index> and <meta-group> are v2 PoS fields and default to 0.
<plot-index> (u16) is the within-group identifier; plot -n N
uses it as the base and increments per plot (so -i 0 -n 1000
produces plots with plot_index 0..999).
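The index assignment is plain base-plus-increment; a sketch of the documented behavior (illustrative helper, not the plotter's code):

```shell
# `plot -i BASE -n N` uses BASE as the first plot_index and increments
# once per plot, so -i 0 -n 1000 yields indices 0..999.
plot_indices() {
  base=$1; n=$2; i=0
  while [ "$i" -lt "$n" ]; do
    echo $((base + i))
    i=$((i + 1))
  done
}
plot_indices 0 3
```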
<meta-group> (u8) is a challenge-isolation boundary — plots with
different meta_group values are guaranteed never to pass the same
challenge.
The PoS2 spec defines a grouped-plot file layout (multiple plots
interleaved into one container per storage device, for harvester
seek amortization), but the on-disk format is not yet defined
upstream in pos2-chip / chia-rs. xchplot2 currently produces one
.plot2 file per plot — a stopgap pending those upstream
decisions. When the grouped layout lands, the auto-incrementing
<plot-index> above is the per-plot within-group identifier it
will expect.
xchplot2 devices prints id, name, backend, VRAM, compute-unit count,
and which sort path each device will use (CUB on cuda-backend devices
when this build links CUB, SortSycl otherwise) — the printed [N]
index is the value --devices N accepts:
$ xchplot2 devices
Visible devices (2 GPU + 1 CPU):
[0] NVIDIA GeForce RTX 4090 backend=cuda vram=24076 MB CUs=128 sort:CUB
[1] AMD Radeon Pro W5700 backend=hip vram= 8176 MB CUs=36 sort:SYCL
[cpu] Host CPU plotter backend=omp threads=32 sort:SYCL (1-2 orders slower than GPU)
Use `--devices N` (id) for a specific GPU,
`--devices gpu` for every GPU,
`--devices cpu` for the host CPU only,
`--devices all` for every GPU + CPU,
or any comma combination (e.g. `0,2,cpu`).
Both plot and batch accept --devices <SPEC> to fan plots out
across multiple devices — one worker thread per device, each with its
own buffer pool and writer channel. Plots are partitioned round-robin,
so a batch of 10 plots on 2 GPUs sends plots 0/2/4/6/8 to the first
GPU and 1/3/5/7/9 to the second.
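The static partition is plain modulo arithmetic — a sketch of the documented assignment (illustrative only, not the worker code):

```shell
# Plot i goes to worker i % num_devices, matching the 0/2/4/6/8 vs
# 1/3/5/7/9 example in the text.
assign() {
  n=$1; devices=$2; i=0
  while [ "$i" -lt "$n" ]; do
    echo "plot $i -> device $((i % devices))"
    i=$((i + 1))
  done
}
assign 10 2
```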
# Every visible GPU — enumerated at runtime. No CPU worker.
xchplot2 plot --k 28 --num 10 -f <farmer-pk> -c <pool-contract> \
--out /mnt/plots --devices gpu
# Every visible GPU PLUS a CPU worker on the same batch.
xchplot2 plot ... --devices all
# Only these specific GPU ids (sorted, deduplicated).
xchplot2 plot ... --devices 0,2,3
# Explicit single id (same as omitting the flag on a single-GPU host).
xchplot2 plot ... --devices 0
# CPU-only: AdaptiveCpp OpenMP backend (slow). Use the `cpu` token in
# --devices, or the standalone --cpu flag (equivalent on its own).
xchplot2 plot ... --devices cpu
xchplot2 plot ... --cpu
# Mix tokens: specific GPUs + CPU.
xchplot2 plot ... --devices 0,1,cpu

CPU plotting is 1-2 orders of magnitude slower than GPU — meant for GPU-less hosts, headless CI, or as an extra background worker. Don't expect GPU-grade throughput from a CPU worker on a heterogeneous batch.
Omitted flag = single device via the default SYCL / CUDA selector — identical to pre-multi-GPU behavior, zero regression risk.
Caveats for v1:
- Static round-robin partition. If your GPUs differ in speed the batch finishes only as fast as the slowest worker's slice; use `--devices` to pick matched cards when that matters.
- Each worker gets its own ~4 GB pinned host pool, so host RAM scales linearly. A 4-GPU rig pins ~16 GB — size accordingly.
- The workers share stderr (line-buffered, atomic per-fprintf) so log lines from different GPUs may interleave. Fine for progress, not for parsing.
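The pinned-RAM scaling in the second caveat, as arithmetic (the ~4 GB per GPU worker figure comes from the text; treat the result as an estimate, not a guarantee):

```shell
# Estimate pinned host RAM for an N-GPU rig at ~4 GB per worker.
pinned_gb() { echo $(( 4 * $1 )); }
pinned_gb 4   # 4-GPU rig -> 16
```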
Smoke test: scripts/test-multi-gpu.sh exercises argument parsing
(works on any host, even single-GPU) and, when 2+ GPUs are visible,
runs a live k=22 plot across --devices 0,1.
xchplot2 test <k> <plot-id-hex> [strength] ... # single plot, raw inputs
xchplot2 batch <manifest.tsv> [-v] [--skip-existing] [--continue-on-error]
[--devices <SPEC>]
xchplot2 verify <file.plot2> [--trials N] # run N random challenges
xchplot2 parity-check [--dir PATH] # CPU↔GPU regression screen

verify opens a .plot2 through pos2-chip's CPU prover and runs N
(default 100) random challenges. Zero proofs across a reasonable sample
strongly indicates a corrupt plot; the command exits non-zero in that
case. Intended as a quick sanity check before farming a newly built
batch — not a replacement for chia plots check.
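Before farming a fresh batch, that verify pass can be scripted over a directory. A sketch assuming `xchplot2` is on PATH (the `check_plots` helper name is ours):

```shell
# Run `xchplot2 verify` on every .plot2 in a directory; print failures.
# verify exits non-zero on a plot with zero passing challenges.
check_plots() {
  dir=$1
  for f in "$dir"/*.plot2; do
    [ -e "$f" ] || continue   # empty-glob guard
    xchplot2 verify "$f" --trials 20 >/dev/null 2>&1 || echo "FAILED: $f"
  done
}
```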
parity-check execs every *_parity binary in --dir (default
./build/tools/parity) and summarizes PASS/FAIL with per-test wall
time. Use after a refactor or driver update to confirm CPU↔GPU
agreement is still bit-exact across aes / xs / t1 / t2 / t3 /
plot_file. Requires cmake --build to have produced the parity
binaries first.
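Per the description, the core of what parity-check does amounts to the loop below. This is a stand-in, not the actual implementation, and omits the per-test wall time:

```shell
# Execute every *_parity binary in a directory and tally PASS/FAIL
# by exit status, like the parity-check summary described above.
run_parity() {
  dir=$1; pass=0; fail=0
  for t in "$dir"/*_parity; do
    [ -x "$t" ] || continue
    if "$t" >/dev/null 2>&1; then pass=$((pass + 1)); else fail=$((fail + 1)); fi
  done
  echo "PASS=$pass FAIL=$fail"
}
```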
- Listing visible GPUs: `xchplot2 devices` prints id, name, backend, VRAM, compute-unit count, and which sort path each device will use (CUB on cuda-backend devices when this build links CUB; SortSycl otherwise). Use the printed `[N]` index with `--devices N` for plot / batch.
- Hybrid hosts (NVIDIA + AMD/Intel on the same box): a single binary handles all visible GPUs. `xchplot2 plot --devices gpu` spawns a worker per GPU (use `--devices all` to also add a CPU worker); each worker picks the right sort backend at queue construction (CUB on NVIDIA, hand-rolled SYCL radix on AMD/Intel) via the runtime dispatcher in SortDispatch.cpp. No rebuild required to add a second-vendor card.
- `[AdaptiveCpp Warning] [backend_loader] Could not load library: /opt/adaptivecpp/lib/hipSYCL/librt-backend-cuda.so (libcudart.so.11.0: cannot open shared object file)`: cosmetic only — AdaptiveCpp built with CUDA backend support but no CUDA runtime to load. Happens when AdaptiveCpp was installed out-of-band rather than via `scripts/install-deps.sh --gpu amd` (which sets `-DCMAKE_DISABLE_FIND_PACKAGE_CUDA=TRUE`). To suppress without a rebuild: `export ACPP_VISIBILITY_MASK="hip;omp"` so AdaptiveCpp skips the CUDA backend probe entirely.
- `T1 match produced 0 entries` on RDNA1 (gfx1010/gfx1011/gfx1012, including the Radeon Pro W5700 / RX 5700 XT). The community gfx1013 AOT-spoof default was observed to silently compile no-op kernel stubs on at least one W5700 + ROCm 6 + AdaptiveCpp 25.10 host. The default flipped to `ACPP_TARGETS=generic` (SSCP JIT) in recent main; `cargo install --force` past commit d939ee8 restores correct behavior. To restore the old spoof, `XCHPLOT2_FORCE_GFX_SPOOF=1 cargo install ...`. The startup self-test in SyclBackend::queue() catches the no-op-kernel case at queue construction with a clear exception, so this surfaces immediately rather than as empty pipeline output minutes in.
- `CUB ... invalid argument` mid-pipeline, or `sycl_backend::queue: device id 0 out of range (found 0 usable GPU device(s))` with `--devices N` while the default selector finds a GPU: pre-762fde2 symptoms of CUB-only sort being dispatched against an AMD/Intel device (or being filtered out of the device list). The runtime sort dispatcher fixes both — `git pull && cargo install --path . --force` to upgrade.
- Deep-pipeline diagnostics: set `POS2GPU_T1_DEBUG=1` for verbose per-stage dumps (Xs gen / sort intermediates, T1 match input/output samples, AES T-table sanity). Useful when the symptom isn't on the list above and you want to localize where the data goes wrong.
| Variable | Effect |
|---|---|
| `XCHPLOT2_BUILD_CUDA=ON\|OFF` | Override the build-time CUB / nvcc-TU switch. Default is vendor-aware (NVIDIA → ON; AMD / Intel → OFF; no GPU → nvcc presence). Force OFF on dual-toolchain hosts (CUDA + ROCm) where you want the SYCL-only build. |
| `XCHPLOT2_STREAMING=1` | Force the low-VRAM streaming pipeline even when the pool would fit. |
| `XCHPLOT2_STREAMING_TIER=plain\|compact\|minimal` | Override the streaming-tier auto-pick (plain = ~7.3 GB peak, no parks; compact = ~5.2 GB peak, full parks + N=2 T2 match tiling; minimal = ~3.76 GB peak with full host-pinned slicing of T1/T3 match + tiled CUB outputs in all sort phases + tiled Xs gen/sort/pack — targets 5 GiB+ cards). Equivalent CLI flag: `--tier`. |
| `POS2GPU_MAX_VRAM_MB=N` | Cap the pool/streaming VRAM query to N MB (exercise the streaming fallback). |
| `POS2GPU_STREAMING_STATS=1` | Log every streaming-path `malloc_device` / free. |
| `POS2GPU_POOL_DEBUG=1` | Log pool allocation sizes at construction. |
| `POS2GPU_PHASE_TIMING=1` | Per-phase wall-time breakdown (Xs / sort / T1 / T2 / T3) on stderr. |
| `ACPP_GFX=gfxXXXX` | AMD only — required at build time; sets the AOT target for amdgcn ISA. |
| `ACPP_TARGETS=...` | Override AdaptiveCpp target selection (defaults: NVIDIA `generic`, AMD `hip:$ACPP_GFX`). |
| `CUDA_ARCHITECTURES=sm_XX` | Override the CUDA arch autodetected from `nvidia-smi`. |
| `CUDA_PATH=/path/to/cuda` | Override the CUDA Toolkit root for linking (defaults: `/opt/cuda`, `/usr/local/cuda`). Useful on JetPack / non-standard installs. |
| `CUDA_HOME=/path/to/cuda` | Fallback for `CUDA_PATH` — same effect. |
| `POS2_CHIP_DIR=/path` | Build-time: point at a local pos2-chip checkout instead of FetchContent. |
| `XCHPLOT2_TEST_GPU_COUNT=N` | Override `scripts/test-multi-gpu.sh`'s auto-detected GPU count (forces run / skip without consulting `nvidia-smi`). |
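As a concrete reading of the vendor-aware default in the `XCHPLOT2_BUILD_CUDA` row, here is a small Python sketch. The probe result strings and the function itself are hypothetical; the real switch lives in the build scripts, but the precedence (explicit env override, then vendor, then nvcc presence) is as documented above.

```python
import os
import shutil


def build_cuda_default(gpu_vendor):
    """Sketch of the vendor-aware CUB / nvcc-TU default.

    gpu_vendor is a hypothetical probe result: "nvidia", "amd",
    "intel", or None when no GPU is detected.
    """
    override = os.environ.get("XCHPLOT2_BUILD_CUDA")
    if override is not None:                 # explicit env var wins
        return override.upper() == "ON"
    if gpu_vendor == "nvidia":
        return True                          # CUB fast path
    if gpu_vendor in ("amd", "intel"):
        return False                         # SYCL-only build
    # No GPU found: fall back to nvcc presence on PATH.
    return shutil.which("nvcc") is not None
```

On a dual-toolchain host (CUDA + ROCm) the table's advice corresponds to exporting `XCHPLOT2_BUILD_CUDA=OFF`, which this sketch honors before any vendor probing.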
v2 (CHIP-48) farming in stock chia-blockchain is presently unfinished upstream — services aren't wired into the farmer group, a message handler's signature doesn't match its decorator, `ProofOfSpace.challenge` is computed from the wrong input, and the dependency pin on `chia_rs` excludes the 0.42 release where `compute_plot_id_v2` lives. `contrib/testnet-farming.patch` is a minimal self-contained fix-up that gets a private testnet running end-to-end:

```
git clone https://github.com/Chia-Network/chia-blockchain
cd chia-blockchain
git checkout 39f8bec88   # 2.7.0 Checkpoint Merge
git apply /path/to/xchplot2/contrib/testnet-farming.patch
```

The patch's header comment describes each hunk. None of the changes are xchplot2-specific — they're the farmer / harvester / daemon pieces any v2 plot needs for farming, regardless of who produced it.
```
src/gpu/            GPU kernels — AES, Xs, T1, T2, T3.
                    CUDA path: .cu files via nvcc + CUB sort.
                    SYCL path: matching .cpp files via
                    AdaptiveCpp + hand-rolled LSD radix.
src/host/
├── GpuPipeline             Xs → T1 → T2 → T3 device orchestration;
│                           pool + streaming (low-VRAM) variants
├── GpuBufferPool           persistent device + 2× pinned host pool
├── BatchPlotter            producer / consumer batch driver
└── PlotFileWriterParallel  sole TU touching pos2-chip headers
tools/xchplot2/     CLI: plot / test / batch
tools/parity/       CPU↔GPU bit-exactness tests
keygen-rs/          Rust staticlib: plot_id_v2, BLS HD, bech32m
```
PoS2 plots are k=28 by spec. Four code paths, dispatched automatically based on available VRAM at batch start:
- Pool path (~11 GB device + ~4 GB pinned host; 12 GB+ cards reliably). The persistent buffer pool is sized worst-case and reused across plots in `batch` mode for amortised allocator cost and double-buffered D2H. Xs sort's keys_a slot aliases the d_storage tail (idle during Xs gen + sort), trimming pair_b's worst case from `max(cap·12, 4·N·u32 + cub)` to `max(cap·12, 3·N·u32 + cub)` — saves ~1 GiB at k=28. Targets: RTX 4090 / 5090, A6000, H100, RTX 4080 (16 GB), and 12 GB cards like RTX 3060 / RX 6700 XT.
- Plain streaming (~7.3 GB peak + 128 MB margin; ≥ 7.42 GiB free at k=28). Allocates per-phase and frees between phases, but keeps large intermediates (`d_t1_meta`, `d_t1_keys_merged`, `d_t2_meta`, `d_t2_xbits`, `d_t2_keys_merged`) alive across their idle windows instead of parking them on pinned host. T2 match runs as a single full-cap pass (N=1). Used on 10-11 GB cards that can't fit the pool but have headroom above the compact floor. ~400 ms/plot faster than compact at k=28 because there are no park/rehydrate PCIe round-trips.
- Compact streaming (~5.2 GB peak + 128 MB margin; ≥ 5.33 GiB free at k=28). All three match phases (T1/T2/T3) are tiled N=2 across disjoint bucket ranges with half-cap device staging and D2H-to-pinned-host between passes. T1 + T2 sorts are tiled (N=2 and N=4) with merge trees, and `d_t1_meta`, `d_t2_meta`, and the `*_keys_merged` buffers are parked on pinned host across their sort phases and JIT-H2D'd only for the next consumer. Xs is inlined as gen → sort → pack with separate-allocation scratch so keys_a + vals_a can be freed right after the CUB sort. Peak at k=28 is 5200 MB (measured on sm_89); per-phase live maxes:

  | Phase | Peak (MB) |
  |---|---|
  | Xs | 4128 |
  | T1 match | 5168 |
  | T1 sort | 5200 |
  | T2 match | 5200 |
  | T2 sort | 5200 |
  | T3 match | 5200 |
  | T3 sort | 4228 |

  A BatchPlotter preflight rejects cards reporting less than `streaming_peak_bytes(k) + 128 MB` free before any queue work, so mid-pipeline OOM is impossible on supported configurations. Practical targets: 6 GB cards on the edge (card-dependent; an RTX 2060 typically has ~5.5 GiB free, leaving ~170 MB slack over the 5328 MB requirement), 8 GB cards comfortable, 10 GB and up ample. Log the full alloc trace with `POS2GPU_STREAMING_STATS=1`.
- Minimal streaming (~3.76 GB peak + 128 MB margin; ≥ 3.80 GiB free at k=28). Layered cuts on top of compact:
  - N=8 T2 match staging. cap/8 ≈ 570 MB vs compact's cap/2 ≈ 2280 MB — saves ~1.5 GB on the T2-match peak.
  - Tiled gathers in T1 sort + T2 sort meta + T2 sort xbits. Each gather output is produced in N=4 tiles, D2H'd to pinned host memory (reusing the existing parking buffers) one tile at a time, then rebuilt on device after the cap-sized inputs are freed. Drops each gather peak from 5200 MB → ~3640 MB.
  - Sliced T1 match. N passes (one per section_l) emit to a cap/N device staging pair, with D2H per pass to pinned host memory. d_xs (2048 MB at k=28) no longer co-resides with full-cap d_t1_meta + d_t1_mi → the T1-match peak drops from 5168 MB → 3023 MB.
  - Sliced T3 match. d_t2_meta_sorted is parked on host across T3 match; each pass H2Ds the (section_l, section_r) row slices onto a small device buffer pair. d_t2_xbits_sorted + d_t2_keys_merged remain full-cap on device for binary-search / target reads. T3-match peak: 5200 MB → 3754 MB.
  - Per-tile CUB outputs in T1/T2/T3 sort sub-phases. T1 and T2 sort use cap/2 / cap/4 device output buffers respectively, with D2H per tile to USM-host accumulators and the existing 2-way merge kernel reading USM-host inputs. T2 additionally parks AB / CD intermediates to host between tree steps so the final merge sees only its own outputs. T3 sort uses a cap/2 tile + host-side `std::inplace_merge`. CUB sub-phase peaks: 4170-4228 MB → 3155-3640 MB.
  - Tiled Xs gen+sort+pack. N=2 position halves through cap/2 ping-pong buffers + a USM-host accumulator + 2-way merge, then pack runs in cap/2 halves with D2H per tile to a host-pinned `XsCandidateGpu` accumulator (the final d_xs is rehydrated H2D). Xs phase peak: 4128 MB → 3072 MB.

  The bottleneck after all six cuts is the T3 match phase at 3754 MB. Targets 5 GiB+ cards comfortably (RTX 2060, RX 6600 XT, RX 7600 with ~1.7+ GiB headroom). 4 GiB cards (GTX 1050 Ti / 1650, RTX 3050 4GB, MX450) are an edge case — real 4 GiB physical hardware reports ~3.5 GiB free post-CUDA-context, just under the 3.80 GiB required floor. Trade-off: ~6 extra cap-sized PCIe round-trips per plot push k=28 wall on sm_89 from ~13 s/plot (compact) to ~34 s/plot (minimal). There is no smaller tier — a forced minimal on a card below the floor throws rather than falling further.
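The cuts above all share one mechanism: stream a cap-sized result through a cap/N staging buffer, copying each tile out to a host-side accumulator before the staging slot is reused. A minimal Python stand-in of that pattern follows; the doubling "kernel" is a placeholder and none of these names appear in the real code — it only illustrates why peak residency shrinks to one tile while round-trip count grows with N.

```python
def tiled_pass(src, n_tiles):
    """Run a per-element placeholder kernel over src in n_tiles passes,
    reusing one src_len/n_tiles staging slot (the cap/N device buffer
    stand-in) and accumulating each tile into a host-side output (the
    pinned-host parking buffer stand-in)."""
    assert len(src) % n_tiles == 0
    tile = len(src) // n_tiles
    host_acc = [0] * len(src)
    for t in range(n_tiles):
        lo = t * tile
        # "H2D + kernel" on the staging slot: here, just double each element.
        stage = [x * 2 for x in src[lo:lo + tile]]
        # Per-tile "D2H" into the host accumulator frees the slot for reuse.
        host_acc[lo:lo + tile] = stage
    return host_acc
```

Peak "device" residency is one tile instead of the whole array, at the cost of n_tiles round-trips — the same trade that moves minimal-tier wall time from ~13 s to ~34 s per plot.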
At pool construction xchplot2 queries `cudaMemGetInfo` on the CUDA-only build, or `global_mem_size` (device total) on the SYCL path — SYCL has no portable free-memory query, so the check effectively approximates "free == total" and lets the actual `malloc_device` failure trigger the fallback. If the pool doesn't fit, the streaming-tier dispatch picks the largest tier that fits with the 128 MB margin: plain if free ≥ 7.42 GiB, else compact if free ≥ 5.33 GiB, else minimal. `XCHPLOT2_STREAMING=1` forces streaming even when the pool would fit; `--tier plain|compact|minimal` (or `XCHPLOT2_STREAMING_TIER`) overrides the auto-pick. A forced plain or compact below its floor warns and proceeds (caller's risk); a forced minimal below its floor throws because there is no smaller tier to fall back to.
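The dispatch just described can be condensed into a short sketch. The floors are the GiB figures quoted above (already inclusive of the 128 MB margin); the function and dict names are illustrative, not the real API.

```python
GIB = 1 << 30
# Per-tier free-VRAM floors at k=28, as documented above.
FLOOR = {
    "plain": int(7.42 * GIB),
    "compact": int(5.33 * GIB),
    "minimal": int(3.80 * GIB),
}


def pick_tier(free_bytes, forced=None):
    """Sketch of the streaming-tier dispatch.

    free_bytes: reported free VRAM. forced: mirrors --tier /
    XCHPLOT2_STREAMING_TIER. Forced plain/compact below floor warns
    and proceeds; forced minimal below floor throws (no smaller tier).
    """
    if forced is not None:
        if free_bytes < FLOOR[forced]:
            if forced == "minimal":
                raise RuntimeError("below the minimal floor; no smaller tier exists")
            print(f"warning: forcing {forced} below its floor (caller's risk)")
        return forced
    for tier in ("plain", "compact", "minimal"):  # largest tier that fits
        if free_bytes >= FLOOR[tier]:
            return tier
    raise RuntimeError("free VRAM below the minimal streaming floor")
```

For example, a 10 GB card (≈ 9.3 GiB free) lands on plain, an 8 GB card on compact, and a 6 GB card with ~5.5 GiB free sits just above the compact floor.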
Plot output is bit-identical across all four paths — streaming reorganises memory, not algorithms. Verified at k=22 with md5sum across pool / plain / compact / minimal.
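To reproduce that cross-tier parity check yourself, hashing each output and comparing digests is enough. A small sketch (md5 to match the `md5sum` check mentioned above; the function name and paths are the caller's, not part of xchplot2):

```python
import hashlib
from pathlib import Path


def plots_identical(paths):
    """True iff every file in paths has the same md5 digest — the same
    check as running md5sum over the pool / plain / compact / minimal
    outputs of the same plot and eyeballing the hashes."""
    digests = {hashlib.md5(Path(p).read_bytes()).hexdigest() for p in paths}
    return len(digests) == 1
```

Note this reads each file fully into memory; for full-size .plot2 files you would hash in chunks, but the digest comparison is the same.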
k=28, strength=2, RTX 4090 (sm_89), PCIe Gen4 x16. Steady-state per-plot wall from `xchplot2 batch` (10-plot manifest, mean):

| Build | Per plot | Notes |
|---|---|---|
| pos2-chip CPU baseline | ~50 s | reference |
| `cuda-only` branch | 2.15 s | original CUDA-only path |
| `main`, `XCHPLOT2_BUILD_CUDA=ON` (CUB sort) | 2.41 s | NVIDIA fast path on the SYCL/AdaptiveCpp port |
| `main`, `XCHPLOT2_BUILD_CUDA=OFF` (hand-rolled SYCL radix) | 3.79 s | cross-vendor fallback (AMD/Intel) on AdaptiveCpp |
| plain streaming tier (10-11 GB cards) | ~5.7 s | no parks, single-pass T2 match; ~400 ms/plot faster than compact |
| compact streaming tier (6-8 GB cards) | ~7.3 s | full parks + N=2 T2 match |
| minimal streaming tier (4 GiB cards) | TBD | full parks + N=8 T2 match; smallest peak (~3.7 GB) |
| `main` on RX 6700 XT (gfx1031 / ROCm 6.2 / AdaptiveCpp HIP) | 9.97 s | AMD batch steady-state at k=28; T-table AES near-optimal on RDNA2 via this compiler stack |
The main/CUB row is +12% over cuda-only from extra AdaptiveCpp scheduling overhead. The SYCL row is +57% over CUB on the same NVIDIA hardware; ~88% of GPU compute is identical between the two paths (nsys per-kernel breakdown), and the gap is dominated by host-side runtime overhead in AdaptiveCpp's DAG manager rather than kernel performance. AMD is measured in the RX 6700 XT row; Intel runtimes remain untested — expect roughly the SYCL-row latency adjusted for relative GPU throughput.
Numbers above are single-GPU. With --devices 0,1,... the batch is
partitioned round-robin across N worker threads (one per device), so
wall-clock throughput is bounded by the slowest device's slice —
≈ linear scaling on matched cards, less if cards differ in speed.
Live multi-GPU plots were confirmed end-to-end on NVIDIA; per-device
numbers will vary with PCIe bandwidth sharing on the host root
complex.
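A sketch of the round-robin policy and the wall-clock bound it implies (helper names are illustrative; the real partitioning happens inside the batch driver's worker threads):

```python
def round_robin(plots, n_devices):
    """Partition a batch across device workers as --devices 0,1,...
    does: slice i (every n_devices-th plot, starting at i) goes to
    device i."""
    return [plots[i::n_devices] for i in range(n_devices)]


def wall_clock(per_plot_s, n_devices):
    """Batch wall time is bounded by the slowest device's slice: each
    worker plots its slice serially while workers run in parallel.
    Assumes matched cards (same per-plot time everywhere)."""
    slices = round_robin(per_plot_s, n_devices)
    return max(sum(s) for s in slices)
```

With matched cards the slices are equal-sized (to within one plot), which is where the near-linear scaling comes from; a slower card's slice takes longer and sets the batch wall time.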
MIT — see LICENSE and NOTICE for third-party attributions. Built collaboratively with Claude.
If you appreciate this and want to give back, feel free.
xch1d80tfje65xy97fpxg7kl89wugnd6svlv5uag2qays0um5ay5sn0qz8vph8