A minimal PyTorch-compatible library supporting basic unstructured sparse operations (spops). Some of the kernels are borrowed from sputnik. Additionally, the kernels used in the Robust Adaptation (RoSA) paper (GitHub) are included in this repository.
spops builds a CUDA / CPU extension against your installed PyTorch, so PyTorch
(and a matching CUDA toolkit) must be available before the build. The
package uses a standard PEP 517 build (setup.py + pyproject.toml).
- Python 3.9+
- PyTorch 2.0+ (must be importable at build time)
- CUDA toolkit matching your PyTorch build (for the GPU extension)
- ninja, pybind11, numpy, setuptools, and wheel available in the build env
Because PEP 517 isolated builds create a fresh env without your CUDA-enabled PyTorch, you almost always want to build against the env you have:
pip install ninja pybind11 numpy scipy
pip install --no-build-isolation .

For local development:

pip install --no-build-isolation -e .

The build emits PTX/cubin for sm_80;sm_89;sm_90+PTX by default, i.e. A100,
L40/L40S, and H100, with PTX embedded for forward compatibility. Override via
TORCH_CUDA_ARCH_LIST for faster, GPU-specific builds:
TORCH_CUDA_ARCH_LIST="9.0" pip install --no-build-isolation . # H100
TORCH_CUDA_ARCH_LIST="8.9" pip install --no-build-isolation . # L40
TORCH_CUDA_ARCH_LIST="8.0;9.0+PTX" pip install --no-build-isolation . # A100 + JIT

Using with uv
Add spops as a path or git source and tell uv to skip build isolation so the
build can see the project's torch:
[tool.uv.sources]
spops = { path = "path/to/spops" } # built wheel install
# spops = { path = "path/to/spops", editable = true } # editable install
[tool.uv]
no-build-isolation-package = ["spops"]

Make sure torch, ninja, pybind11, numpy, setuptools, and wheel are
in your project's dependencies so they are present when uv builds spops.
An m x n sparse matrix with nnz non-zero values in spops is stored in CSR format, including the following lists:
- values: the list of non-zero values of the matrix, with length nnz.
- row_offsets: a list of m + 1 indices, where the i-th and (i+1)-th elements give the start and end of row i in the values list, respectively.
- col_idx: a list of nnz indices, storing the column index of each non-zero value.
- row_idx: a permutation of the numbers 0 to m-1, sorting the row indices based on the number of non-zeros.
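To make the layout concrete, here is a minimal pure-Python sketch that derives the four lists from a small dense matrix. The helper name dense_to_csr is hypothetical and not part of spops; it uses plain lists where real code would pass torch tensors, and it assumes row_idx is ordered by descending non-zero count.

```python
def dense_to_csr(dense):
    """Build values, row_offsets, col_idx and row_idx from a list-of-lists matrix.

    Hypothetical helper for illustration only; not part of spops.
    """
    values, col_idx, row_offsets = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # After each row, record where the next row starts in `values`.
        row_offsets.append(len(values))
    # Non-zero count per row is the difference of consecutive offsets.
    counts = [row_offsets[i + 1] - row_offsets[i] for i in range(len(dense))]
    # Assumption: rows are ordered by descending non-zero count (stable for ties).
    row_idx = sorted(range(len(dense)), key=lambda i: -counts[i])
    return values, row_offsets, col_idx, row_idx


dense = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 3.0, 4.0],
    [0.0, 0.0, 0.0, 0.0],
]
values, row_offsets, col_idx, row_idx = dense_to_csr(dense)
print(values)       # non-zero values in row-major order
print(row_offsets)  # m + 1 entries
print(col_idx)      # one column index per non-zero
print(row_idx)      # rows ordered from most to fewest non-zeros
```

Note that row_offsets has one more entry than the matrix has rows, and an empty row shows up as two equal consecutive offsets.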
Below you can find a list of supported operations and how to use them.
Add a sparse CSR matrix A to a dense matrix B using the spops.csr_add(A_val, A_row_offsets, A_row_indices, A_col_indices, B) method. This operation is used in the RoSA paper.
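As a rough guide to what this operation computes, the following pure-Python sketch adds each stored non-zero of A to the matching entry of B. The function csr_add_reference is hypothetical and only mirrors the arithmetic; it ignores row_idx (which only affects scheduling in the kernels) and returns a copy, which is an assumption rather than a statement about how spops.csr_add handles its output.

```python
def csr_add_reference(values, row_offsets, col_idx, dense):
    """Reference semantics of sparse (CSR) + dense; illustration only."""
    out = [row[:] for row in dense]  # copy so the input is left untouched
    for i in range(len(row_offsets) - 1):
        # Non-zeros of row i live in values[row_offsets[i]:row_offsets[i + 1]].
        for k in range(row_offsets[i], row_offsets[i + 1]):
            out[i][col_idx[k]] += values[k]
    return out


# A = [[0, 2, 0],
#      [1, 0, 3]] in CSR form
A_val, A_row_offsets, A_col_idx = [2.0, 1.0, 3.0], [0, 1, 3], [1, 0, 2]
B = [[10.0, 10.0, 10.0], [20.0, 20.0, 20.0]]
csr_add_result = csr_add_reference(A_val, A_row_offsets, A_col_idx, B)
print(csr_add_result)
```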
Multiply a sparse CSR matrix A into a dense matrix B, resulting in another dense matrix. Simply use the method spops.spmm(A_val, A_row_offsets, A_row_indices, A_col_indices, B, m), where m is the number of rows in A.
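The arithmetic behind this product can be sketched in pure Python as follows. The function spmm_reference is a hypothetical reference, not the spops kernel: it uses plain lists instead of tensors and omits the row-index argument, which is assumed to matter only for load balancing.

```python
def spmm_reference(values, row_offsets, col_idx, B, m):
    """Reference semantics of (sparse CSR A) @ (dense B); illustration only."""
    n_cols = len(B[0])
    out = [[0.0] * n_cols for _ in range(m)]
    for i in range(m):
        for k in range(row_offsets[i], row_offsets[i + 1]):
            a, j = values[k], col_idx[k]
            # Each non-zero A[i, j] scales row j of B and accumulates into row i.
            for c in range(n_cols):
                out[i][c] += a * B[j][c]
    return out


# A = [[0, 2, 0],
#      [1, 0, 3]] in CSR form, B is 3 x 2
A_val, A_row_offsets, A_col_idx = [2.0, 1.0, 3.0], [0, 1, 3], [1, 0, 2]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
spmm_result = spmm_reference(A_val, A_row_offsets, A_col_idx, B, m=2)
print(spmm_result)
```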
Multiply two dense matrices A and B, but only calculate the result for a sparse subset of the output elements. This operation is supported in spops.sddmm(out_row_offsets, out_row_indices, out_col_indices, A, BT), where BT is the transposed version of B, by two different kernels (specify using the backend argument):
- The sputnik kernel, which works with general sparsity patterns.
- The structure_aware kernel, specifically designed to leverage the sparsity masks that we observe in RoSA, where the non-zero values tend to cluster in a small subset of the rows/columns.
Default is structure_aware.
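To illustrate what SDDMM computes, here is a pure-Python sketch that evaluates A @ B only at the positions listed in the output mask. The function sddmm_reference is hypothetical; it assumes the result is the list of output values in the same order as out_col_indices, and it leaves out out_row_indices and the backend argument, since they do not change the arithmetic.

```python
def sddmm_reference(out_row_offsets, out_col_idx, A, BT):
    """Reference semantics of sampled dense-dense matmul; illustration only."""
    out_values = []
    for i in range(len(out_row_offsets) - 1):
        for k in range(out_row_offsets[i], out_row_offsets[i + 1]):
            j = out_col_idx[k]
            # Entry (i, j) of A @ B is the dot product of row i of A
            # with row j of BT (that is, column j of B).
            out_values.append(sum(a * b for a, b in zip(A[i], BT[j])))
    return out_values


A = [[1.0, 2.0], [3.0, 4.0]]               # 2 x 2
BT = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]  # transpose of a 2 x 3 matrix B
# Mask: row 0 keeps column 2; row 1 keeps columns 0 and 1.
out_row_offsets, out_col_idx = [0, 1, 3], [2, 0, 1]
sddmm_result = sddmm_reference(out_row_offsets, out_col_idx, A, BT)
print(sddmm_result)
```

Passing BT rather than B means every requested output entry reads two contiguous rows, which is the access pattern the sketch relies on.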
Transposes a CSR sparse matrix A. Use spops.csr_transpose(A_val, A_row_offsets, A_col_indices, m, n), where m and n are the number of rows and columns of A. Two backends are available via the backend argument:
- torch (default): uses PyTorch's built-in sparse CSR support; works for both CPU and CUDA tensors.
- scipy: uses scipy.sparse on the CPU.
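For intuition about what a CSR transpose has to do, the sketch below performs it in pure Python with a counting pass over the columns. The function csr_transpose_reference is hypothetical and is neither of the two backends; the order and form of the returned lists are assumptions made for this example.

```python
def csr_transpose_reference(values, row_offsets, col_idx, m, n):
    """Transpose an m x n CSR matrix into an n x m CSR matrix; illustration only."""
    # Count non-zeros per column of A; these are the row lengths of A^T.
    counts = [0] * n
    for j in col_idx:
        counts[j] += 1
    # Prefix sum of the counts gives the row offsets of A^T.
    t_row_offsets = [0] * (n + 1)
    for j in range(n):
        t_row_offsets[j + 1] = t_row_offsets[j] + counts[j]
    t_values = [0.0] * len(values)
    t_col_idx = [0] * len(values)
    next_slot = t_row_offsets[:n]  # next free position in each row of A^T
    for i in range(m):
        for k in range(row_offsets[i], row_offsets[i + 1]):
            j = col_idx[k]
            dest = next_slot[j]
            t_values[dest] = values[k]
            t_col_idx[dest] = i  # the old row index becomes the column index
            next_slot[j] += 1
    return t_values, t_row_offsets, t_col_idx


# A = [[0, 2, 0],
#      [1, 0, 3]] in CSR form (m = 2, n = 3)
t_values, t_row_offsets, t_col_idx = csr_transpose_reference(
    [2.0, 1.0, 3.0], [0, 1, 3], [1, 0, 2], m=2, n=3
)
print(t_values, t_row_offsets, t_col_idx)
```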
- Make sure that every input to the spops methods is contiguous.
- row_offsets should always be torch.int32.
- For the CUDA csr_add and the fp16 fast path of spmm, the other index lists (row_idx, col_idx) must be torch.int16. The fp32 CUDA paths and the sputnik SDDMM backend accept int32 and cast internally; the structure_aware SDDMM backend takes int16 directly.
- row_idx is not a per-nnz row label; it is a length-m permutation of row indices sorted by descending non-zero count, used by the underlying sputnik kernels for warp-level load balancing. The canonical construction is torch.argsort(-torch.diff(row_offsets)).int().
A pytest suite covering all four operations on both CPU and CUDA lives in
tests/. CUDA tests are skipped automatically when no GPU is available.
pip install pytest
pytest -q tests # full suite
pytest -q tests -k "not cuda" # CPU only

If you plan to use our work in your projects, please consider citing our paper:
@article{nikdan2024rosa,
title={RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation},
author={Nikdan, Mahdi and Tabesh, Soroush and Crnčević, Elvir and Alistarh, Dan},
journal={arXiv preprint arXiv:2401.04679},
year={2024}
}