TurboQuant-PyTorch

High-Performance Vector Quantization Engine
An implementation of the TurboQuant paper in PyTorch and C++

English Version

TurboQuant is a specialized, high-performance vector quantization library designed for Large Language Models (LLMs) and vector search applications. It combines Lloyd-Max quantization with QJL-based residual compensation.

Key Features

Turbo-Charged C++ Core: Core operations like rotation, projection, and quantization are implemented in optimized C++ for millisecond-level inference.
Lloyd-Max Optimization: Automatically computes the most efficient centroids for Gaussian distributions using SciPy's K-Means.
Unbiased Residual Compensation: Uses QJL signs to preserve vector magnitude and direction, minimizing cumulative error in deep networks.
Smart Matrix Caching: Automatically caches trained centroids ($\mathcal{C}$) and orthogonal matrices ($\Pi, S$) for instant engine startup.
Adaptive Dimension Support: Fully compatible with any dimension $d$ and any bit-rate $b$ (from 1-bit to 8-bit).

TurboQuant is a high-performance quantization library designed for Large Language Models (LLMs) and vector search applications. By offloading core computations to C++ and integrating mathematical optimization, TurboQuant significantly reduces memory overhead while maintaining near-lossless precision.

Key Highlights

Extreme C++ Acceleration: Core operations such as matrix rotation, projection, and quantization are implemented in C++/LibTorch for millisecond-level inference.
Lloyd-Max Mathematical Optimization: Automatically calculates optimal centroids for Gaussian distributions using SciPy-based K-Means, ensuring high-precision quantization.
Unbiased Residual Compensation: Uses QJL sign bits to preserve vector direction and magnitude, reducing cumulative error in deep neural networks.
Intelligent Matrix Caching: Automatically caches trained centroids and orthogonal matrices ($\Pi$, $S$) for "instant-on" engine startup.
High Elasticity & Customization: Supports arbitrary dimensions $d$ and dynamic bit-rate switching from 1-bit to 8-bit.
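
To make the Lloyd-Max step in the feature list more concrete, here is a minimal, dependency-free sketch. It is not the project's code: where the README describes SciPy-based K-Means, this stand-in uses the closed-form Lloyd-Max update for a unit Gaussian (each centroid is the conditional mean of its cell). The helper name `lloyd_max_gaussian` and its `iters` parameter are hypothetical.

```python
import math


def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(bits, iters=200):
    """Return 2**bits centroids for a unit Gaussian (hypothetical helper)."""
    k = 2 ** bits
    # Start from centroids spread evenly over [-2, 2].
    centroids = [-2.0 + 4.0 * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        # Cell boundaries sit halfway between neighbouring centroids.
        inner = [(centroids[i] + centroids[i + 1]) / 2.0 for i in range(k - 1)]
        edges = [-math.inf] + inner + [math.inf]
        # New centroid = mean of the Gaussian restricted to each cell.
        centroids = [
            (_pdf(edges[i]) - _pdf(edges[i + 1]))
            / (_cdf(edges[i + 1]) - _cdf(edges[i]))
            for i in range(k)
        ]
    return centroids


one_bit = lloyd_max_gaussian(1)  # two centroids at about -0.798 and +0.798
two_bit = lloyd_max_gaussian(2)  # four centroids, symmetric around zero
```

For 1 bit the centroids are ±sqrt(2/π); for 2 bits they settle near ±0.453 and ±1.510. In practice the centroids would be rescaled to the variance of the rotated coordinates, which this sketch leaves out.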

Comparison

How to read the benchmark chart:

Y-Axis (Fidelity): Higher Cosine Similarity means more accurate vector reconstruction.
X-Axis (Latency): Lower values indicate faster C++/LibTorch execution.
Bubble Size (Memory): Larger bubbles represent higher memory compression ratios, measured against 32-bit floats:
- 1-bit: 32x Compression (Largest Bubble)
- 2-bit: 16x Compression
- 4-bit: 8x Compression
- Int8 (Baselines): 4x Compression (Smallest Bubble)
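
The compression ratios listed above follow from dividing a 32-bit float baseline by the bits stored per coordinate. The short sketch below reproduces that arithmetic. The function names are illustrative only, and the figures are nominal: they ignore any per-vector overhead such as the scaling factor.

```python
def nominal_compression_ratio(bits, baseline_bits=32):
    """Ratio of baseline storage to quantized storage per coordinate."""
    return baseline_bits / bits


def bytes_per_vector(dim, bits):
    """Nominal payload size of one vector at the given bit width."""
    return dim * bits / 8


# Ratios for the bit widths shown in the chart legend.
ratios = {bits: nominal_compression_ratio(bits) for bits in (1, 2, 4, 8)}

# A 4096-dim vector: FP32 baseline versus 2-bit codes.
fp32_bytes = bytes_per_vector(4096, 32)
two_bit_bytes = bytes_per_vector(4096, 2)
```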

Installation

# Clone the repository
git clone https://github.com/ericoder960803/TurboQuant.git

cd TurboQuant

# Install in editable mode (Builds C++ extension automatically)
pip install -e .

Usage

TurboQuant provides a seamless PyTorch-like API. You can easily integrate it into your inference pipeline.

Quick Start

import torch
from turboquant import TurboQuantEngine

# 1. Initialize the engine
# d: vector dimension, b: target bit-rate (1, 2, 4, or 8)
d = 1024
b = 2
engine = TurboQuantEngine(d=d, b=b, cache=True)

# 2. Prepare your high-precision vector (FP32)
x = torch.randn(d)

# 3. Encode (Compression)
# idx: Lloyd-Max centroid indices
# qjl: 1-bit residual signs
# gamma: Dynamic scaling factor for reconstruction
idx, qjl, gamma = engine.encode(x)

# 4. Decode (Decompression)
x_hat = engine.decode(idx, qjl, gamma)

# 5. Check Fidelity
similarity = torch.nn.functional.cosine_similarity(x.unsqueeze(0), x_hat.unsqueeze(0))
print(f"Reconstruction Cosine Similarity: {similarity.item():.4f}")

Usage Examples

1. LLM KV-Cache Management (16x Memory Saving)

Ideal for long-context LLM inference (e.g., Llama-3) where KV-cache memory is the primary bottleneck.

import torch
from turboquant import TurboQuantEngine

class TurboQuantKVCache:
    def __init__(self, dim=4096, bits=2):
        # At 2 bits per coordinate, a 4096-dim FP32 key (16 KB) nominally shrinks to about 1 KB
        self.engine = TurboQuantEngine(d=dim, b=bits, cache=True)
        self.cache = [] 

    def push(self, key_tensor):
        """Compress and store in cache"""
        packet = self.engine.encode(key_tensor)
        self.cache.append(packet)

    def fetch_all(self):
        """Decompress all vectors for Attention calculation"""
        if not self.cache:
            return None
        return torch.stack([self.engine.decode(*p) for p in self.cache])


# Usage
kv_manager = TurboQuantKVCache(dim=4096, bits=2)
kv_manager.push(torch.randn(4096)) # Encode new key
keys = kv_manager.fetch_all()      # Restore for Attention [Seq_Len, Dim]
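
The memory saving in the title is only realized if the low-bit indices are stored densely, not one per byte or per integer. Whether and how TurboQuant packs its indices is not stated here, so the following is only a sketch of one common approach. `pack_indices` and `unpack_indices` are hypothetical helpers, not part of the library.

```python
def pack_indices(indices, bits):
    """Pack integers in [0, 2**bits) into bytes, lowest bits first."""
    if bits not in (1, 2, 4, 8):
        raise ValueError("this sketch only handles bit widths that divide 8")
    per_byte = 8 // bits
    out = bytearray((len(indices) + per_byte - 1) // per_byte)
    for i, value in enumerate(indices):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"index {value} does not fit in {bits} bits")
        out[i // per_byte] |= value << ((i % per_byte) * bits)
    return bytes(out)


def unpack_indices(packed, bits, count):
    """Inverse of pack_indices; count is the original number of indices."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    return [
        (packed[i // per_byte] >> ((i % per_byte) * bits)) & mask
        for i in range(count)
    ]


codes = [3, 0, 1, 2]                    # four 2-bit centroid indices
packed = pack_indices(codes, bits=2)    # fits in a single byte
restored = unpack_indices(packed, bits=2, count=len(codes))
```

With this layout, the 4096 two-bit indices of one key occupy 1024 bytes.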

2. High-Speed Vector Search

Enables high-fidelity vector databases or RAG systems with minimal storage footprint.

import torch
from turboquant import TurboQuantEngine

# 1. Setup Database (10,000 vectors)
D, B = 1024, 2
engine = TurboQuantEngine(d=D, b=B)
database = torch.randn(10000, D)

# 2. Offline Compression
compressed_db = [engine.encode(v) for v in database]

# 3. Online Search
query = torch.randn(D)
reconstructed_db = torch.stack([engine.decode(*p) for p in compressed_db])
scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), reconstructed_db)

# 4. Get Top-K
top_values, top_indices = torch.topk(scores, k=5)
print(f"Top Indices: {top_indices.tolist()}")

Mathematical Foundation

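The reconstruction $\hat{x}$ is computed as:

$$\hat{x} = \Pi^T \left( \mathcal{C}_{idx} + \gamma \cdot \sqrt{\frac{\pi}{2d}} \cdot S^T q_{\text{qjl}} \right)$$

Where:

$\Pi$: Orthogonal Rotation Matrix
$\mathcal{C}$: Lloyd-Max Optimal Centroids
$S$: QJL Projection Matrix
$idx$: Centroid indices returned by the encoder (`idx` in Quick Start)
$q_{\text{qjl}}$: 1-bit residual signs (`qjl` in Quick Start)
$\gamma$: Dynamic scaling factor for reconstruction (`gamma` in Quick Start)
$d$: Vector dimension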
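
As a reading aid, the reconstruction formula can be traced by hand on a tiny example. The sketch below is plain Python with list-of-rows matrices and is not the library's C++/LibTorch code. The function names, the argument order, and the assumption that the sign vector holds ±1 values are all illustrative.

```python
import math


def matvec_transposed(matrix, vector):
    """Return matrix^T @ vector for a list-of-rows matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        sum(matrix[i][j] * vector[i] for i in range(rows))
        for j in range(cols)
    ]


def reconstruct(rotation, centroids, idx, projection, qjl, gamma):
    """x_hat = Pi^T (C[idx] + gamma * sqrt(pi / (2d)) * S^T q)."""
    d = len(idx)
    scale = gamma * math.sqrt(math.pi / (2 * d))
    correction = matvec_transposed(projection, qjl)   # S^T q
    rotated = [centroids[i] + scale * c for i, c in zip(idx, correction)]
    return matvec_transposed(rotation, rotated)       # undo the rotation


# d = 2, identity rotation and projection, two centroids, signs (+1, -1).
identity = [[1.0, 0.0], [0.0, 1.0]]
x_hat = reconstruct(identity, [-1.0, 1.0], [0, 1], identity, [1.0, -1.0], gamma=2.0)

# Same codes, but with a 90-degree rotation as Pi.
quarter_turn = [[0.0, -1.0], [1.0, 0.0]]
x_hat_rotated = reconstruct(quarter_turn, [-1.0, 1.0], [0, 1], identity, [1.0, -1.0], gamma=2.0)
```

In the first call the scale is 2·sqrt(π/4) = sqrt(π), so the result is (sqrt(π) − 1, 1 − sqrt(π)).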

Citation

If you find TurboQuant-PyTorch useful in your research or project, please cite the original paper:

Original Paper (arXiv:2504.19874)

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}
@misc{ericliam2026turboquant,
  author = {Eric Liam},
  title = {TurboQuant-PyTorch: High-Performance C++/LibTorch Implementation},
  year = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/ericoder960803/TurboQuant-PyTorch}}
}
