TurboQuant is a specialized, high-performance vector quantization library designed for Large Language Models (LLMs) and vector search applications. residual compensation.
- Turbo-Charged C++ Core: Core operations like rotation, projection, and quantization are implemented in optimized C++ for millisecond-level inference.
- Lloyd-Max Optimization: Automatically computes the most efficient centroids for Gaussian distributions using Scipy's K-Means.
- Unbiased Residual Compensation: Uses QJL signs to preserve vector magnitude and direction, minimizing cumulative error in deep networks.
-
Smart Matrix Caching: Automatically caches trained centroids (
$\mathcal{C}$ ) and orthogonal matrices ($\Pi, S$ ) for instant engine startup. -
Adaptive Dimension Support: Fully compatible with any dimension
$d$ and any bit-rate$b$ (from 1-bit to 8-bit). TurboQuant is a high-performance quantization library designed for Large Language Models (LLMs) and vector search applications. By offloading core computations to C++ and integrating mathematical optimization, TurboQuant significantly reduces memory overhead while maintaining near-lossless precision.
- Extreme C++ Acceleration: Core operations like Matrix Rotation, Projection, and quantization logic are deeply optimized using C++/LibTorch to achieve millisecond-level inference.
- Lloyd-Max Mathematical Optimization: Automatically calculates optimal centroids for Gaussian distributions using Scipy-based K-Means, ensuring high-precision quantization.
- Unbiased Residual Compensation: Utilizes QJL sign bits to preserve vector direction and magnitude, solving cumulative error issues in deep neural networks.
- Intelligent Matrix Caching : Automatically caches trained centroids and orthogonal matrices (Pi, S) for "instant-on" engine startup.
- High Elasticity & Customization: Supports arbitrary dimensions d and dynamic bitrate switching from 1-bit to 8-bit.
- Y-Axis (Fidelity): Higher Cosine Similarity means more accurate vector reconstruction.
- X-Axis (Latency): Lower values indicate faster C++/LibTorch execution.
- Bubble Size (Memory): Larger bubbles represent higher memory compression ratios.
- 1-bit: 32x Compression (Largest Bubble)
- 2-bit: 16x Compression
- 4-bit: 8x Compression
- Int8 (Baselines): 4x Compression (Smallest Bubble)
# Clone the repository
git clone https://github.com/ericoder960803/TurboQuant.git
cd TurboQuant
# Install in editable mode (Builds C++ extension automatically)
pip install -e .TurboQuant provides a seamless PyTorch-like API. You can easily integrate it into your inference pipeline.
import torch
from turboquant import TurboQuantEngine
# 1. Initialize the engine
# d: vector dimension, b: target bit-rate (1, 2, 4, or 8)
d = 1024
b = 2
engine = TurboQuantEngine(d=d, b=b, cache=True)
# 2. Prepare your high-precision vector (FP32)
x = torch.randn(d)
# 3. Encode (Compression)
# idx: Lloyd-Max centroids indices
# qjl: 1-bit residual signs
# gamma: Dynamic scaling factor for reconstruction
idx, qjl, gamma = engine.encode(x)
# 4. Decode (Decompression)
x_hat = engine.decode(idx, qjl, gamma)
# 5. Check Fidelity
similarity = torch.nn.functional.cosine_similarity(x.unsqueeze(0), x_hat.unsqueeze(0))
print(f"Reconstruction Cosine Similarity: {similarity.item():.4f}")Ideal for long-context LLM inference (e.g., Llama-3) where KV-cache memory is the primary bottleneck.
import torch
from turboquant import TurboQuantEngine
class TurboQuantKVCache:
def __init__(self, dim=4096, bits=2):
# 2-bit quantization compresses 4KB into 0.25KB
self.engine = TurboQuantEngine(d=dim, b=bits, cache=True)
self.cache = []
def push(self, key_tensor):
"""Compress and store in cache"""
packet = self.engine.encode(key_tensor)
self.cache.append(packet)
def fetch_all(self):
"""Decompress all vectors for Attention calculation"""
if not self.cache: return None
return torch.stack([self.engine.decode(*p) for p in self.cache])
# Usage
kv_manager = TurboQuantKVCache(dim=4096, bits=2)
kv_manager.push(torch.randn(4096)) # Encode new key
keys = kv_manager.fetch_all() # Restore for Attention [Seq_Len, Dim]Enables high-fidelity vector databases or RAG systems with minimal storage footprint.
import torch
from turboquant import TurboQuantEngine
# 1. Setup Database (10,000 vectors)
D, B = 1024, 2
engine = TurboQuantEngine(d=D, b=B)
database = torch.randn(10000, D)
# 2. Offline Compression
compressed_db = [engine.encode(v) for v in database]
# 3. Online Search
query = torch.randn(D)
reconstructed_db = torch.stack([engine.decode(*p) for p in compressed_db])
scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), reconstructed_db)
# 4. Get Top-K
top_values, top_indices = torch.topk(scores, k=5)
print(f"Top Indices: {top_indices.tolist()}")The reconstruction
-
$\Pi$ : Orthogonal Rotation Matrix -
$\mathcal{C}$ : Lloyd-Max Optimal Centroids -
$S$ : QJL Projection Matrix
If you find TurboQuant-PyTorch useful in your research or project, please cite the original paper:
@article{zandieh2025turboquant,
title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
journal={arXiv preprint arXiv:2504.19874},
year={2025}
}
@misc{ericliam2026turboquant,
author = {Eric Liam},
title = {TurboQuant-PyTorch: High-Performance C++/LibTorch Implementation},
year = {2026},
publisher = {GitHub},
howpublished = {\url{[https://github.com/ericoder960803/TurboQuant-PyTorch](https://github.com/ericoder960803/TurboQuant-PyTorch)}}
}