[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74247":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74247,"turboquant","0xSero\u002Fturboquant","0xSero","TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration",null,"Python",1544,180,17,8,0,47,82,204,141,19.77,"GNU General Public License v3.0",false,"main",true,[],"2026-06-12 02:03:24","# TurboQuant: KV Cache Compression for LLM Inference\n\nImplementation of TurboQuant KV cache compression (ICLR 2026, arXiv:2504.19874) with vLLM integration. Tested on dense and MoE architectures across RTX 3090 and RTX 5090 GPUs.\n\n## Benchmark Results\n\n### RTX 5090 (32GB) -- Qwen3.5-27B-AWQ (dense, 4-bit weights, TP=1)\n\n**Setup**: Single RTX 5090, vLLM 0.18.0, `gpu_memory_utilization=0.90`, 16 full-attention layers out of 64 total (rest are linear-attention).\n\n| Metric | Baseline (bf16 KV) | TurboQuant (3b key \u002F 2b val) |\n|--------|-------------------|------------------------------|\n| Prefill tok\u002Fs (30k ctx) | 1,804 | 1,907 (+5.7%) |\n| Decode tok\u002Fs (30k ctx) | 1.264 | 1.303 (+3.1%) |\n| KV cache freed | -- | **30.0 GB** (across 4 GPUs) |\n| Max token capacity | 457,072 | **914,144** (2.0x) |\n| Peak activation memory | 644.6 MB | 599.2 MB (-7.0%) |\n\n### 8x RTX 3090 (24GB each) -- Qwen3.5-35B-A3B MoE (pruned, 205 experts, TP=8)\n\n**Setup**: 8x RTX 3090, vLLM 0.18.0, `gpu_memory_utilization=0.92`, AMD EPYC 7443P 24-Core, 504GB RAM. Model has 10 full-attention layers + 30 linear-attention layers (40 total). TQ compresses only the 10 full-attention layers.\n\n#### Throughput & Latency (Baseline, bf16 KV)\n\n| Context | Prefill tok\u002Fs | Decode tok\u002Fs | TTFT (s) | Needles Found |\n|--------:|--------------:|-------------:|---------:|--------------:|\n| 1,000 | 7,127 | 129.7 | 0.14 | 4\u002F5 |\n| 4,000 | 8,887 | 131.5 | 0.45 | 4\u002F5 |\n| 8,000 | 9,684 | 131.1 | 0.83 | 4\u002F5 |\n| 16,000 | 9,933 | 133.0 | 1.61 | 4\u002F5 |\n| 32,000 | 9,761 | 116.7 | 3.28 | 4\u002F5 |\n| 64,000 | 8,843 | 122.6 | 7.24 | 4\u002F5 |\n| 100,000 | 8,479 | 106.8 | 11.79 | 4\u002F5 |\n| 131,000 | 8,238 | 98.3 | 15.90 | 4\u002F5 |\n\n- **Prefill** saturates around 10k tok\u002Fs, degrades gently to 8.2k at 131k context.\n- **Decode** drops from 133 to 98 tok\u002Fs at 131k (KV readback cost from full-attention layers).\n- **TTFT** scales linearly with context length (purely compute-bound).\n- **Needles** 4\u002F5 found consistently at ALL context lengths -- the model reformats one answer.\n\n#### VRAM Breakdown (per GPU at 131k context)\n\n| Component | Size |\n|-----------|-----:|\n| Total VRAM | 24,576 MB |\n| Reserved (0.92 util) | 22,610 MB |\n| Model weights | ~6,750 MB |\n| KV cache pool | **9,035 MB** |\n| -- full_attention (10 layers) | 3,614 MB |\n| -- linear_attention (30 layers) | 5,421 MB |\n| CUDA overhead + graphs | ~6,825 MB |\n\n#### Baseline vs TurboQuant KV Cache\n\n| Context | Baseline KV\u002FGPU | TQ KV\u002FGPU | Savings\u002FGPU | Savings % |\n|--------:|----------------:|----------:|------------:|----------:|\n| 8,000 | 55.7 MB | 38.5 MB | **17.2 MB** | 30.9% |\n| 32,000 | 191.5 MB | 132.3 MB | **59.3 MB** | 30.9% |\n| 64,000 | 374.3 MB | 258.5 MB | **115.8 MB** | 30.9% |\n| 100,000 | 578.1 MB | 399.2 MB | **178.8 MB** | 30.9% |\n| 131,000 | 755.7 MB | 521.9 MB | **233.8 MB** | 30.9% |\n\n- Savings are **30.9% of total KV** because TQ only compresses the 10 full-attention layers (40% of KV).\n- The 30 linear-attention layers (60% of KV) are **not compressible** by TQ.\n- On a **pure dense transformer**, savings would be **77%** (4.4x compression).\n\n#### Context Extension\n\n| | Tokens | Multiplier |\n|---|-------:|:----------:|\n| Baseline capacity | 1,411,680 | 1.0x |\n| With TQ | 2,043,808 | **1.45x** |\n\nAlternatively, freed VRAM supports **3 additional concurrent 131k-context requests**.\n\n#### Coherence & Quality\n\n| Test | Result |\n|------|--------|\n| Single needle (512-131k tokens) | **PASS** at all lengths |\n| 5-needle at near-max context | **5\u002F5** retrieved |\n| 3-needle multi-fact coherence | **3\u002F3** retrieved |\n| Golden ratio completion (all lengths) | **PASS**, perplexity 1.05-1.35 |\n| Math reasoning at max context | Coherent (model math error from pruning, not context) |\n\n#### TQ Quantization Quality (head_dim=256, measured on GPU)\n\n| Component | cos_sim | Notes |\n|-----------|--------:|-------|\n| TQ key compression (3-bit) | **1.000000** | Near-lossless |\n| TQ key compression (4-bit) | **1.000000** | Near-lossless |\n| Value quantization (2-bit) | 0.940 | Bottleneck for quality |\n| Value quantization (4-bit) | 0.997 | Recommended for quality-sensitive use |\n| Combined (3b key + 2b val) | 0.940 | Value quant dominates degradation |\n\n#### GPU Utilization During Inference\n\n| Context | Peak VRAM\u002FGPU | GPU Util | CPU % | Power |\n|--------:|--------------:|---------:|------:|------:|\n| 1,000 | 22,284 MB | 0% idle | 0.2% | 132W |\n| 32,000 | 22,286 MB | 57% peak | 0.4% | 142W |\n| 131,000 | 22,306 MB | 0% idle | 0.4% | 130W |\n\n- VRAM is **essentially flat** -- KV cache at 131k is only 190 MB\u002FGPU (0.8% of VRAM).\n- No CPU offloading. No KV offloading. Everything fits in VRAM.\n- GPU interconnect is **PCIe** (no NVLink) -- NODE topology between all GPUs.\n\n### Paper Validation (Theorems 1-3)\n\n9 tests validating the paper's theoretical claims:\n\n| Claim | Verdict | Details |\n|-------|---------|---------|\n| MSE distortion bounds (Thm 1) | **PASS** | Within bounds for unit-norm vectors |\n| Codebook MSE matches Table 1 | **PASS** | Lloyd-Max codebook is faithful |\n| Unbiasedness (Thm 2) | **PASS** | Relative bias \u003C 0.1% |\n| Distortion 1\u002F4^b scaling (Thm 3) | **PASS** | 2-bit=0.70x, 3-bit=0.82x, 4-bit=0.97x of bound |\n| Recall@8 (3-bit, N=4096) | **0.55** | Paper threshold met (>=0.40) |\n| Rank correlation (N=2048) | **PASS** | Spearman rho > 0.85 |\n| Needle retrieval | **PASS** | Works at all SNR levels |\n| Compression ratio | **4.41x** | At head_dim=256 on full-attention layers |\n\n### Adversarial Audit\n\nHonest assessment of claims (see `audit_claims.py` for full data):\n\n| Claim | Verdict |\n|-------|---------|\n| \"5.1x compression\" | **Misleading** -- doesn't count Pi\u002FS matrices or ring buffer. Honest: ~4.6x at 4k tokens, ~5x at 32k+ |\n| \"Needle-in-haystack passes\" | **True but trivial** -- query=key test is too easy. Real LLM queries are not copies of keys |\n| \"Recall@8 >= 0.40\" | **Low bar** -- 3-bit recall@1 is only 38%. BUT dominant attention tokens are always preserved |\n| \"Hybrid decode saves memory\" | **Storage yes, compute no** -- dequantizes all history to float32 per decode step |\n| \"Distortion follows 1\u002F4^b\" | **True** -- initial audit was wrong (unnormalized vectors). Unit-norm: within bound |\n| \"30k TQ is faster\" | **Within noise** -- N=1 run, total wall time TQ is actually slower |\n| \"200k context works\" | **Unverified** -- didn't crash, but output quality never checked |\n| \"2x context on dense model\" | **True** -- measured 30 GB freed on Qwen3.5-27B with 4x RTX 3090 |\n\n## How It Works\n\nTurboQuant compresses KV cache entries using:\n1. **Random orthogonal rotation** to spread information across dimensions\n2. **Lloyd-Max optimal scalar quantization** (b-1 bits) on Beta-distributed rotated values\n3. **QJL projection** for residual sign bits (1 bit per dimension)\n4. **Group quantization** for values (2-bit or 4-bit, per-group scales and zeros)\n5. **Bit-packing**: 4 values per byte (2-bit) or 2 per byte (4-bit)\n\nThe combined estimator is **unbiased**: E[estimated inner product] = true inner product.\n\n## Architecture\n\n```\nturboquant\u002F\n  codebook.py          # Lloyd-Max optimal scalar quantizer for Beta distribution\n  codebooks\u002F           # Pre-generated codebook files (d=128\u002F256, bits 2\u002F3\u002F4)\n  rotation.py          # Random orthogonal rotation + QJL projection matrices\n  quantizer.py         # TurboQuantMSE + TurboQuantProd (Algorithms 1 & 2)\n  kv_cache.py          # KV cache manager with value bit-packing\n  capture.py           # Modular KV capture hooks for attention layers\n  store.py             # Compressed KV store (quantize + append + flat cache)\n  score.py             # Attention scoring from compressed keys\n  integration\u002Fvllm.py  # vLLM adapter (monkey-patch, free_kv_cache, hybrid decode)\n  triton_kernels.py    # 3 fused Triton kernels for decode attention\n  vllm_attn_backend.py # Thin shim delegating to integration\u002Fvllm.py\n\nvalidate_paper.py      # 9 tests validating Theorems 1-3\naudit_claims.py        # Adversarial audit of all claims\ntest_modular.py        # 19 modular architecture tests\ntest_turboquant.py     # 7 core quantizer tests\nproof.py               # A\u002FB benchmark (baseline vs TQ)\n\n# Profiling scripts (8x RTX 3090 MoE validation)\nvalidate_moe.py        # Baseline measurements via vLLM API\nvalidate_moe_phase2.py # TQ quality on real GPU with head_dim=256\nvalidate_moe_phase3.py # Logprobs, multi-needle, reasoning at max context\nprofile_100k.py        # Full profiling at 1k-131k context\nprofile_large.py       # Large context (64k-131k) with file-based payloads\nbaseline_vs_tq.py      # VRAM comparison: baseline bf16 vs TQ compressed\nbaseline_vs_tq_v2.py   # Block-level measurement during inference\n```\n\n## Usage\n\n```bash\npip install -e .\n\n# Run paper validation (CPU, no GPU needed)\npython validate_paper.py\n\n# Run adversarial audit\npython audit_claims.py\n\n# Run modular tests\npython -m pytest test_modular.py -v\n\n# Run proof benchmark (requires 4x RTX 3090 + Qwen3.5-27B-AWQ)\nCUDA_VISIBLE_DEVICES=0,1,4,6 python proof.py\n```\n\n## Test Results\n\nAll 35 tests pass:\n- `test_modular.py`: 19\u002F19 (modular architecture)\n- `test_turboquant.py`: 7\u002F7 (core quantizer)\n- `validate_paper.py`: 9\u002F9 (paper theorem validation)\n\n## Limitations\n\n- **Prefill still uses paged cache**: KV cache is allocated at engine init and used during prefill. TQ frees it after. True zero-allocation requires deeper vLLM integration.\n- **Only full-attention layers**: Linear-attention\u002FMamba layers are not compressed.\n- **Value quantization is the bottleneck**: 2-bit values cause cos_sim=0.94 degradation. Use 4-bit values (cos_sim=0.997) for quality-sensitive workloads.\n- **Hybrid decode dequantizes all history**: During compute, all compressed tokens are expanded to float32. The paper's fused Triton kernels exist but the hybrid path doesn't use them yet.\n- **MoE models benefit less**: Models with linear-attention layers (Qwen3.5 MoE, Mamba hybrids) have incompressible state that limits TQ's overall impact.\n\n## Environment\n\nTested on:\n- vLLM 0.18.0, PyTorch 2.10, CUDA 12.8\n- RTX 5090 (32GB) -- Qwen3.5-27B-AWQ, single GPU\n- 8x RTX 3090 (24GB) -- Qwen3.5-35B-A3B MoE, TP=8\n- Python 3.12\n","TurboQuant 是一个用于大语言模型推理的KV缓存量化工具，通过3比特键和2比特值实现近似最优的压缩，并集成了Triton内核与vLLM。其核心功能包括高效的KV缓存压缩，显著减少内存占用并提升处理能力。例如，在RTX 5090上运行Qwen3.5-27B-AWQ模型时，TurboQuant不仅将最大token容量翻倍，还降低了峰值激活内存消耗。此外，它在多GPU环境下同样表现出色，如8块RTX 3090组成的集群中，针对特定架构的模型能够节省约30.9%的KV缓存空间。此项目特别适用于需要优化资源使用效率的大规模语言模型部署场景，尤其是在计算资源有限但对性能要求较高的环境中。",2,"2026-06-11 03:49:39","high_star"]