[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83205":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":10,"trendingCount":15,"starSnapshotCount":15,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},83205,"KVarN","huawei-csl\u002FKVarN","huawei-csl","KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.03458",null,"Python",389,20,5,0,1,121,24,3.97,"Apache License 2.0",false,"main",[24,25,26,27,28,29,30],"agentic-ai","kv-cache","llm","llm-inference","long-context","quantization","vllm","2026-06-12 02:04:32","[![Built on vLLM](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBuilt%20on-vLLM%20v0.22.0-30a14e)](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2606.03458-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.03458)\n[![hf-space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Huawei%20CSL-ffc107?color=ffc107&logoColor=white)](https:\u002F\u002Fhuggingface.co\u002Fhuawei-csl)\n[![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuawei-csl\u002FKVarN?label=Stars&logo=github&logoColor=white&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fhuawei-csl\u002FKVarN\u002Fstargazers)\n\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"imgs\u002Flogo_600.png\" alt=\"KVarN\" width=\"640\">\n\u003C\u002Fp>\n\n> ⚡️ **Built for agentic and long-context workloads.**\n\n> 💡 KVarN delivers **3-5x more KV-cache capacity** and **up to ~1.3x the throughput** of FP16, so you fit far longer contexts and serve more concurrent requests, with **FP16-level accuracy**.\n\n> 🔌 **Calibration-free, plug-and-play with vLLM.** A native vLLM attention backend: add one flag, no model changes, no calibration.\n\n> 🥊 **Up to ~2.4× TurboQuant throughput**, same capacity, **higher accuracy**.\n\n---\n\n## Why KVarN (Variance Normalized KV-Cache)?\n\n> **kvarn** \u002Fkvɑːɳ\u002F &nbsp;·&nbsp; *noun* (Swedish)\n>\n> 1. A grinding apparatus used to reduce substances into smaller particles or\n>    powder, especially grains, seeds, spices, coffee beans, KV-caches.\n\nKV-cache quantization usually comes with a catch. As the\n[vLLM TurboQuant blog](https:\u002F\u002Fvllm.ai\u002Fblog\u002F2026-05-11-turboquant) shows, existing\nmethods buy extra KV-cache capacity but **give up throughput** (TurboQuant reports\n**40 to 52% lower throughput** for 2.3-3.7x capacity), and aggressive low-bit\nquantization also tends to **cost accuracy**. Losing both speed *and* quality is\nthe main reason KV-cache quantization is rarely turned on in production.\n\n**KVarN is built to keep both.** On Qwen3-32B (AIME25, 16K-context burst, TP=2) it\nmatches FP16 accuracy and **beats its throughput** while delivering ~4× the KV-cache capacity:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"imgs\u002Fpareto_qwen3-32b.png\" alt=\"KVarN vs FP16 vs TurboQuant: accuracy, throughput and capacity\" width=\"660\">\n\u003C\u002Fp>\n\nKVarN stays in the upper-right corner the blog's methods can't reach: **FP16-level\naccuracy, FP16-or-better throughput, and several times the context.**\n\n---\n\n## Quickstart\n\nKVarN ships as a vLLM fork. Install it like vLLM, then select the KVarN KV-cache dtype.\n\n```bash\n# 1. Clone\ngit clone https:\u002F\u002Fgithub.com\u002Fhuawei-csl\u002FKVarN.git\ncd KVarN\n\n# 2. Install (uses the upstream precompiled wheel; KVarN kernels are Triton, JIT-compiled at runtime)\nVLLM_USE_PRECOMPILED=1 pip install -e .\n```\n\n```python\nfrom vllm import LLM, SamplingParams\n\nllm = LLM(\n    model=\"Qwen\u002FQwen3-32B\",\n    dtype=\"float16\",                    # KVarN runs in float16\n    kv_cache_dtype=\"kvarn_k4v2_g128\",   # enable KVarN\n    block_size=128,                     # KVarN tile size\n)\nprint(llm.generate(\"Explain KV-cache quantization in one sentence.\",\n                    SamplingParams(max_tokens=64))[0].outputs[0].text)\n```\n\nServing works the same way:\n\n```bash\nvllm serve Qwen\u002FQwen3-32B --dtype float16 --kv-cache-dtype kvarn_k4v2_g128 --block-size 128\n```\n\n> **Note:** KVarN runs in `float16` compute. The tile \u002F page size is currently\n> fixed at 128 (one vLLM block = one KVarN tile); other page sizes are coming soon.\n\n> **Tip (capacity):** KVarN realizes its full KV-cache capacity when there is room\n> to amortize a small fixed decode workspace. On multi-GPU or generous\n> `--gpu-memory-utilization` setups this is automatic. On a tight single-GPU budget,\n> vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set\n> `VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0` (and\u002For raise\n> `--gpu-memory-utilization`) to recover the full capacity.\n\n---\n\n## How does KVarN work?\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"imgs\u002Fkvarn_pipeline.gif\" alt=\"KVarN pipeline: Cache, Rotated Cache, Normalized Cache, Quantized Cache\" width=\"760\">\n\u003C\u002Fp>\n\nKVarN quantizes the KV cache one fixed-size token tile at a time, walking each tile\nthrough the four stages above:\n\n1. **Cache**: the raw fp16 KV tile (channels × tokens), straight from attention.\n\n2. **Rotated Cache**: a **Hadamard rotation** along the channel dimension mixes\n   channels so that per-channel outliers are spread out, making the tile easier to\n   quantize. The rotation is orthonormal, so attention scores are preserved.\n\n3. **Normalized Cache**: **iterative variance normalization** (Sinkhorn-like)\n   alternates column- and row-wise standard-deviation normalization in log space,\n   equalizing variance across the tile and shrinking quantization error before any\n   rounding happens.\n\n4. **Quantized Cache**: **asymmetric round-to-nearest** at low bit-width, with the\n   scales folded back in at read time (keys per channel, values per token).\n\nThe shipped preset spends **more bits on keys than values** (`kvarn_k4v2_g128`:\n4-bit keys, 2-bit values). We chose to release this configuration because it meets\nthe strictest accuracy bar, matching FP16, that the most demanding production\ndeployments and vLLM require, while still delivering throughput above FP16.\n\n---\n\n## Citation\n\nKVarN is the official vLLM implementation of our paper:\n\n> 📄 *KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation\n> in Reasoning Tasks* ([arXiv:2606.03458](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.03458))\n\nIf you use KVarN, please cite:\n\n```bibtex\n@misc{muller2026kvarn,\n      title={KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks}, \n      author={Lorenz K. Muller and Philippe Bich and Chiara Boretti and Hyun-Min Chang and Jiawei Zhuang and Lukas Cavigelli},\n      year={2026},\n      eprint={2606.03458},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={http:\u002F\u002Farxiv.org\u002Fabs\u002F2606.03458}\n}\n```\n\n---\n\n## License and attribution\n\nKVarN is built on [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) (v0.22.0) and is\nreleased under the Apache 2.0 License. The original vLLM README is preserved as\n[`README_vLLM.md`](README_vLLM.md).\n",2,"2026-06-11 04:10:24","CREATED_QUERY"]