[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82687":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},82687,"OScaR-KV-Quant","ZunhaiSu\u002FOScaR-KV-Quant","ZunhaiSu","🏆 OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond — redefining the accuracy-efficiency Pareto front for X-LLMs KV quantization.","",null,"C++",134,12,5,1,0,27,46.04,"MIT License",false,"main",true,[],"2026-06-12 04:01:38","\u003Ch1 align=\"center\">\n  \u003Cimg src=\"pictures\u002Foscar.png\" width=\"180\">\u003Cbr>\n  OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond\u003Cbr>\n  \u003Csub style=\"color: #FF6B6B; font-family: cursive;\">\n    ⚡ Data-free · Training & Calibration-free · Plug-and-Play for X-LLMs\n  \u003C\u002Fsub>\n\u003C\u002Fh1>\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTHU-Tsinghua%20University-8B5CF6.svg\" alt=\"THU\">\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHKU-The%20University%20of%20Hong%20Kong-2D68C4.svg\" alt=\"HKU\">\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMeituan-LongCat%20Team-22C55E.svg\" alt=\"Team\">\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FUoE-University%20of%20Edinburgh-F4A0B5.svg\" alt=\"UoE\">\u003C\u002Fa>\n  \u003Ca href=\"http:\u002F\u002Farxiv.org\u002Fabs\u002F2605.19660\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.19660-B31B1B.svg\" alt=\"arXiv\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Firidescent-gcrace.github.io\u002FOScaR\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-OScaR-FF6B6B.svg\" alt=\"Website\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## 🔥 Latest News\n\n- **[Upcoming]** 🔧 vLLM & SGLang backend integration — under active development, official support will be announced in future releases.\n\n- **[2026-05-20]** 🎉 Our paper *\"OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond\"* is now available on arXiv! [[Link](http:\u002F\u002Farxiv.org\u002Fabs\u002F2605.19660)]\n\n- **[2026-05-19]** 🚀 Codebase and evaluation suite publicly released.\n\n## 📚 Table of Contents\n\n- [Latest News](#-latest-news)\n- [Overview](#-overview)\n  - [TNI in X-LLMs](#tni-in-x-llms)\n- [Key Features](#-key-features)\n- [Main Results](#-main-results)\n  - [Text-Only LLMs: LongBench-E](#text-only-llms-longbench-e)\n  - [Multi-Modal LLMs: OCRBench](#multi-modal-llms-ocrbench)\n  - [Omni-Modal LLMs: MMAU-Pro](#omni-modal-llms-mmau-pro)\n- [Installation](#-installation)\n- [Quick Start](#-quick-start)\n  - [Smoke Test](#smoke-test)\n  - [Full Benchmark](#full-benchmark)\n  - [Accuracy Evaluation](#accuracy-evaluation-qasper-e)\n  - [Single Example](#single-example)\n- [Citation](#citation)\n- [Acknowledgement](#acknowledgement)\n\n## 📖 Overview\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"pictures\u002Foverview.png\" width=\"90%\">\n\u003C\u002Fdiv>\n\nThe rapid advancement toward **long-context reasoning** and **multi-modal intelligence** has made KV cache memory footprint a dominant bottleneck. 
We revisit the inherent limitations of the established **per-channel quantization paradigm** and identify **Token Norm Imbalance (TNI)** as the primary bottleneck to quantization fidelity.\n\nRather than relying on intricate pipelines, we follow the principle of **Occam's Razor**. We propose **OScaR (Omni-Scaled Canalized Rotation)**, an accurate and lightweight KV cache compression framework for **X-LLMs (text-only, multi-modal, and omni-modal LLMs)**.\n\n### TNI in X-LLMs\n\n\u003Cdiv align=\"center\">\n  \u003Ctable cellpadding=\"15\" cellspacing=\"0\" style=\"border-collapse: collapse; width: 100%;\">\n    \u003Ctr>\n      \u003Ctd width=\"33%\" align=\"center\">\u003Cstrong>Text-Only LLMs\u003C\u002Fstrong>\u003Cbr>\u003Cimg src=\"pictures\u002FLLM-TNI.png\" width=\"95%\">\u003Cbr>\u003Cem>Low-norm outlier tokens\u003Cbr>(Attention Sink tokens)\u003C\u002Fem>\u003C\u002Ftd>\n      \u003Ctd width=\"33%\" align=\"center\">\u003Cstrong>Multi-Modal LLMs\u003C\u002Fstrong>\u003Cbr>\u003Cimg src=\"pictures\u002FMLLM-TNI.png\" width=\"95%\">\u003Cbr>\u003Cem>Large-norm outliers\u003C\u002Fem>\u003C\u002Ftd>\n      \u003Ctd width=\"33%\" align=\"center\">\u003Cstrong>Multi-Modal LLMs\u003C\u002Fstrong>\u003Cbr>\u003Cimg src=\"pictures\u002FMLLM-TNI-2.png\" width=\"95%\">\u003Cbr>\u003Cem>Inter-modality disparities\u003C\u002Fem>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n> TNI is pervasive across X-LLMs. In text-only models, it manifests as low-norm outlier tokens, also known as Attention Sink tokens. In multi-modal settings, TNI exhibits more diverse forms, including large-norm outliers, significant inter-modality disparities, and broader norm variations. 
Additional visualizations and detailed experimental configurations are provided in the paper.\n\n\n## ✨ Key Features\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"pictures\u002Foscar-overview.png\" width=\"70%\">\n\u003C\u002Fdiv>\n\n- 🔍 **Unveils TNI as the structural bottleneck** of per-channel quantization through both empirical and theoretical analysis.\n\n- 🪒 **Streamlined OScaR framework** guided by Occam's Razor — requiring only two essential operations, **Canalized Rotation** and **Omni-Token Scaling**, with no training or calibration overhead.\n\n- 📈 **Redefines the Pareto front** for X-LLMs KV quantization, delivering near-lossless INT2 quantization across diverse benchmarks while maintaining low computational complexity.\n\n- ⚡ **Optimized System Design and CUDA kernels** built on BitDecoding and HadaCore with Tensor Core acceleration, achieving 3.0× decoding speedup, 5.3× memory reduction, and 4.1× throughput increase vs. BF16 FlashDecoding-v2.\n\n\n## 📊 Main Results\n\n### Text-Only LLMs: LongBench-E\n\nOScaR achieves the highest average accuracy among all 2-bit methods on LongBench-E, outperforming KIVI, OTT, QuaRot, and TurboQuant+ across both Llama-3.1-8B and Qwen3-8B.\n\n| Method | Llama-3.1-8B | Qwen3-8B |\n|:-------|:------------:|:--------:|\n| 16-bit Baseline | 41.70 | 49.56 |\n| QuaRot (INT2) | 37.94 | 40.13 |\n| RotateKV (INT2) | 37.98 | 42.95 |\n| KIVI (INT2) | 39.84 | 47.95 |\n| OTT (INT2) | 40.74 | 48.21 |\n| TurboQuant+ (2.5-bit) | 40.03 | 47.56 |\n| **OScaR (INT2)** | **41.75** | **48.74** |\n\n### Multi-Modal LLMs: OCRBench\n\nOn OCRBench, OScaR consistently outperforms other 2-bit methods across LLaVA-v1.6-vicuna-7B, Qwen3-VL-8B, and Qwen3-VL-4B.\n\n| Method | LLaVA-v1.6-7B | Qwen3-VL-8B | Qwen3-VL-4B |\n|:-------|:-------------:|:-----------:|:-----------:|\n| 16-bit Baseline | 536 | 858 | 852 |\n| QuaRot (INT2) | 481 | 722 | 773 |\n| RotateKV (INT2) | 473 | 754 | 638 |\n| KIVI (INT2) | 488 | 851 | 813 |\n| OTT (INT2) | 513 | 850 
| 831 |\n| TurboQuant+ (2.5-bit) | 501 | 847 | 828 |\n| **OScaR (INT2)** | **519** | **856** | **838** |\n\n### Omni-Modal LLMs: MMAU-Pro\n\nOn the challenging MMAU-Pro benchmark for omni-modal understanding, OScaR surpasses both the 16-bit baseline and all quantized methods across open-ended QA, Good Rate, and Audio Instruction Following (AIF).\n\n| Method (Qwen3-Omni-30B-A3B) | Open-ended | Good Rate | AIF |\n|:---------------------------|:----------:|:---------:|:---:|\n| 16-bit Baseline | 66.2 | 27.8 | 87.4 |\n| KIVI (INT2) | 65.8 | 27.0 | 78.2 |\n| OTT (INT2) | 65.8 | 26.9 | 83.9 |\n| TurboQuant+ (2.5-bit) | 66.6 | 27.0 | 79.3 |\n| **OScaR (INT2)** | **67.4** | **29.8** | **88.5** |\n\n> **Note:** Detailed experimental setups and TurboQuant+ implementation details are available in the original paper.\n\n## 🛠️ Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FZunhaiSu\u002FOScaR-KV-Quant.git OScaR\ncd OScaR\n\n# Prerequisite: install `uv` and ensure it is available on PATH.\nuv venv --python 3.10 --seed oscar-env\nsource oscar-env\u002Fbin\u002Factivate\n\n# Required for CUTLASS headers used by oscar_cuda.\ngit submodule update --init --recursive\n\n# flash-attn imports torch and psutil during its build, so they must exist first.\nuv pip install \"torch==2.6.0+cu124\" psutil --index https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\n\n# Install dependencies declared in pyproject.toml, then install the project itself.\nuv sync --active --no-install-project\nuv pip install --no-build-isolation -e .\n```\n> If you clone with `--recursive`, you should still run `git submodule update --init --recursive` before building to ensure `libs\u002Fcutlass` is present.\n>\n> The Python dependency source of truth is `pyproject.toml`. `tool.uv.sources` pins `torch==2.6.0` to the `cu124` PyTorch index, and `tool.uv.no-build-isolation-package` disables build isolation for `flash-attn`. 
The explicit torch\u002Fpsutil bootstrap step is still required because `flash-attn` imports them while building but does not declare them as build dependencies. The editable install uses `--no-build-isolation` because this repository's CUDA extension build imports PyTorch from the active environment.\n\n> Tested Environment:\n> - Python `3.10.17`\n> - PyTorch `2.6.0+cu124`\n> - `flash-attn 2.8.3`\n> - `transformers 5.8.1` for a fresh installation from the current `pyproject.toml`\n\n## 🚀 Quick Start\n\nSet the model path:\n\n```bash\nexport MODEL_PATH=\u002Fpath\u002Fto\u002FQwen3-8B\n```\n\n### Accuracy Evaluation (Qasper-E)\nQuick end-to-end accuracy validation using the Qasper-E benchmark:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 $(which python) eval_longbench.py \\\n  --model_path \"$MODEL_PATH\" \\\n  --datasets qasper_e \\\n  --max_input_len 32768 \\\n  --dtype bfloat16 \\\n  --device cuda:0 \\\n  --offline_v_hadamard \\\n  --output_dir pred_e\u002Foscar-qasper \\\n  --log_every 1 \\\n  --resume\n```\n\n> **Note:** This requires the following data files:\n> - `longbench_data\u002Fdata\u002Fqasper_e.jsonl`\n> - `longbench_config\u002Fdataset2prompt.json`\n> - `longbench_config\u002Fdataset2maxlen.json`\n>\n> The metric helper `longbench_metrics.py` is part of this repository, and its Python dependencies are included in `pyproject.toml`.\n\n\n### Single Example\n\nRun a single inference example with explicit configuration:\n\n```bash\nMODEL_PATH=\"${MODEL_PATH}\" \\\nDTYPE=bfloat16 \\\nNUM_BITS=2 \\\nQUANT_MODE=k-channel \\\nGROUP_SIZE=32 \\\nKV_ROTATION=hadamard \\\nKV_NORM=1 \\\nATTN_BACKEND=oscar \\\nbash evaluation\u002Fscripts\u002Fexample.sh\n```\n\n\n## Citation\n\nIf you find OScaR useful for your research or production, please cite our paper:\n\n```bibtex\n@article{su2026oscar,\n  title={OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond},\n  author={Su, Zunhai and Yang, Rui and Zhang, Chao and Liu, Yaxiu and Zhang, Yifan and Wu, 
Wei and Xiong, Jing and Du, Dayou and Zhuang, Xialie and Qian, Yulei and Xie, Yuchen and Wu, Yik-Chung and Yang, Hongxia and Wong, Ngai},\n  journal={arXiv preprint arXiv:2605.19660},\n  year={2026}\n}\n```\n\n## Acknowledgement\n\nOScaR is inspired by many open-source libraries, including but not limited to [BitDecoding](https:\u002F\u002Fgithub.com\u002FOpenBitSys\u002FBitDecoding), [HadaCore](https:\u002F\u002Fgithub.com\u002Fsegyges\u002Fhadacore), [KIVI](https:\u002F\u002Fgithub.com\u002Fjy-yuan\u002FKIVI), and [SGLang-FluentLLM](https:\u002F\u002Fgithub.com\u002Fmeituan-longcat\u002FSGLang-FluentLLM).\n","OScaR-KV-Quant is a framework for extreme quantization and compression of the key-value (KV) cache in large language models (LLMs) and their extensions. By introducing a method called OScaR (Omni-Scaled Canalized Rotation), it addresses the Token Norm Imbalance (TNI) problem found in existing per-channel quantization schemes without relying on complex pipelines, achieving efficient and accurate KV cache compression. The tool is written in C++ and is data-free, training- and calibration-free, and plug-and-play, which makes it suitable for balancing performance against resource consumption in text-only, multi-modal, and omni-modal LLM settings.",2,"2026-06-11 04:08:56","CREATED_QUERY"]