[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78256":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":15,"lastSyncTime":29,"discoverSource":30},78256,"OSCAR","FutureMLS-Lab\u002FOSCAR","FutureMLS-Lab","OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization","https:\u002F\u002Foscar-quantize.github.io\u002F",null,"Python",488,72,37,2,0,12,135,458,68,5.59,false,"main",true,[],"2026-06-12 02:03:46","\u003Cp align=\"center\">\n  \u003Cimg src=\"materials\u002Foscar_logo_kv_transparent.png\" alt=\"OSCAR INT2 KV-Cache\" width=\"180\"\u002F>\n\u003C\u002Fp>\n\n# OSCAR\n\n### Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.17757\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-b31b1b?logo=arxiv&logoColor=white\" alt=\"Paper\"\u002F>\n  \u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Foscar-quantize.github.io\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-oscar--quantize-1f77b4?logo=googlechrome&logoColor=white\" alt=\"Website\"\u002F>\n  \u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FZhongzhu\u002FOSCAR-RotationZoo\">\n    \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-RotationZoo-FFD21E\" alt=\"HuggingFace RotationZoo\"\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\nOSCAR captures Q\u002FK\u002FV activations on a small calibration set, estimates **attention-aware K\u002FV covariance structures** offline, and derives per-layer rotations + clipping thresholds that align KV quantization with the directions attention actually consumes. The result is **INT2 storage for the bulk of the KV cache** plus a small BF16 sink + recent window — ~7× compression of the KV-cache memory footprint vs BF16, with single-digit pp accuracy drop on GPQA for the dense reasoning models we validated.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"materials\u002FOSCAR_pipeline.png\" alt=\"OSCAR pipeline\" width=\"720\"\u002F>\n\u003C\u002Fp>\n\nOSCAR is built directly into the open-source SGLang framework: clone the repo,\nset up the single environment, and run the dump, rotation, and evaluation\nscripts end to end. It works out of the box, and we also provide a rotation zoo\nso users can download calibrated rotations directly instead of recomputing them.\n\n## 🔥 Latest News\n- **[Upcoming]** OSCAR is testing minimax-m2.7 and GLM in 200K long horizon Agentic Tasks.  
Happy to see OSCAR being used in the wild!\n- **[2026-05-18]** Full release: [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.17757), code, [website](https:\u002F\u002Foscar-quantize.github.io\u002F), and [RotationZoo](https:\u002F\u002Fhuggingface.co\u002FZhongzhu\u002FOSCAR-RotationZoo) are all live — runs out of the box on SGLang.\n\n## 📖 Table of Contents\n- [Main results](#main-results)\n- [Layout](#layout)\n- [Setup](#setup)\n- [Quick start (Qwen3-8B example)](#quick-start-qwen3-8b-example)\n- [All configured models](#all-configured-models)\n- [How the rotation is fit (spectral covariance)](#how-the-rotation-is-fit-spectral-covariance)\n- [Serving with the rotation](#serving-with-the-rotation)\n- [Calibration knobs](#calibration-knobs)\n- [Citation](#citation)\n- [License & acknowledgements](#license--acknowledgements)\n\n## Main results\n\u003Cdetails>\n\u003Csummary>\u003Cb>Multi-Modal & LongBench\u003C\u002Fb> \u003C\u002Fsummary>\nTo reproduce these results, use the rotation and run script in the `zhongzhu\u002FVL` branch. 
Baseline numbers taken from arxiv.org\u002Fabs\u002F2605.19660 (Su et al., 2026).\n\nOCRBench comparison\n| Method                          | Qwen3-VL-8B | Qwen3-VL-4B |\n|---------------------------------|------------:|------------:|\n| 16-bit Baseline                 | 858         | 852         |\n| QuaRot (INT2)                   | 722         | 773         |\n| RotateKV (INT2)                 | 754         | 638         |\n| KIVI (INT2)                     | 851         | 813         |\n| OTT (INT2)                      | 850         | 831         |\n| TurboQuant+ (2.5-bit)           | 847         | 828         |\n| **OSCAR (Lloyd-Max)**           | **854**     | **848**     |\n\nOmni-Modal LLMs: MMAU-Pro\n\n| Method (Qwen3-Omni-30B-A3B) | Open-ended | Good Rate | AIF |\n|:---------------------------|:----------:|:---------:|:---:|\n| 16-bit Baseline | 66.2 | 27.8 | 87.4 |\n| KIVI (INT2) | 65.8 | 27.0 | 78.2 |\n| OTT (INT2) | 65.8 | 26.9 | 83.9 |\n| TurboQuant+ (2.5-bit) | 66.6 | 27.0 | 79.3 |\n| **OSCAR** | **67.4** | **33.8** | **89.7** |\n\nLongBench-E comparison\n\n| Method                          | Qwen3-8B    |\n|---------------------------------|------------:|\n| 16-bit Baseline                 | 49.56       |\n| QuaRot (INT2)                   | 40.13       |\n| RotateKV (INT2)                 | 42.95       |\n| KIVI (INT2)                     | 47.95       |\n| OTT (INT2)                      | 48.21       |\n| TurboQuant+ (2.5-bit)           | 47.56       |\n| **OSCAR**                       | **50.25**   |\n\u003C\u002Fdetails>\n\n**Setup.** Each cell is the **MEAN across 5 reasoning \u002F coding benchmarks** — **GPQA**, **HumanEval**, **LiveCodeBench v6**, **AIME 25**, **MATH-500**. To control single-seed variance, **every benchmark is evaluated 5 times per (model, method) cell** (3 times for GLM-4.7-FP8) and the per-seed scores are averaged before being averaged across benchmarks. 
TurboQuant rows are single-run (\\*) because its vLLM path is too slow for repeated 32K-context evaluations under our compute budget. All runs use **32K-token max generation length**. **BPE** = effective bits per KV element at 128K context length. Higher is better; the BF16 row is the upper bound.\n\n| Method | BPE | Qwen3-4B&nbsp;Thinking | Qwen3-8B | Qwen3-32B | GLM-4.7-FP8&nbsp;(358B) |\n|:---|:---:|:---:|:---:|:---:|:---:|\n| BF16 (upper bound) | 16.00 | 75.64 | 70.84 | 74.19 | 77.89 |\n| Saw-INT4 | 4.25 | 73.11 | 69.97 | 74.43 | 77.95 |\n| TurboQuant K3V3 \\* | 3.25 | 31.74 | 56.88 | 71.99 | 78.15 |\n| QuaRot-INT2 | 2.25 | 1.40 | 10.14 | 7.90 | 75.14 |\n| Naive INT2 | 2.25 | 0.00 | 0.00 | 0.00 | 60.49 |\n| **OSCAR (ours)** | **2.28** | **71.86** | **69.42** | **74.17** | **78.16** |\n| _Gap of OSCAR vs BF16_ | | _−3.78_ | _−1.42_ | _−0.02_ | _+0.27_ |\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Details for each task \u003C\u002Fb> \u003C\u002Fsummary>\n\u003Cimg width=\"1404\" height=\"1052\" alt=\"image\" src=\"materials\u002Fdetail_table.png\" \u002F>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Baseline notes\u003C\u002Fb> — TurboQuant \u002F QuaRot \u002F Saw-INT4 \u002F Naive INT2 configurations\u003C\u002Fsummary>\n\nFor a fair comparison at a comparable bit-budget, **TurboQuant** results use\nvLLM's implementation\n([docs](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fapi\u002Fvllm\u002Fmodel_executor\u002Flayers\u002Fquantization\u002Fturboquant\u002F))\nmodified so that **all layers are quantized** (no mixed precision); the\noriginal TurboQuant keeps the first, last, and selected middle layers in\nfull precision. We run it in its **K3V3** configuration (3-bit K, 3-bit V)\nto land near the OSCAR bit-budget.\n\n**QuaRot-INT2** is the standard 2-bit KV-quant recipe (data-free Hadamard\nrotation per layer). 
**Saw-INT4** is an INT4 reference for context.\n**Naive INT2** is per-token symmetric INT2 with no rotation.\n\n\\* TurboQuant entries are single-run results because its vLLM path is too\nslow for repeated 32K-context evaluations under our compute budget.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Comparison with other INT2 KV-cache methods on AIME25\u003C\u002Fb>\u003C\u002Fsummary>\n\nMost prior INT2 KV-cache methods do not provide framework-level support for\nefficient long-context generation, so 32K-generation evaluations are extremely\nslow and their papers do not report the full benchmark suite above. For this\nreason, we compare against the reported AIME25 setting where public numbers are\navailable.\n\n| Method | BPE | Qwen3-8B | Qwen3-32B |\n|:---|:---:|:---:|:---:|\n| Original BF16 | 16.00 | 66.00 +\u002F- 7.33 | 72.59 +\u002F- 7.41 |\n| KIVI-KV2 | 2.25 | 52.33 +\u002F- 9.00 | 57.41 +\u002F- 9.26 |\n| KIVI-KV2* | 2.26 | 57.67 +\u002F- 9.00 | 59.05 +\u002F- 12.38 |\n| Kitty | 2.39 | 59.67 +\u002F- 10.33 | 69.26 +\u002F- 9.26 |\n| **OSCAR (ours)** | **2.38** | **66.67 +\u002F- 3.33** | **74.00 +\u002F- 5.48** |\n\nOSCAR is the only INT2 method in this comparison that reaches BF16-level AIME25\naccuracy at 32K generation while staying near a 2-bit KV-cache budget.\n\n\u003C\u002Fdetails>\nOSCAR is the only INT2 method that stays within a few pp of BF16 across\nevery model. QuaRot-INT2 and naive INT2 collapse on reasoning + coding\ntasks. 
Saw-INT4 is a strong INT4 reference, but OSCAR matches or beats it\n**at roughly half the storage** (≈2 bits per KV element).\n\n## Layout\n\n```\nrotation\u002F\n  eval_oscar_gpqa.sh        generic GPQA eval driver\n  eval_oscar_lcb.sh         generic LiveCodeBench v6 (128K) eval driver\n  compute_kv_rotation.py    eigendecomposition + R·H·P_br composition\n  _dump_compat\u002F             sgl_kernel compat shim for dump\n  \u003Cmodel>\u002F\n    save_qkv_\u003Cmodel>.sh     phase 1 — dump\n    compute_rotation.sh     phase 2 — rotation\n    eval_gpqa.sh            phase 3 — GPQA eval\n    eval_lcb.sh             phase 3 — LCB v6 (128K) eval (where applicable)\n    GPQA\u002F\n      seq\u003CT>_prompt\u003CN>_group\u003CG>\u002F\n        qkv_dumps\u002F          dump output\n        rotations\u002F          rotation .pt files\n        _eval_gpqa_oscar\u002F   eval results from this rotation\n        _eval_lcb_v6_128k\u002F  ...\n\nsglang-research\u002F            submodule — INT2 KV eval\nsglang-dump-qkv\u002F            vendored older sglang-fork — QKV dump (loaded via shim)\n```\n\n## Setup\n\n### Requirements\n\n- 1 × H100 80 GB (for 4B\u002F8B), 4 × H100 (for 32B \u002F MiniMax-M2.7), 8 × H100 (for GLM-4.7-FP8)\n- CUDA 12.8 or 12.9 (nvcc on `$PATH`)\n- Python 3.12 + Conda\n- HuggingFace access for the relevant model weights\n\n### Clone\n\n```bash\ngit clone --recursive https:\u002F\u002Fgithub.com\u002FFutureMLS-Lab\u002FOSCAR.git\ncd OSCAR\n```\n\n### Conda env (single env, dump + eval)\n\nOSCAR uses **one** conda env for both dump and eval. 
The dump-side sglang\n(vendored as `sglang-dump-qkv\u002F`) was originally built against an older\n`sgl_kernel`; OSCAR ships a thin `rotation\u002F_dump_compat\u002F` shim that stubs\nthe dropped legacy symbols at import time and falls back to PyTorch for\nthe runtime sampling kernels it references, so a single eval-side env\nsuffices.\n\n```bash\nconda create -n oscar python=3.12 -y\nconda activate oscar\n\n# Eval-side sglang (editable so future patches stick)\npip install -e sglang-research\u002Fpython\n\n# CUDA-12.8\u002F12.9 compatible flashinfer + sgl_kernel build\n# (see https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang for matching wheels)\n```\n\nIf `nvcc` and PyTorch's CUDA versions diverge (e.g. nvcc 12.6 but torch\nbuilt for 12.8), the JIT kernels in flashinfer may fail to compile. Pin\n`CUDA_HOME` to the matching `cuda-12.x` directory before launching.\n\n## Quick start (Qwen3-8B example)\n\nEnd-to-end on a single H100, ~20 minutes total.\n\n```bash\ncd OSCAR\n\n# Phase 1 — dump Q\u002FK\u002FV (TP=1, default DUMP_KVCACHE_TOKENS=30000)\nbash rotation\u002Fqwen3-8B\u002Fsave_qkv_8b.sh\n# → writes rotation\u002Fqwen3-8B\u002FGPQA\u002Fseq30000_prompt\u003CN>_group128\u002Fqkv_dumps\u002F\n\n# Phase 2 — fit the calibrated rotation\nbash rotation\u002Fqwen3-8B\u002Fcompute_rotation.sh\n# → writes rotation\u002Fqwen3-8B\u002FGPQA\u002Fseq30000_prompt\u003CN>_group128\u002Frotations\u002F{k,v}_rotation_qqt_r_h_pbr.pt\n\n# Phase 3 — GPQA eval against the rotation we just produced\nROT_DIR=rotation\u002Fqwen3-8B\u002FGPQA\u002Fseq30000_prompt\u003CN>_group128\u002Frotations \\\n  bash rotation\u002Fqwen3-8B\u002Feval_gpqa.sh\n# → writes results to rotation\u002Fqwen3-8B\u002FGPQA\u002Fseq30000_prompt\u003CN>_group128\u002F_eval_gpqa_oscar\u002F\n```\n\nPick the actual `seq...prompt..._group...` tag printed by phase 1, or:\n\n```bash\nROT_DIR=$(ls -1d rotation\u002Fqwen3-8B\u002FGPQA\u002Fseq*_prompt*_group*\u002Frotations | tail -1) \\\n  bash 
rotation\u002Fqwen3-8B\u002Feval_gpqa.sh\n```\n\n## All configured models\n\n| Folder | HF model | TP (dump) | TP (eval) | Notes |\n|---|---|---|---|---|\n| `rotation\u002Fqwen3-4B-thinking-2507\u002F` | `Qwen\u002FQwen3-4B-Thinking-2507` | 1 | 1 | thinking model |\n| `rotation\u002Fqwen3-8B\u002F` | `Qwen\u002FQwen3-8B` | 1 | 1 | |\n| `rotation\u002Fqwen3-32B\u002F` | `Qwen\u002FQwen3-32B` | 2-4 | 4 | |\n| `rotation\u002FMiniMax-M2.7\u002F` | `MiniMaxAI\u002FMiniMax-M2.7` | 4 | 4 | FP8 weights, `--reasoning-parser minimax-append-think` |\n| `rotation\u002FGLM-4.7\u002F` | `zai-org\u002FGLM-4.7-FP8` | 8 | 8 | FP8 weights, 92 layers |\n\n## How the rotation is fit (spectral covariance)\n\nFor each transformer layer, given calibration `(Q, K, V)` activations, OSCAR estimates two attention-aware **covariance** matrices and uses their eigenspectra to derive rotations:\n\n- **K covariance** (`qqt`) — average attention-query covariance seen by K:\n  `Σ_K = (1\u002FH_kv) · Σ_h Q_h^T Q_h \u002F n_tokens` (GQA-aware: query heads grouped under the matching KV head)\n- **V covariance** (`sst`) — score-weighted V-side covariance:\n  `Σ_V = (1\u002FH_kv) · Σ_h V_h^T diag(w_h) V_h \u002F n_tokens` where `w_h[t] = K_h[t] · (Q^T Q) · K_h[t]^T` is the per-token attention-score weight derived from K and the Q covariance\n- `torch.linalg.eigh(Σ)` → orthogonal eigenvectors `R` plus the eigenvalues (used for ordering, not for scaling)\n- Composition `r_h_pbr`: `R_loaded = R · H_d · P_br`\n  - `H_d` — head-dim Hadamard\n  - `P_br` — bit-reversal permutation, sorted by eigenvalue magnitude; this interleaves high-variance directions evenly across quant groups so no single group concentrates outliers\n\nSaved as fp32 per-layer `(head_dim, head_dim)` orthogonal matrices in\n`\u003Ccalib_dir>\u002Frotations\u002F{k,v}_rotation_qqt_r_h_pbr.pt`.\n\n## Serving with the rotation\n\nThe eval driver `eval_oscar_gpqa.sh` and `eval_oscar_lcb.sh` set everything for you. 
The underlying sglang server flags are:\n\n```bash\nSGLANG_ENABLE_MIXED_KV_WINDOWS=1 \\\nSGLANG_OSCAR_K_ROTATION_PATH=...\u002Fk_rotation_qqt_r_h_pbr.pt \\\nSGLANG_OSCAR_V_ROTATION_PATH=...\u002Fv_rotation_sst_r_h_pbr.pt \\\nSGLANG_OSCAR_K_CLIP_RATIO=0.96 \\\nSGLANG_OSCAR_V_CLIP_RATIO=0.92 \\\nSGLANG_OSCAR_ABSORB_V_ROTATION=1 \\\nSGLANG_MIXED_KV_PREFIX_TOKENS=64 \\\nSGLANG_MIXED_KV_RECENT_TOKENS=256 \\\nSGLANG_MIXED_KV_HP_MAX_SPLITS=8 \\\nSGLANG_MIXED_KV_HP_DTYPE=bfloat16 \\\nSGLANG_MIXED_KV_SCALE_DTYPE=float32 \\\npython -m sglang.launch_server \\\n  --model-path \u003Cmodel> \\\n  --tensor-parallel-size \u003Ctp> \\\n  --kv-cache-dtype int2 \\\n  --kv-cache-quant-group-size 128 \\\n  --prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --trust-remote-code\n```\n\nSink (`PREFIX_TOKENS`) and recent window (`RECENT_TOKENS`) stay in BF16; the rest of the cache is INT2-quantized into 128-element groups along head-dim.\n\n## Calibration knobs\n\nOverride per `bash rotation\u002F\u003Cmodel>\u002Fsave_qkv_\u003Cmodel>.sh ENV=val`:\n\n| Env | Default | Effect |\n|---|---|---|\n| `DUMP_KVCACHE_TOKENS` | 30000 | Total token budget for calibration |\n| `GROUP_SIZE` | 128 | KV quant group size, encoded in output dir name |\n| `DATASET` | GPQA | Calibration dataset name |\n| `MODEL` | per-model HF id | HuggingFace model id |\n| `TP_SIZE` | per-model | Tensor parallel size for dump |\n| `GPU` | per-model | CUDA_VISIBLE_DEVICES |\n| `HF_HOME` | `\u002Fshared\u002Fhuggingface` | HF cache (set to `$HOME\u002F.cache\u002Fhuggingface` on a fresh machine) |\n\n## Citation\n\n```bibtex\n@misc{zhou2026oscarofflinespectralcovarianceaware,\n      title={OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},\n      author={Zhongzhu Zhou and Donglin Zhuang and Jisen Li and Ziyan Chen and Shuaiwen Leon Song and Ben Athiwaratkun and Xiaoxia Wu},\n      year={2026},\n      eprint={2605.17757},\n      archivePrefix={arXiv},\n      
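primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.17757},\n}\n```\n\n## License & acknowledgements\n\n- Released under the MIT License.\n- Built on top of [sglang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang).\n","OSCAR is an offline spectral covariance-aware rotation technique for 2-bit KV cache quantization. The project captures Q\u002FK\u002FV activations on a small calibration set, estimates attention-aware K\u002FV covariance structures, and derives per-layer rotations and clipping thresholds so that KV quantization aligns with the directions attention actually uses. Its core features are INT2 storage for most of the KV cache plus a small BF16 portion, which compresses the KV-cache memory footprint by about 7x relative to BF16 while keeping the accuracy drop to single-digit percentage points. It suits scenarios that need memory-efficient large-model inference, such as long-sequence tasks or multimodal processing. The project is written in Python and integrated directly into the open-source SGLang framework, so users can set up the environment and run the provided scripts with little effort.","2026-06-11 03:56:41","CREATED_QUERY"]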