[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80949":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":11,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":11,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":18,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":14,"lastSyncTime":23,"discoverSource":24},80949,"Mix-Quant","haiquanlu\u002FMix-Quant","haiquanlu",null,"Python",33,1,31,0,2,0.9,false,"main",true,[],"2026-06-12 02:04:08","\u003Cdiv align=\"center\">\n  \u003Ch1> Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs \u003C\u002Fh1>\n\n  \u003Cdiv>\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.20315\">\n      \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Arxiv-darkred.svg\" alt=\"Paper\">\n    \u003C\u002Fa>\n    \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fhaiquanlu.github.io\u002FMix-Quant\u002F\">\n      \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-2f80ed.svg\" alt=\"Project Page\"\u002F>\n    \u003C\u002Fa>\n  \u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n\u003C!-- \u003Cp>Agentic LLM inference is highly input-heavy, creating substantial prefilling overhead; Mix-Quant accelerates this bottleneck with NVFP4 prefilling while preserving BF16 decoding quality.\u003C\u002Fp> -->\n\n\n![intro](assets\u002Fintro.png)\n\n> [**Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.20315)   \n> *[Haiquan Lu](https:\u002F\u002Fgithub.com\u002Fhaiquanlu), [Zigeng 
Chen](https:\u002F\u002Fczg1225.github.io\u002Fchenzigeng99\u002F), [Gongfan Fang](https:\u002F\u002Ffangggf.github.io\u002F), [Xinyin Ma](https:\u002F\u002Fhorseee.github.io\u002F), [Xinchao Wang](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fsitexinchaowang\u002F)*    \n> *[xML Lab](https:\u002F\u002Fsites.google.com\u002Fview\u002Fxml-nus), National University of Singapore*\n\n------------------\n\n## Introduction\n\nAgentic LLM workflows repeatedly process long contexts from tools, memory, retrieval, and reasoning traces, making prefilling a key inference bottleneck. However, applying low-bit quantization throughout inference can degrade generation quality due to error accumulation.\n**Mix-Quant** addresses this with a phase-aware inference strategy: it applies high-throughput **NVFP4 quantization** to the compute-intensive prefilling stage, while keeping autoregressive decoding in **BF16** for stable and reliable generation. This design accelerates long-context agentic inference while largely preserving downstream task performance.\n\n\n\u003C!-- ![figure](assets\u002Fintro.png) -->\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fframework.png\" width=\"85%\" \u002F>\n  \u003Cbr>\n  \u003Cem>Overview of Mix-Quant.\u003C\u002Fem>\n\u003C\u002Fdiv>\n\n\n## Installation\n\n```bash\n# Create a new conda environment\nconda create -n mix-quant python=3.12 -y\nconda activate mix-quant\n\n# Clone the repository with submodules\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002Fhaiquanlu\u002FMix-Quant.git\ncd Mix-Quant\u002Fvllm\n\n# Install the modified vLLM\n# Note: Mix-Quant is implemented on top of a modified vLLM fork, \n# included as a Git submodule for reproducibility.\n# Option 1: Install with the pre-compiled vLLM wheel.\n# Recommended if the pre-compiled vLLM wheel is compatible with your environment.\nexport VLLM_PRECOMPILED_WHEEL_COMMIT=28ee78af543c563a2fbf78829a7688120e4e4eb5\nVLLM_USE_PRECOMPILED=1 pip install --editable 
.\n# Option 2: Build vLLM from source.\n# Do NOT run this command if you have already installed vLLM with Option 1.\n# pip install --editable .\n\n# Install other packages\ncd ..\npip install -r requirements.txt\n```\n\n\n## Quick Start\nMix-Quant uses a prefill-decode disaggregated serving pipeline. The script below launches a quantized prefill server, a BF16 decode server, and a lightweight proxy server. After the proxy is ready, users can send standard OpenAI-compatible requests to `http:\u002F\u002Flocalhost:8595\u002Fv1`.\n\n### 1. Start the serving pipeline\n\n```bash\n# Run from the repository root.\nbash scripts\u002Frun_server_qwen3.sh \\\n  --prefill-model-name RedHatAI\u002FQwen3-8B-NVFP4 \\\n  --decode-model-name Qwen\u002FQwen3-8B \\\n  --prefill-gpu 0 \\\n  --decode-gpu 1 \\\n  --tensor-parallel-size 1 \\\n  --max-model-length 131072 \\\n  --proxy-port 8595\n```\n\n### 2. Send a request to the proxy\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    api_key=\"EMPTY\",\n    base_url=\"http:\u002F\u002Flocalhost:8595\u002Fv1\",\n)\n\nresponse = client.chat.completions.create(\n    model=\"Qwen\u002FQwen3-8B\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": \"Explain the key idea of Mix-Quant in one sentence.\",\n        }\n    ],\n)\n\nprint(response.choices[0].message.content)\n```\n\n## Evaluation\n\nThe public evaluation entry points are in `scripts\u002F`. Start the serving pipeline first, then run the benchmark scripts from the repository root.\n### 1. 
Evaluation on Reasoning Benchmarks\n\nSupported datasets are `math500`, `aime24`, `aime25`, and `gsm8k`.\n\nStart the server with native context settings by clearing `--hf-overrides`:\n\n```bash\nbash scripts\u002Frun_server_qwen3.sh \\\n  --prefill-model-name RedHatAI\u002FQwen3-8B-NVFP4 \\\n  --decode-model-name Qwen\u002FQwen3-8B \\\n  --prefill-gpu 0 \\\n  --decode-gpu 1 \\\n  --tensor-parallel-size 1 \\\n  --max-model-length 40960 \\\n  --hf-overrides ''\n```\n\nThen run the evaluation script:\n\n```bash\n# Run the default reasoning set: math500, aime24, aime25.\nbash scripts\u002Feval_qwen3_reasoning.sh \\\n  --seed 42 \\\n  --max-concurrent-requests 32\n```\n\nResults are saved to `evaluation\u002Freasoning\u002Fresults\u002FQwen3-8B\u002Fthinking\u002F`.\n\n### 2. Evaluation on LongBench-v2 Benchmark\n\nThe LongBench-v2 script uses the `Qwen3-8B` model key from `evaluation\u002Flongbench-v2\u002Fconfig\u002F`.\n\n```bash\nbash scripts\u002Feval_qwen3_longbench-v2.sh \\\n  --seed 42 \\\n  --save-dir results\u002Fqwen3-8b\n```\n\nPredictions and per-example correctness are written as JSONL files under `evaluation\u002Flongbench-v2\u002Fresults\u002F`.\n\n### 3. Evaluation on LongMemEval Benchmark\n\nPrepare the LongMemEval data file first:\n\n```bash\nmkdir -p evaluation\u002FLongMemEval\u002Fdata\u002F\ncd evaluation\u002FLongMemEval\u002Fdata\u002F\nwget https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fxiaowu0162\u002Flongmemeval-cleaned\u002Fresolve\u002Fmain\u002Flongmemeval_s_cleaned.json\n# Return to the repository root before running the evaluation scripts.\ncd ..\u002F..\u002F..\n```\n\nThen run generation:\n\n```bash\nbash scripts\u002Feval_qwen3_longmemeval.sh \\\n  --data-file data\u002Flongmemeval_s_cleaned.json \\\n  --seed 42 \\\n  --save-dir results\u002Fqwen3-8b\n```\n\nThe generation outputs are saved under `evaluation\u002FLongMemEval\u002Fresults\u002F`. LongMemEval QA scoring uses an LLM judge. 
To run judging in the same command, install the optional judge dependencies, set `OPENAI_API_KEY` and optionally `OPENAI_BASE_URL`, then pass a supported judge model:\n\n```bash\npip install -r evaluation\u002FLongMemEval\u002Frequirements.txt\nexport OPENAI_API_KEY=your_api_key\nbash scripts\u002Feval_qwen3_longmemeval.sh \\\n  --data-file data\u002Flongmemeval_s_cleaned.json \\\n  --judge-model gpt-4o\n```\n\nThe judge output is written next to the prediction file with the `.eval-results-\u003Cjudge-model>` suffix.\n\n## Citation\n\n```\n@article{lu2026mixquant,\n  title={Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs},\n  author={Lu, Haiquan and Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},\n  journal={arXiv preprint arXiv:2605.20315},\n  year={2026}\n}\n```\n\n## Acknowledgements\n\nThis project builds on several excellent open-source efforts. We sincerely thank the community for their contributions:\n- [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n- [llm-compressor](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fllm-compressor)\n- [evalscope](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fevalscope)\n- [FP-Quant](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002FFP-Quant)\n","Mix-Quant is a mixed-precision quantized inference framework designed for agentic large language models (LLMs) that accelerates long-context processing through quantized prefilling and high-precision decoding. Its core feature is using NVFP4 quantization to accelerate the compute-intensive prefilling stage while keeping autoregressive decoding in BF16 to preserve generation quality. This phase-aware strategy improves inference speed while maintaining downstream task performance. It suits scenarios that frequently process long text inputs, such as complex dialogue systems involving tool interaction, memory retrieval, and logical reasoning. The project is implemented in Python and provides a detailed installation guide and quick-start tutorial.","2026-06-11 04:02:58","CREATED_QUERY"]