[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80876":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":12,"lastSyncTime":27,"discoverSource":28},80876,"UniPrefill","qhfan\u002FUniPrefill","qhfan","Implementation of \"UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification\"",null,"Python",39,2,34,0,1,4,5,3,1.43,false,"master",true,[],"2026-06-12 02:04:07","# UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002FXXXX.XXXXX\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-paper-red?logo=arxiv\" alt=\"arXiv\">\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FvLLM-v0.16.0-blue?logo=github\" alt=\"vLLM version\">\u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-green\" alt=\"License\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Abstract\n\nAs large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. 
To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures — such as linear\u002Ffull attention hybrids or sliding window\u002Ffull attention hybrids — these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM.\n\nTo this end, we propose **UniPrefill**, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model's computation at the token level. We further implement UniPrefill as a continuous batching operator and extend vLLM's scheduling strategy to natively support prefill-decode co-processing and tensor parallel for UniPrefill, enabling its seamless integration into vLLM. 
UniPrefill achieves up to **2.1× speedup** in Time-To-First-Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.\n\n---\n\n## Key Features\n\n- **Architecture-agnostic prefill acceleration** — works on full-attention, linear\u002Ffull hybrids, and sliding-window\u002Ffull hybrids, unlike sparse-attention-only approaches\n- **Continuous batching compatible** — implemented as a drop-in batching operator, natively supported by vLLM's scheduling pipeline\n- **Tensor parallel support** — UniPrefill's scheduling strategy is extended to support multi-GPU tensor parallelism\n- **Prefill-decode co-processing** — simultaneous prefill and decode within the same engine, improving GPU utilization\n- **Up to 2.1× TTFT speedup** — gains grow with the number of concurrent requests\n\n---\n\n## Installation\n\nThis implementation is based on **vLLM v0.16.0**.\n\n~~~bash\ngit clone https:\u002F\u002Fgithub.com\u002Fqhfan\u002FUniPrefill.git\npip install -r requirements.txt\ncd UniPrefill\u002Fvllm-releases-v0.16.0\nbash setup.sh\n~~~\n\n> **Note:** We recommend using a clean conda environment with Python 3.10+ and CUDA 12.1+ before running `setup.sh`.\n\n---\n\n## Supported Models\n\n| Model Family | File Modified |\n|---|---|\n| LLaMA-3.1 | `vllm\u002Fmodel_executor\u002Fmodels\u002Fllama.py` |\n| Qwen3-Next | `vllm\u002Fmodel_executor\u002Fmodels\u002Fqwen3_next.py` |\n| Gemma3 | `vllm\u002Fmodel_executor\u002Fmodels\u002Fgemma3.py` |\n\n---\n\n## Code Changes Overview\n\nUniPrefill's modifications to the vLLM codebase are minimal and well-contained. 
The key changed files are listed below:\n\n| File | Description |\n|---|---|\n| `vllm\u002Fmodel_executor\u002Flayers\u002Ffused_top_p_selection_tp_pd.py` | Core UniPrefill operator: block-wise dynamic sparsification with tensor parallel and prefill-decode support |\n| `vllm\u002Fmodel_executor\u002Fmodels\u002Fllama.py` | LLaMA-3.1 model integration |\n| `vllm\u002Fmodel_executor\u002Fmodels\u002Fqwen3_next.py` | Qwen3-Next model integration |\n| `vllm\u002Fmodel_executor\u002Fmodels\u002Fgemma3.py` | Gemma3 model integration |\n| `vllm\u002Fv1\u002Fattention\u002Fbackends\u002Fflash_attn.py` | Modified `FlashAttnImpl.forward` and KV cache update logic |\n| `vllm\u002Fv1\u002Fattention\u002Fbackends\u002Ftriton_attn.py` | Modified `TritonAttnImpl.forward` and KV cache update logic |\n| `vllm\u002Fv1\u002Fworker\u002Fgpu_model_runner.py` | Per-request per-layer sequence length tracking across requests |\n| `vllm\u002Fforward_context.py` | Per-layer token count variable maintained across the forward pass |\n| `native_imp.py` | PyTorch reference implementation for algorithm illustration only; supports only `batch_size == 1` |\n\n---\n\n## ⚠️ Known Issues\n\n> **Warning:** When running with tensor parallelism (`tp > 1`), there is an intermittent bug that may cause the model to output repeated `!` characters (e.g., `!!!!!!!!!!!!!`). The root cause is under investigation. 
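Please use `tp > 1` with caution in production environments and validate outputs when deploying at scale.\n\n---\n\n## Citation\n\nIf you find UniPrefill useful in your research, please cite our paper:\n\n~~~bibtex\n@article{uniprefill2026,\n  title     = {UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification},\n  author    = {Qihang Fan and Huaibo Huang and Zhiying Wu and Bingning Wang and Ran He},\n  journal   = {arXiv preprint arXiv:2605.06221},\n  year      = {2026}\n}\n~~~\n\n---\n\n## Acknowledgements\n\nThis project builds on top of [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm). We thank the vLLM team for their excellent open-source inference engine.\n","UniPrefill is a framework that accelerates long-context prefill through block-wise dynamic sparsification. Its core features include architecture-agnostic prefill acceleration that supports full-attention, linear\u002Ffull hybrid, and sliding-window\u002Ffull hybrid models; compatibility with continuous batching, implemented as a drop-in batching operator natively supported by vLLM's scheduling strategy; and support for multi-GPU tensor parallelism and prefill-decode co-processing, which improves GPU utilization. UniPrefill achieves up to a 2.1× speedup in Time-To-First-Token (TTFT) and is especially suited to workloads with many concurrent requests. The tool is a good fit for large language model deployments that need to process long-context inputs efficiently.","2026-06-11 04:02:39","CREATED_QUERY"]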