[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71173":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":41,"readmeContent":42,"aiSummary":43,"trendingCount":16,"starSnapshotCount":16,"syncStatus":44,"lastSyncTime":45,"discoverSource":46},71173,"Awesome-LLM-Inference","xlite-dev\u002FAwesome-LLM-Inference","xlite-dev","📚A curated list of Awesome LLM\u002FVLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8\u002F4, Parallelism, etc.🎉","",null,"Python",5282,385,134,3,0,9,21,70,27,38.76,"GNU General Public License v3.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40],"awesome-llm","deepseek","deepseek-r1","deepseek-v3","flash-attention","flash-attention-3","flash-mla","llm-inference","minimax-01","mla","paged-attention","qwen3","tensorrt-llm","vllm","2026-06-12 02:02:48","\n\u003Cdiv align='center'>\n  \u003Cimg src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffcd83ff2-7ace-4fb5-8d3b-3ccbc1ecbf87 width=250px >\n\u003C\u002Fdiv>\n\n\u003Cdiv align='center'>\n  \u003Cimg src=https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg >\n  \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdownloads\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Ftotal?color=ccf&label=downloads&logo=github&logoColor=lightgrey >\n  \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002FAwesome-LLM-Inference.svg?style=social 
>\n  \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRelease-v2.6-brightgreen.svg >\n  \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-GPLv3.0-turquoise.svg >\n \u003C\u002Fdiv>\n\n## 📒Introduction\nAwesome-LLM-Inference: A curated list of [📙Awesome LLM Inference Papers with Codes](#paperlist). For Awesome Diffusion Inference, please check 📖[Awesome-DiT-Inference](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-DiT-Inference)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002FAwesome-DiT-Inference.svg?style=social). For CUDA learning notes, please check 📖[LeetCUDA](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FLeetCUDA)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002FLeetCUDA.svg?style=social).\n\n## 📖 News 🔥🔥\n\u003Cdiv id=\"news\">\u003C\u002Fdiv>\n\n- [2026\u002F03] Cache-DiT **[🎉v1.3.0](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit)** is released. Major updates include: [Ring](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL) Attention with [batched P2P](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL), [USP](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL\u002F) (Hybrid Ring and Ulysses), Hybrid 2D and 3D Parallelism (💥[USP + TP](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FHYBRID_PARALLEL\u002F)), and reduced VAE-P communication overhead.\n\n![arch](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit\u002Fraw\u002Fmain\u002Fassets\u002Farch_v2.png)\n\n## ©️Citations\n\n```BibTeX\n@misc{Awesome-LLM-Inference@2024,\n  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},\n  url={https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference},\n  note={Open-source software available at 
https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference},\n  author={xlite-dev, liyucheng09 etc},\n  year={2024}\n}\n```\n\n## 🎉Awesome LLM Inference Papers with Codes\n\n[Awesome LLM Inference for Beginners.pdf](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Freleases\u002Fdownload\u002Fv0.3\u002FAwesome-LLM-Inference-v0.3.pdf.zip): 500 pages, FastServe, FlashAttention 1\u002F2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8\u002F4, Continuous Batching, ZeroQuant 1\u002F2\u002FFP, AWQ etc.\n\n\u003Cdiv align='center'>\n\u003Cimg src=https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fassets\u002F31974251\u002F0ed77e9d-a1eb-4095-9a82-bad624964e55 >\n\u003C\u002Fdiv>\n\n## 🎉Download All PDFs\n```bash\npython3 download_pdfs.py # The code is generated by Doubao AI\n```\n\u003Cimg width=\"1267\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F5fbf20a0-2998-46eb-bba1-ea7e7b8ecfd5\" \u002F>\n\n\u003Cdiv id=\"paperlist\">\u003C\u002Fdiv>\n\n## 📖Contents\n* 📖[Trending LLM\u002FVLM Topics](#Trending-LLM-VLM-Topics)🔥🔥🔥\n* 📖[DeepSeek\u002FMLA Topics](#mla)🔥🔥🔥\n* 📖[Multi-GPUs\u002FMulti-Nodes Parallelism](#DP-MP-PP-TP-SP-CP)🔥🔥🔥\n* 📖[Disaggregating Prefill and Decoding](#P-D-Disaggregating)🔥🔥🔥\n* 📖[LLM Algorithmic\u002FEval Survey](#LLM-Algorithmic-Eval-Survey)\n* 📖[LLM Train\u002FInference Framework\u002FDesign](#LLM-Train-Inference-Framework)\n* 📖[Weight\u002FActivation Quantize\u002FCompress](#Weight-Activation-Quantize-Compress)🔥\n* 📖[Continuous\u002FIn-flight Batching](#Continuous-In-flight-Batching)\n* 📖[IO\u002FFLOPs-Aware\u002FSparse Attention](#IO-FLOPs-Aware-Attention-Sparse)🔥\n* 📖[KV Cache Scheduling\u002FQuantize\u002FDropping](#KV-Cache-Scheduling-Quantize-Dropping)🔥\n* 📖[Prompt\u002FContext Compression](#Context-Compression)🔥\n* 📖[Long Context Attention\u002FKV Cache Optimization](#Long-Context-Attention-KVCache)🔥🔥\n* 
📖[Early-Exit\u002FIntermediate Layer Decoding](#Early-Exit)\n* 📖[Parallel Decoding\u002FSampling](#Parallel-Decoding-Sampling)🔥\n* 📖[Structured Prune\u002FKD\u002FWeight Sparse](#Structured_Pruning_KD_Weight_Sparse)\n* 📖[Mixture-of-Experts(MoE) LLM Inference](#Mixture_of_Experts_LLM_Inference)🔥\n* 📖[CPU\u002FNPU\u002FFPGA\u002FMobile Inference](#CPU-Single-GPU-Inference)\n* 📖[Non Transformer Architecture](#Non-Transformer-Architecture)🔥\n* 📖[GEMM\u002FTensor Cores\u002FWMMA\u002FParallel](#GEMM-Tensor-Cores-WMMA)\n* 📖[VLM\u002FPosition Embed\u002FOthers](#Others)\n* 📖[LLM Inference Applications](#LLM-Inference-Applications)\n\n### 📖Trending LLM\u002FVLM Topics ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Trending-LLM-VLM-Topics\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n| 2026.03 | 🔥🔥🔥[**OneComp**] OneComp: One-Line Revolution for Generative AI Model Compression(@Fujitsu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.28845) | [[OneCompression]](https:\u002F\u002Fgithub.com\u002FFujitsuResearch\u002FOneCompression) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFujitsuResearch\u002FOneCompression.svg?style=social) | ⭐️⭐️ |\n| 2025.12 | 🔥🔥[**QEP**] QEP: Quantization Error Propagation, NeurIPS 2025(@Fujitsu) | [[pdf]](https:\u002F\u002Fopenreview.net\u002Fpdf?id=a3l3K9khbL) | [[OneCompression]](https:\u002F\u002Fgithub.com\u002FFujitsuResearch\u002FOneCompression) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFujitsuResearch\u002FOneCompression.svg?style=social) | ⭐️⭐️ |\n|2024.04| 🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech)|[[docs]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora\u002Fblob\u002Fmain\u002Fdocs\u002Fzh_CN\u002FREADME.md) | [[Open-Sora]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhpcaitech\u002FOpen-Sora.svg?style=social)| ⭐️⭐️ 
|\n|2024.04| 🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aims to reproduce Sora (OpenAI T2V model)(@PKU)|[[report]](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan\u002Fblob\u002Fmain\u002Fdocs\u002FReport-v1.0.0.md) | [[Open-Sora-Plan]](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FOpen-Sora-Plan.svg?style=social)| ⭐️⭐️ |\n|2024.05| 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04434) | [[DeepSeek-V2]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V2.svg?style=social)| ⭐️⭐️ |\n|2024.05|🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05254) | [[unilm-YOCO]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002FYOCO) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥[**Mooncake**] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) |[[pdf]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake\u002Fblob\u002Fmain\u002FMooncake-v3.pdf) | [[Mooncake]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkvcache-ai\u002FMooncake.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**FlashAttention-3**] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) |[[pdf]](https:\u002F\u002Ftridao.me\u002Fpublications\u002Fflash3\u002Fflash3.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**MInference 1.0**] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02490)|[[MInference 1.0]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference.svg?style=social)|⭐️⭐️ |\n|2024.11|🔥🔥🔥[**Star-Attention: 11x~ speedup**] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17116)|[[Star-Attention]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FStar-Attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FStar-Attention.svg?style=social)|⭐️⭐️ |\n|2024.12|🔥🔥🔥[**DeepSeek-V3**] DeepSeek-V3 Technical Report(@deepseek-ai) | [[pdf]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3\u002Fblob\u002Fmain\u002FDeepSeek_V3.pdf) | [[DeepSeek-V3]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V3.svg?style=social) | ⭐️⭐️ |\n|2025.01|🔥🔥🔥 [**MiniMax-Text-01**] MiniMax-01: Scaling Foundation Models with Lightning Attention | [[report]](https:\u002F\u002Ffilecdn.minimax.chat\u002F_Arxiv_MiniMax_01_Report.pdf) | [[MiniMax-01]](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-01) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-01.svg?style=social) | ⭐️⭐️ |\n|2025.01|🔥🔥🔥[**DeepSeek-R1**] DeepSeek-R1 Technical Report(@deepseek-ai) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12948v1) | [[DeepSeek-R1]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1.svg?style=social) | ⭐️⭐️ |\n\n### 📖DeepSeek\u002FMulti-head Latent Attention(MLA) ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"mla\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2024.05| 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04434) | [[DeepSeek-V2]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V2.svg?style=social)| ⭐️⭐️ |\n|2024.12|🔥🔥🔥[**DeepSeek-V3**] DeepSeek-V3 Technical Report(@deepseek-ai) | [[pdf]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3\u002Fblob\u002Fmain\u002FDeepSeek_V3.pdf) | [[DeepSeek-V3]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V3.svg?style=social) | ⭐️⭐️ |\n|2025.01|🔥🔥🔥[**DeepSeek-R1**] DeepSeek-R1 Technical Report(@deepseek-ai) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12948v1) | [[DeepSeek-R1]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1.svg?style=social) | ⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DeepSeek-NSA**] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention(@deepseek-ai)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.11089)| ⚠️|⭐️⭐️ |\n|2025.02|🔥🔥🔥[**FlashMLA**] DeepSeek FlashMLA(@deepseek-ai)|⚠️| [[FlashMLA]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FFlashMLA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FFlashMLA.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DualPipe**] DeepSeek DualPipe(@deepseek-ai)|⚠️| 
[[DualPipe]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDualPipe) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDualPipe.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DeepEP**] DeepSeek DeepEP(@deepseek-ai)|⚠️| [[DeepEP]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepEP.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DeepGEMM**] DeepSeek DeepGEMM(@deepseek-ai)|⚠️| [[DeepGEMM]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepGEMM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepGEMM.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**EPLB**] DeepSeek EPLB(@deepseek-ai)|⚠️| [[EPLB]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FEPLB) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FEPLB.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**3FS**] DeepSeek 3FS(@deepseek-ai)|⚠️| [[3FS]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002F3FS) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002F3FS.svg?style=social) |⭐️⭐️ |\n|2025.03|🔥🔥🔥[**Inference System**] DeepSeek-V3 \u002F R1 Inference System Overview (@deepseek-ai) | [[blog]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F27181462601) | ⚠️|⭐️⭐️ |\n|2025.02|🔥🔥[**MHA2MLA**] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs(@fudan.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.14837)| [[MHA2MLA]](https:\u002F\u002Fgithub.com\u002FJT-Ushio\u002FMHA2MLA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJT-Ushio\u002FMHA2MLA.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥[**TransMLA**] TransMLA: Multi-head Latent Attention Is All You Need(@PKU)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.07864)|[[TransMLA]](https:\u002F\u002Fgithub.com\u002Ffxmeng\u002FTransMLA) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffxmeng\u002FTransMLA.svg?style=social) | ⭐️⭐️ |\n|2025.03|🔥🔥[**X-EcoMLA**] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression(@AMD)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.11132) |⚠️|⭐️⭐️ |\n\n### 📖Multi-GPUs\u002FMulti-Nodes Parallelism ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"DP-MP-PP-TP-SP-CP\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2019.10|🔥🔥[**MP: ZeRO**] DeepSpeed-ZeRO: Memory Optimizations Toward Training Trillion Parameter Models(@microsoft.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.02054)|  [[deepspeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2020.05|🔥🔥[**TP: Megatron-LM**] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1909.08053.pdf)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2022.05|🔥🔥[**SP: Megatron-LM**] Megatron-LM: Reducing Activation Recomputation in Large Transformer Models(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05198)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2023.05|🔥🔥[**SP: BPT**] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19370)| [[RingAttention]](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flhao499\u002FRingAttention.svg?style=social)|⭐️⭐️ 
|\n|2023.10|🔥🔥[**SP: Ring Attention**] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01889.pdf)| [[RingAttention]](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flhao499\u002FRingAttention.svg?style=social)|⭐️⭐️ |\n|2023.11|🔥🔥[**SP: STRIPED ATTENTION**] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09431.pdf) |[[striped_attention]](https:\u002F\u002Fgithub.com\u002Fexists-forall\u002Fstriped_attention\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fexists-forall\u002Fstriped_attention.svg?style=social) |⭐️⭐️ |\n|2023.10|🔥🔥[**SP: DEEPSPEED ULYSSES**] DEEPSPEED ULYSSES: SYSTEM OPTIMIZATIONS FOR ENABLING TRAINING OF EXTREME LONG SEQUENCE TRANSFORMER MODELS(@microsoft.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14509)|  [[deepspeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥🔥[**CP: Megatron-LM**] Megatron-LM: Context parallelism overview(@NVIDIA)|[[docs]](https:\u002F\u002Fdocs.nvidia.com\u002Fmegatron-core\u002Fdeveloper-guide\u002Flatest\u002Fapi-guide\u002Fcontext_parallel.html)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2024.05|🔥🔥[**SP: Unified Sequence Parallel (USP)**] YunChang: A Unified Sequence Parallel (USP) Attention for Long Context LLM Model Training and Inference(@Tencent)|[[pdf]]()|[[long-context-attention]](https:\u002F\u002Fgithub.com\u002Ffeifeibear\u002Flong-context-attention) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffeifeibear\u002Flong-context-attention.svg?style=social)|⭐️⭐️ |\n|2024.11|🔥🔥[**CP: Meta**] Context Parallelism for Scalable Million-Token Inference(@Meta Platforms, Inc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.01783)| ⚠️|⭐️⭐️ |\n|2024.11|🔥🔥[**TP: Comm Compression**] Communication Compression for Tensor Parallel LLM Inference(@recogni.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.09510)| ⚠️|⭐️⭐️ |\n|2024.11|🔥🔥🔥[**SP: Star-Attention, 11x~ speedup**] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17116)|[[Star-Attention]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FStar-Attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FStar-Attention.svg?style=social)|⭐️⭐️ |\n|2024.12|🔥🔥[**SP: TokenRing**] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication(@SJTU) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.20501)|[[token-ring]](https:\u002F\u002Fgithub.com\u002FACA-Lab-SJTU\u002Ftoken-ring) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FACA-Lab-SJTU\u002Ftoken-ring.svg?style=social)|⭐️⭐️ |\n|2025.05|🔥🔥[**FSDP 1\u002F2**] PyTorch FSDP: Getting Started with Fully Sharded Data Parallel(FSDP) (@pytorch) | [[docs]](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002FFSDP_tutorial.html#getting-started-with-fully-sharded-data-parallel-fsdp) | ⚠️ |⭐️⭐️ |\n\n\n### 📖Disaggregating Prefill and Decoding ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"P-D-Disaggregating\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2024.01|🔥🔥[**DistServe**] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model 
Serving(@PKU)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.09670)|[[DistServe]](https:\u002F\u002Fgithub.com\u002FLLMServe\u002FDistServe) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLMServe\u002FDistServe.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥🔥[**Mooncake**] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) |[[pdf]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake\u002Fblob\u002Fmain\u002FMooncake-v1.pdf) |[[Mooncake]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkvcache-ai\u002FMooncake.svg?style=social)|⭐️⭐️ |\n|2024.12|🔥🔥[**KVDirect**] KVDirect: Distributed Disaggregated LLM Inference(@ByteDance)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14743)|⚠️|⭐️ |\n|2025.01|🔥🔥[**DeServe**] DESERVE: TOWARDS AFFORDABLE OFFLINE LLM INFERENCE VIA DECENTRALIZATION(@Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14784)|⚠️|⭐️ |\n|2025.04|🔥🔥[**MegaScale-Infer**] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism(@ByteDance Seed) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.02263) |⚠️|⭐️ |\n\n### 📖LLM Algorithmic\u002FEval Survey ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"LLM-Algorithmic-Eval-Survey\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.10|[Evaluating] Evaluating Large Language Models: A Comprehensive Survey(@tju.edu.cn)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.19736.pdf)|[[Awesome-LLMs-Evaluation]](https:\u002F\u002Fgithub.com\u002Ftjunlp-lab\u002FAwesome-LLMs-Evaluation-Papers)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftjunlp-lab\u002FAwesome-LLMs-Evaluation-Papers.svg?style=social) |⭐️ |\n|2023.11|🔥[**Runtime Performance**] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models(@hkust-gz.edu.cn) | 
[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03687.pdf)|⚠️|⭐️⭐️ |\n|2023.11|[ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?(@e.ntu.edu.sg)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16989.pdf)|⚠️|⭐️ |\n|2023.12|[Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00678.pdf)|⚠️|⭐️ |\n|2023.12|[Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly(@Drexel University)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02003.pdf)|⚠️|⭐️ |\n|2023.12|🔥[**LLMCompass**] A Hardware Evaluation Framework for Large Language Model Inference(@princeton.edu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03134.pdf)|⚠️|⭐️⭐️ |\n|2023.12|🔥[**Efficient LLMs**] Efficient Large Language Models: A Survey(@Ohio State University etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03863.pdf)|[[Efficient-LLMs-Survey]](https:\u002F\u002Fgithub.com\u002FAIoT-MLSys-Lab\u002FEfficient-LLMs-Survey)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIoT-MLSys-Lab\u002FEfficient-LLMs-Survey.svg?style=social) |⭐️⭐️ |\n|2023.12|[**Serving Survey**] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems(@Carnegie Mellon University) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.15234.pdf)|⚠️|⭐️⭐️ |\n|2024.01|[Understanding LLMs] Understanding LLMs: A Comprehensive Overview from Training to Inference(@Shaanxi Normal University etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02038.pdf) | ⚠️|⭐️⭐️ |\n|2024.02|[LLM-Viewer] LLM Inference Unveiled: Survey and Roofline Model Insights(@Zhihang Yuan etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.16363.pdf)|[[LLM-Viewer]](https:\u002F\u002Fgithub.com\u002Fhahnyuan\u002FLLM-Viewer)  
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhahnyuan\u002FLLM-Viewer.svg?style=social) |⭐️⭐️ |\n|2024.07|[**Internal Consistency & Self-Feedback**] Internal Consistency and Self-Feedback in Large Language Models: A Survey|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.14507)| [[ICSF-Survey]](https:\u002F\u002Fgithub.com\u002FIAAR-Shanghai\u002FICSFSurvey)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIAAR-Shanghai\u002FICSFSurvey.svg?style=social) | ⭐️⭐️ |\n|2024.09|[**Low-bit**] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms(@Beihang etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16694) | ⚠️|⭐️⭐️ |\n|2024.10|[**LLM Inference**] LARGE LANGUAGE MODEL INFERENCE ACCELERATION: A COMPREHENSIVE HARDWARE PERSPECTIVE(@SJTU etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.04466) | ⚠️|⭐️⭐️ |\n\n### 📖LLM Train\u002FInference Framework\u002FDesign ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"LLM-Train-Inference-Framework\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2020.05|🔥[**Megatron-LM**] Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1909.08053.pdf)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2023.03|[FlexGen] High-Throughput Generative Inference of Large Language Models  with a Single GPU(@Stanford University etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06865.pdf)|[[FlexGen]](https:\u002F\u002Fgithub.com\u002FFMInference\u002FFlexGen) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FFlexGen.svg?style=social)|⭐️ |\n|2023.05|[SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification(@Peking University 
etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.09781.pdf)|[[FlexFlow]](https:\u002F\u002Fgithub.com\u002Fflexflow\u002FFlexFlow\u002Ftree\u002Finference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fflexflow\u002FFlexFlow.svg?style=social)|⭐️ |\n|2023.05|[FastServe] Fast Distributed Inference Serving for Large Language Models(@Peking University etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.05920.pdf)|⚠️|⭐️ |\n|2023.09|🔥[**vLLM**] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.06180.pdf)|[[vllm]](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvllm-project\u002Fvllm.svg?style=social)|⭐️⭐️ |\n|2023.09|[StreamingLLM] EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS(@Meta AI etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.17453.pdf)|[[streaming-llm]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fstreaming-llm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fstreaming-llm.svg?style=social)|⭐️ |\n|2023.09|[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc)|[[blog]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmedusa-llm)|[[Medusa]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FMedusa.svg?style=social)|⭐️ |\n|2023.10|🔥[**TensorRT-LLM**] NVIDIA TensorRT LLM(@NVIDIA) |[[docs]](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002F)|[[TensorRT-LLM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FTensorRT-LLM.svg?style=social) |⭐️⭐️ |\n|2023.11|🔥[**DeepSpeed-FastGen 2x vLLM?**] DeepSpeed-FastGen: High-throughput Text Generation for 
LLMs via MII and DeepSpeed-Inference(@Microsoft)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.08671.pdf) | [[deepspeed-fastgen]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2023.12|🔥🔥[**SGLang**] Efficiently Programming Large Language Models using SGLang(@Stanford University etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07104)|[[sglang]](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsgl-project\u002Fsglang.svg?style=social) |⭐️⭐️ |\n|2023.12|🔥[**PETALS**] Distributed Inference and Fine-tuning of Large Language Models Over The Internet(@HSE University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08361.pdf)|[[petals]](https:\u002F\u002Fgithub.com\u002Fbigscience-workshop\u002Fpetals) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbigscience-workshop\u002Fpetals.svg?style=social)|⭐️⭐️ |\n|2023.10|[LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03294.pdf)|[[LightSeq]](https:\u002F\u002Fgithub.com\u002FRulinShao\u002FLightSeq) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRulinShao\u002FLightSeq.svg?style=social)|⭐️ |\n|2023.12|[PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU)|[[pdf]](https:\u002F\u002Fipads.se.sjtu.edu.cn\u002F_media\u002Fpublications\u002Fpowerinfer-20231219.pdf)|[[PowerInfer]](https:\u002F\u002Fgithub.com\u002FSJTU-IPADS\u002FPowerInfer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSJTU-IPADS\u002FPowerInfer.svg?style=social)|⭐️ |\n|2024.01|[inferflow] INFERFLOW: AN EFFICIENT AND HIGHLY CONFIGURABLE INFERENCE ENGINE FOR LARGE LANGUAGE MODELS(@Tencent AI 
Lab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.08294.pdf) | [[inferflow]](https:\u002F\u002Fgithub.com\u002Finferflow\u002Finferflow) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finferflow\u002Finferflow.svg?style=social)|⭐️ |\n|2024.06|🔥[**Mooncake**] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) |[[pdf]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake\u002Fblob\u002Fmain\u002FMooncake-v1.pdf) | [[Mooncake]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkvcache-ai\u002FMooncake.svg?style=social)|⭐️⭐️ |\n|2023.06|🔥[**LMDeploy**] LMDeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLMs(@InternLM) |[[docs]](https:\u002F\u002Flmdeploy.readthedocs.io\u002Fen\u002Flatest\u002F) | [[lmdeploy]](https:\u002F\u002Fgithub.com\u002FInternLM\u002Flmdeploy) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002Flmdeploy.svg?style=social)|⭐️⭐️ |\n|2023.05|🔥[**MLC-LLM**]Universal LLM Deployment Engine with ML Compilation(@mlc-ai) | [[docs]](https:\u002F\u002Fllm.mlc.ai\u002F) | [[mlc-llm]](https:\u002F\u002Fgithub.com\u002Fmlc-ai\u002Fmlc-llm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlc-ai\u002Fmlc-llm.svg?style=social)|⭐️⭐️ |\n|2023.08|🔥[**LightLLM**] LightLLM is a Python-based LLM (Large Language Model) inference and serving framework(@ModelTC) | [[docs]](https:\u002F\u002Fgithub.com\u002FModelTC\u002Flightllm) | [[lightllm]](https:\u002F\u002Fgithub.com\u002FModelTC\u002Flightllm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FModelTC\u002Flightllm.svg?style=social)|⭐️⭐️ |\n|2023.03|🔥[**llama.cpp**] llama.cpp: Inference of Meta's LLaMA model (and others) in pure C\u002FC++(@ggerganov) |[[docs]](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) | [[llama.cpp]](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fggerganov\u002Fllama.cpp.svg?style=social)|⭐️⭐️ |\n|2024.02|🔥[**flashinfer**] FlashInfer: Kernel Library for LLM Serving(@flashinfer-ai) |[[docs]](https:\u002F\u002Fflashinfer.ai\u002F2024\u002F02\u002F02\u002Fcascade-inference.html)|[[flashinfer]](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fflashinfer-ai\u002Fflashinfer.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥[DynamoLLM] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency(@Microsoft Azure Research)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.00741)|⚠️|⭐️ |\n|2024.08|🔥[NanoFlow] NanoFlow: Towards Optimal Large Language Model Serving Throughput(@University of Washington)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.12757)|[[Nanoflow]](https:\u002F\u002Fgithub.com\u002Fefeslab\u002FNanoflow) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fefeslab\u002FNanoflow.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥[**Decentralized LLM**] Decentralized LLM Inference over Edge Networks with Energy Harvesting(@Padova)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15907)|⚠️|⭐️ |\n|2024.11| 🔥[**SparseInfer**] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference(@University of Seoul, etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.12692)|⚠️|⭐️ |\n|2025.04|🔥[prima.cpp] PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters(@MBZUAI,
etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.08791)|[[prima.cpp]](https:\u002F\u002Fgithub.com\u002FLizonghang\u002Fprima.cpp) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLizonghang\u002Fprima.cpp.svg?style=social)|⭐️|\n|2025.07|🔥[**siiRL**] DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training(@Shanghai Innovation Institute)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.13833)|[[siiRL]](https:\u002F\u002Fgithub.com\u002Fsii-research\u002FsiiRL)\u003Cbr> ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsii-research\u002FsiiRL.svg?style=social)|⭐️⭐️ | \n|2025.04|🔥[**ToolPipe**] ToolPipe: 120+ Free Developer Tools REST API & MCP Server for AI Agents(@COSAI-Labs)|[[docs]](https:\u002F\u002Ftoolpipe.dev)|[[toolpipe-mcp-server]](https:\u002F\u002Fgithub.com\u002FCOSAI-Labs\u002Ftoolpipe-mcp-server) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCOSAI-Labs\u002Ftoolpipe-mcp-server.svg?style=social)|⭐️ |\n\n### 📖Continuous\u002FIn-flight Batching  ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Continuous-In-flight-Batching\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2022.07|🔥[**Continuous Batching**] Orca: A Distributed Serving System for Transformer-Based Generative Models(@Seoul National University etc) |[[pdf]](https:\u002F\u002Fwww.usenix.org\u002Fsystem\u002Ffiles\u002Fosdi22-yu.pdf)|⚠️|⭐️⭐️ |\n|2023.10|🔥[**In-flight Batching**] NVIDIA TensorRT LLM Batch Manager(@NVIDIA) |[[docs]](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002Fbatch_manager.html)|[[TensorRT-LLM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FTensorRT-LLM.svg?style=social) |⭐️⭐️ |\n|2023.11|🔥[**DeepSpeed-FastGen 2x vLLM?**] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft)|
[[blog]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Ftree\u002Fmaster\u002Fblogs\u002Fdeepspeed-fastgen) | [[deepspeed-fastgen]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2023.11|[Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting(@Microsoft etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18677.pdf)|⚠️ |⭐️ |\n|2023.12|[SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances(@cmu.edu etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.15566.pdf)|[[SpotServe]](https:\u002F\u002Fgithub.com\u002FHsword\u002FSpotServe) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHsword\u002FSpotServe.svg?style=social)|⭐️ |\n|2023.10|[LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03294.pdf)|[[LightSeq]](https:\u002F\u002Fgithub.com\u002FRulinShao\u002FLightSeq) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRulinShao\u002FLightSeq.svg?style=social)|⭐️ |\n|2024.05|🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft Research India)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04437)|[[vAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fvattention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fvattention.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**vTensor**] vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving(@Shanghai Jiao Tong University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.15309)|[[vTensor]](https:\u002F\u002Fgithub.com\u002Fintelligent-machine-learning\u002Fglake\u002Ftree\u002Fmaster\u002FGLakeServe) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fintelligent-machine-learning\u002Fglake.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥[Automatic Inference Engine Tuning] Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning(@Nanjing University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04323)|⚠️|⭐️⭐️ |\n|2024.08|🔥[**SJF Scheduling**] Efficient LLM Scheduling by Learning to Rank(@UCSD etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15792)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**BatchLLM**] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03594)|⚠️|⭐️⭐️ |\n\n### 📖Weight\u002FActivation Quantize\u002FCompress ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Weight-Activation-Quantize-Compress\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2022.06|🔥[**ZeroQuant**] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers(@Microsoft) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01861.pdf)|[[DeepSpeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social)|⭐️⭐️ |\n|2022.08|[FP8-Quantization] FP8 Quantization: The Power of the Exponent(@Qualcomm AI Research) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09225.pdf) | [[FP8-quantization]](https:\u002F\u002Fgithub.com\u002FQualcomm-AI-research\u002FFP8-quantization) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQualcomm-AI-research\u002FFP8-quantization.svg?style=social) |⭐️ |\n|2022.08|[LLM.int8()] 8-bit Matrix Multiplication  for Transformers at Scale(@Facebook AI Research etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07339.pdf)|[[bitsandbytes]](https:\u002F\u002Fgithub.com\u002Ftimdettmers\u002Fbitsandbytes) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftimdettmers\u002Fbitsandbytes.svg?style=social)|⭐️ |\n|2022.10|🔥[**GPTQ**] GPTQ: ACCURATE POST-TRAINING QUANTIZATION FOR GENERATIVE PRE-TRAINED TRANSFORMERS(@IST Austria etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17323.pdf) |[[gptq]](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fgptq) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fgptq.svg?style=social)|⭐️⭐️ |\n|2022.11|🔥[**WINT8\u002F4**] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10017.pdf)|[[FasterTransformer]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FFasterTransformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FFasterTransformer.svg?style=social)|⭐️⭐️ |\n|2022.11|🔥[**SmoothQuant**] Accurate and Efficient Post-Training Quantization for Large Language Models(@MIT etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10438.pdf)|[[smoothquant]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fsmoothquant) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fsmoothquant.svg?style=social)|⭐️⭐️ |\n|2023.03|[ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.08302.pdf)|[[DeepSpeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social)|⭐️ |\n|2023.06|🔥[**AWQ**] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration(@MIT etc)|[[pdf]](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2306.00978.pdf)|[[llm-awq]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fllm-awq.svg?style=social)|⭐️⭐️ |\n|2023.06|[SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression(@University of Washington etc)|[[pdf]](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2306.03078.pdf)|[[SpQR]](https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVahe1994\u002FSpQR.svg?style=social)|⭐️ |\n|2023.06|[SqueezeLLM] SQUEEZELLM: DENSE-AND-SPARSE QUANTIZATION(@berkeley.edu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.07629.pdf) | [[SqueezeLLM]](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezeLLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FSqueezeLLM.svg?style=social) |⭐️ |\n|2023.07|[ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.09782.pdf)|[[DeepSpeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social)|⭐️ |\n|2023.09|[KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) | [[blog]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F653735572)|⚠️|⭐️ |\n|2023.10|[FP8-LM] FP8-LM: Training FP8 Large Language Models(@Microsoft etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.18313.pdf)| [[MS-AMP]](https:\u002F\u002Fgithub.com\u002FAzure\u002FMS-AMP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAzure\u002FMS-AMP.svg?style=social) |⭐️ |\n|2023.10|[LLM-Shearing] SHEARED LLAMA: ACCELERATING LANGUAGE MODEL PRE-TRAINING VIA STRUCTURED PRUNING(@cs.princeton.edu etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.06694.pdf) | [[LLM-Shearing]](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FLLM-Shearing.svg?style=social)  |⭐️ |\n|2023.10|[LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers(@ust.hk&meta etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16836.pdf) | [[LLM-FP4]](https:\u002F\u002Fgithub.com\u002Fnbasyl\u002FLLM-FP4) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnbasyl\u002FLLM-FP4.svg?style=social) |⭐️ |\n|2023.11|[2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization(@Shanghai Jiao Tong University etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16442.pdf)|⚠️ |⭐️ |\n|2023.12|[**SmoothQuant+**] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM(@ZTE Corporation)  | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03788.pdf) | [[smoothquantplus]](https:\u002F\u002Fgithub.com\u002FAdlik\u002Fsmoothquantplus) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAdlik\u002Fsmoothquantplus.svg?style=social) |⭐️ |\n|2023.11|[OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs(@meituan.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09550.pdf)|⚠️|⭐️ |\n|2023.12|🔥[**SparQ**] SPARQ ATTENTION: BANDWIDTH-EFFICIENT LLM INFERENCE(@graphcore.ai)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04985.pdf)|⚠️|⭐️⭐️ |\n|2023.12|[Agile-Quant] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge(@Northeastern University&Oracle)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.05693.pdf)|⚠️|⭐️ |\n|2023.12|[CBQ] CBQ: Cross-Block Quantization for Large Language Models(@ustc.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07950.pdf)|⚠️|⭐️ |\n|2023.10|[QLLM] QLLM: ACCURATE AND EFFICIENT LOW-BITWIDTH QUANTIZATION FOR LARGE LANGUAGE MODELS(@ZIP Lab&SenseTime Research 
etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.08041.pdf)|⚠️|⭐️ |\n|2024.01|[FP6-LLM] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design(@Microsoft etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.14112.pdf)|⚠️|⭐️ |\n|2024.05|🔥🔥[**W4A8KV4**] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(@MIT&NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04532)|[[qserve]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fqserve) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fqserve.svg?style=social) |⭐️⭐️ |\n|2024.05|🔥[SpinQuant] SpinQuant: LLM Quantization with Learned Rotations(@Meta)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.16406)|⚠️|⭐️ |\n|2024.05|🔥[I-LLM] I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models(@Houmo AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.17849)|⚠️|⭐️ |\n|2024.06|🔥[OutlierTune] OutlierTune: Efficient Channel-Wise Quantization for Large Language Models(@Beijing University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18832)|⚠️|⭐️ |\n|2024.06|🔥[GPTQT] GPTQT: Quantize Large Language Models Twice to Push the Efficiency(@zju)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02891)|⚠️|⭐️ |\n|2024.08|🔥[ABQ-LLM] ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models(@ByteDance)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08554)|[[ABQ-LLM]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FABQ-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FABQ-LLM.svg?style=social)|⭐️ |\n|2024.08|🔥[1-bit LLMs] Matmul or No Matmal in the Era of 1-bit LLMs(@University of South Carolina)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11939)|⚠️|⭐️ |\n|2024.08|🔥[ACTIVATION SPARSITY] TRAINING-FREE ACTIVATION SPARSITY IN LARGE LANGUAGE MODELS(@MIT 
etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.14690)|[[TEAL]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FTEAL) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FTEAL.svg?style=social)|⭐️ |\n|2024.09|🔥[VPTQ] VPTQ: EXTREME LOW-BIT VECTOR POST-TRAINING QUANTIZATION FOR LARGE LANGUAGE MODELS(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.17066)|[[VPTQ]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVPTQ) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FVPTQ.svg?style=social)|⭐️ |\n|2024.11|🔥[BitNet] BitNet a4.8: 4-bit Activations for 1-bit LLMs(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.04965)|[[bitnet]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fbitnet) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social)|⭐️ |\n|2025.04|🔥[**BitNet v2**] BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.18415)|[[bitnet]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fbitnet) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social)|⭐️ |\n|2025.05|🔥[**GuidedQuant**] GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance (@SNU&SamsungAILab&Google) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.07004) |[[GuidedQuant]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FGuidedQuant) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FGuidedQuant.svg?style=social)|⭐️⭐️ |\n\n### 📖IO\u002FFLOPs-Aware\u002FSparse Attention ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"IO-FLOPs-Aware-Attention-Sparse\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2018.05| [Online Softmax] Online normalizer calculation 
for softmax(@NVIDIA) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1805.02867.pdf)|⚠️|⭐️ |\n|2019.11|🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.02150.pdf)|⚠️|⭐️⭐️ |\n|2020.10|[Hash Attention] REFORMER: THE EFFICIENT TRANSFORMER(@Google)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2001.04451.pdf)|[[reformer]](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Ftrax\u002Ftree\u002Fmaster\u002Ftrax\u002Fmodels\u002Freformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Ftrax.svg?style=social)|⭐️⭐️ |\n|2022.05|🔥[**FlashAttention**] Fast and Memory-Efficient Exact Attention with IO-Awareness(@Stanford University etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14135.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2022.10|[Online Softmax] SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY(@Google)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05682.pdf) | ⚠️ |⭐️ |\n|2023.05|[FlashAttention] From Online Softmax to FlashAttention(@cs.washington.edu)|[[pdf]](https:\u002F\u002Fcourses.cs.washington.edu\u002Fcourses\u002Fcse599m\u002F23sp\u002Fnotes\u002Fflashattn.pdf)|⚠️|⭐️⭐️ |\n|2023.05|[FLOP, I\u002FO] Dissecting Batching Effects in GPT Inference(@Lequn Chen) | [[blog]](https:\u002F\u002Fle.qun.ch\u002Fen\u002Fblog\u002F2023\u002F05\u002F13\u002Ftransformer-batching\u002F) | ⚠️ |⭐️ |\n|2023.05|🔥🔥[**GQA**] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13245.pdf)|[[flaxformer]](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fflaxformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fflaxformer.svg?style=social) |⭐️⭐️ |\n|2023.06|[Sparse 
FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01160.pdf) | [[dynamic-sparse-flash-attention]](https:\u002F\u002Fgithub.com\u002Fepfml\u002Fdynamic-sparse-flash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fepfml\u002Fdynamic-sparse-flash-attention.svg?style=social)|⭐️ |\n|2023.07|🔥[**FlashAttention-2**] Faster Attention with Better Parallelism and Work Partitioning(@Stanford University etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08691.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥[**Flash-Decoding**] Flash-Decoding for long-context inference(@Stanford University etc)|[[blog]](https:\u002F\u002Fcrfm.stanford.edu\u002F2023\u002F10\u002F12\u002Fflashdecoding.html)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2023.11|[Flash-Decoding++] FLASHDECODING++: FASTER LARGE LANGUAGE MODEL INFERENCE ON GPUS(@Tsinghua University&Infinigence-AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01282.pdf) | ⚠️ |⭐️ |\n|2023.01|[SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot(@ISTA etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.00774.pdf)| [[sparsegpt]](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fsparsegpt) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fsparsegpt.svg?style=social) |⭐️ |\n|2023.12|🔥[**GLA**] Gated Linear Attention Transformers with Hardware-Efficient Training(@MIT-IBM Watson 
AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06635.pdf)|[[gated_linear_attention]](https:\u002F\u002Fgithub.com\u002Fberlino\u002Fgated_linear_attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fberlino\u002Fgated_linear_attention.svg?style=social)|⭐️⭐️ |\n|2023.12|[SCCA] SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion(@Beihang University)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07305.pdf) | ⚠️ |⭐️ |\n|2023.12|🔥[**FlashLLM**] LLM in a flash: Efficient Large Language Model Inference with Limited Memory(@Apple)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11514.pdf) | ⚠️ |⭐️⭐️ |\n|2024.03|🔥🔥[CHAI] CHAI: Clustered Head Attention for Efficient LLM Inference(@cs.wisc.edu etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.08058.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[DeFT] DeFT: Decoding with Flash Tree-Attention for Efficient Tree-structured LLM Inference(@Westlake University etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.00242) | ⚠️ |⭐️⭐️ |\n|2024.04|[MoA] MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression(@thu et al.)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.14909) | [[MoA]](https:\u002F\u002Fgithub.com\u002Fthu-nics\u002FMoA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-nics\u002FMoA.svg?style=social) | ⭐️ |\n|2024.07|🔥🔥[**FlashAttention-3**] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) |[[pdf]](https:\u002F\u002Ftridao.me\u002Fpublications\u002Fflash3\u002Fflash3.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**MInference 1.0**] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft)
|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02490)|[[MInference 1.0]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[Shared Attention] Beyond KV Caching: Shared Attention for Efficient LLMs(@Kyushu University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.12866) | [[shareAtt]](https:\u002F\u002Fgithub.com\u002Fmetacarbon\u002FshareAtt) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmetacarbon\u002FshareAtt.svg?style=social) | ⭐️ |\n|2024.09|🔥🔥[**CHESS**] CHESS : Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification(@Wuhan University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.01366) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION: ENABLING FLASH ATTENTION FOR INT8 QUANTIZATION(@PKU etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16997)| [[INT-FlashAttention]](https:\u002F\u002Fgithub.com\u002FINT-FlashAttention2024\u002FINT-FlashAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FINT-FlashAttention2024\u002FINT-FlashAttention.svg?style=social) | ⭐️ |\n|2024.10|🔥🔥[**SageAttention**] SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02367)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2024.11|🔥🔥[**SageAttention-2**] SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.10958)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ 
|\n|2024.11|🔥🔥[**Squeezed Attention**] SQUEEZED ATTENTION: Accelerating Long Context Length LLM Inference(@UC Berkeley) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.09688)|[[SqueezedAttention]](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezedAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FSqueezedAttention) | ⭐️⭐️ |\n|2024.12|🔥🔥[**TurboAttention**] TURBOATTENTION: EFFICIENT ATTENTION APPROXIMATION FOR HIGH THROUGHPUTS LLMS(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08585)| ⚠️ |⭐️⭐️ |\n|2025.01|🔥🔥[**FFPA**] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA(@xlite-dev)|[[docs]](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)| [[ffpa-attn]](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️ |\n|2025.03|🔥🔥[**SpargeAttention**] SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18137)|[[SpargeAttn]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSpargeAttn) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSpargeAttn) | ⭐️⭐️ |\n|2025.04|🔥🔥[**MMInference**] MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention(@microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.16083)|[[MInference]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Sparse Frontier**] The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs (@Cohere) | 
[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.17768)|[[SparseFrontier]](https:\u002F\u002Fgithub.com\u002FPiotrNawrot\u002Fsparse-frontier) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPiotrNawrot\u002Fsparse-frontier) | ⭐️⭐️ |\n|2024.12|🔥🔥[**Flex Attention**] FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS(@pytorch) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05496)|[[attention-gym]](https:\u002F\u002Fgithub.com\u002Fpytorch-labs\u002Fattention-gym) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpytorch-labs\u002Fattention-gym) | ⭐️⭐️ |\n|2025.02| 🔥🔥🔥[**SeerAttention**] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs(@microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13276) | [[SeerAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSeerAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FSeerAttention.svg?style=social) | ⭐️⭐️⭐️ |\n|2025.03| [**Slim attention**] Slim attention: cut your context memory in half without loss of accuracy, K-cache is all you need for MHA(@OpenMachine.ai) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.05840) | [[OpenMachine]](https:\u002F\u002Fgithub.com\u002FOpenMachine-ai\u002Ftransformer-tricks) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMachine-ai\u002Ftransformer-tricks.svg?style=social) | ⭐️⭐️⭐️ |\n|2025.05|🔥🔥[**SageAttention-3**] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.11594)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Parallel Encoding**] APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel
Encoding(@cmu.edu&NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05431)|[[APE]](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FAPE) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FAPE) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Parallel Encoding**] Block-Attention for Efficient Prefilling(@Tencent etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.15355)|[[Block-attention]](https:\u002F\u002Fgithub.com\u002FTemporaryLoRA\u002FBlock-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTemporaryLoRA\u002FBlock-attention) | ⭐️⭐️ |\n\n### 📖KV Cache Scheduling\u002FQuantize\u002FDropping ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"KV-Cache-Scheduling-Quantize-Dropping\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2026.03|🔥🔥[**NexusQuant**] NexusQuant: Training-Free KV Cache Compression via E8 Lattice Quantization and Temporal Predictive Coding — 7x compression, -2.26% PPL on Mistral-7B, drop-in one-liner| [[code]](https:\u002F\u002Fgithub.com\u002Fnexusquant\u002Fnexusquant)|[[nexusquant]](https:\u002F\u002Fgithub.com\u002Fnexusquant\u002Fnexusquant) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnexusquant\u002Fnexusquant.svg?style=social)|⭐️⭐️ |\n|2019.11|🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.02150.pdf)|⚠️|⭐️⭐️ |\n|2022.06|[LTP] Learned Token Pruning for Transformers(@UC Berkeley etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.00910.pdf)|[[LTP]](https:\u002F\u002Fgithub.com\u002Fkssteven418\u002FLTP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkssteven418\u002FLTP.svg?style=social)|⭐️ |\n|2023.05|🔥🔥[**GQA**] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) | 
[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13245.pdf)|[[flaxformer]](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fflaxformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fflaxformer.svg?style=social) |⭐️⭐️ |\n|2023.05|[KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time(@)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.17118.pdf)|⚠️|⭐️⭐️ |\n|2023.06|[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(@Rice University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14048.pdf)|[[H2O]](https:\u002F\u002Fgithub.com\u002FFMInference\u002FH2O) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FH2O.svg?style=social) |⭐️ |\n|2023.06|[QK-Sparse\u002FDropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01160.pdf) | [[dynamic-sparse-flash-attention]](https:\u002F\u002Fgithub.com\u002Fepfml\u002Fdynamic-sparse-flash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fepfml\u002Fdynamic-sparse-flash-attention.svg?style=social)|⭐️ |\n|2023.08|🔥🔥[Chunked Prefills] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills(@Microsoft etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16369.pdf)|⚠️|⭐️⭐️ |\n|2023.09|🔥🔥[**PagedAttention**] Efficient Memory Management for Large Language  Model Serving with PagedAttention(@UC Berkeley etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.06180.pdf)|[[vllm]](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvllm-project\u002Fvllm.svg?style=social)|⭐️⭐️ |\n|2023.09|[KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) | 
[[blog]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F653735572)|⚠️|⭐️ |\n|2023.10|🔥[**TensorRT-LLM KV Cache FP8**] NVIDIA TensorRT LLM(@NVIDIA) |[[docs]](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002Fprecision.html)|[[TensorRT-LLM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FTensorRT-LLM.svg?style=social) |⭐️⭐️ |\n|2023.10|🔥[**Adaptive KV Cache Compress**] MODEL TELLS YOU WHAT TO DISCARD: ADAPTIVE KV CACHE COMPRESSION FOR LLMS(@illinois.edu&microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01801.pdf)|⚠️|⭐️⭐️ |\n|2023.10|[CacheGen] CacheGen: Fast Context Loading for Language Model Applications(@Chicago University&Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07240.pdf)|[[LMCache]](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLMCache\u002FLMCache.svg?style=social)|⭐️ |\n|2023.12|[KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04951.pdf)|⚠️|⭐️ |\n|2023.12|[KV Cache Compress with LoRA] Compressed Context Memory for Online Language Model Interaction (@SNU & NAVER AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03414.pdf)|[[Compressed-Context-Memory]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FContext-Memory) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FContext-Memory.svg?style=social) |⭐️⭐️ |\n|2023.12|🔥🔥[**RadixAttention**] Efficiently Programming Large Language Models using SGLang(@Stanford University etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07104)|[[sglang]](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsgl-project\u002Fsglang.svg?style=social) 
|⭐️⭐️ |\n|2024.01|🔥🔥[**DistKV-LLM**] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache(@Alibaba etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02669.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[Prompt Caching] Efficient Prompt Caching via Embedding Similarity(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01173.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[Less] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference(@CMU etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.09398.pdf)|⚠️|⭐️ |\n|2024.02|🔥🔥[MiKV] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization(@KAIST)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.18096.pdf)|⚠️|⭐️ |\n|2024.02|🔥🔥[**Shared Prefixes**] Hydragen: High-Throughput LLM Inference with Shared Prefixes | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.05099.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[**ChunkAttention**] ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition(@microsoft.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.15220)|[[chunk-attention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fchunk-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fchunk-attention.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥[QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache(@smail.nju.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04643.pdf)|[[QAQ-KVCacheQuantization]](https:\u002F\u002Fgithub.com\u002FClubieDong\u002FQAQ-KVCacheQuantization) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FClubieDong\u002FQAQ-KVCacheQuantization.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥🔥[DMC] Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference(@NVIDIA etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09636.pdf)|⚠️|⭐️⭐️ |\n|2024.03|🔥🔥[Keyformer] Keyformer: KV Cache 
reduction through key tokens selection for Efficient Generative Inference(@ece.ubc.ca etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09054.pdf)|[[Keyformer]](https:\u002F\u002Fgithub.com\u002Fd-matrix-ai\u002Fkeyformer-llm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fd-matrix-ai\u002Fkeyformer-llm.svg?style=social)|⭐️⭐️ |\n|2024.03|[FASTDECODE] FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous(@Tsinghua University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.11421.pdf)|⚠️|⭐️⭐️ |\n|2024.03|[Sparsity-Aware KV Caching] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching(@ucf.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.17312.pdf)|⚠️|⭐️⭐️ |\n|2024.03|🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM(@gatech.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05527)|[[GEAR]](https:\u002F\u002Fgithub.com\u002Fopengear-project\u002FGEAR) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopengear-project\u002FGEAR.svg?style=social)|⭐️ |\n|2024.04|[SqueezeAttention] SQUEEZEATTENTION: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget(@lzu.edu.cn etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.04793.pdf)|[[SqueezeAttention]](https:\u002F\u002Fgithub.com\u002Fhetailang\u002FSqueezeAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhetailang\u002FSqueezeAttention.svg?style=social) |⭐️⭐️ |\n|2024.04|[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation(@UIUC)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14469)|[[SnapKV]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FSnapKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FSnapKV.svg?style=social)|⭐️ |\n|2024.05|🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft 
Research India)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04437)|[[vAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fvattention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fvattention.svg?style=social)|⭐️⭐️ |\n|2024.05|🔥[KVCache-1Bit] KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization(@Rice University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.03917)|⚠️|⭐️⭐️ |\n|2024.05|🔥[KV-Runahead] KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation(@Apple etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05329)|⚠️|⭐️⭐️ |\n|2024.05|🔥[ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification(@Zhejiang University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14256)|⚠️|⭐️⭐️ |\n|2024.05|🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models(@ZIP Lab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14366)|⚠️|⭐️⭐️ |\n|2024.05|🔥[CacheBlend] CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion(@University of Chicago)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.16444)|[[LMCache]](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLMCache\u002FLMCache.svg?style=social)|⭐️⭐️ |\n|2024.06|🔥[CompressKV] Effectively Compress KV Heads for LLM(@alibaba etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.07056)|⚠️|⭐️⭐️ |\n|2024.06|🔥[MemServe] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool(@Huawei Cloud etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.17565)|⚠️|⭐️⭐️ |\n|2024.07|🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding(@Institut Teknologi 
Bandung)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09297)|[[pythia-mlkv]](https:\u002F\u002Fgithub.com\u002Fzaydzuhri\u002Fpythia-mlkv) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzaydzuhri\u002Fpythia-mlkv.svg?style=social)|⭐️ |\n|2024.07|🔥[ThinK] ThinK: Thinner Key Cache by Query-Driven Pruning(@Salesforce AI Research etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21018)|⚠️|⭐️⭐️ |\n|2024.07|🔥[Palu] Palu: Compressing KV-Cache with Low-Rank Projection(@nycu.edu.tw)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21118)|[[Palu]](https:\u002F\u002Fgithub.com\u002Fshadowpa0327\u002FPalu) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshadowpa0327\u002FPalu.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥[Zero-Delay QKV Compression] Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference(@University of Virginia)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04107)|⚠️|⭐️⭐️ |\n|2024.09|🔥[**AlignedKV**] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization(@Tsinghua University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16546)|[[AlignedKV]](https:\u002F\u002Fgithub.com\u002FAlignedQuant\u002FAlignedKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlignedQuant\u002FAlignedKV.svg?style=social)|⭐️ |\n|2024.10|🔥[**LayerKV**] Optimizing Large Language Model Serving with Layer-wise KV Cache Management(@Ant Group)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.00428)|⚠️|⭐️⭐️ |\n|2024.10|🔥[**AdaKV**] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference (@USTC)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550)|[[AdaKV]](https:\u002F\u002Fgithub.com\u002FFFY0\u002FAdaKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFFY0\u002FAdaKV.svg?style=social&label=Star)|⭐️⭐️|\n|2024.11|🔥[**KV Cache Recomputation**] Efficient LLM Inference with 
I\u002FO-Aware Partial KV Cache Recomputation(@University of Southern California)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17089)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**ClusterKV**] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression(@sjtu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03213)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**DynamicKV**] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs(@xiabinzhou0625 etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14838)|⚠️|⭐️⭐️ |\n|2025.02|🔥[**DynamicLLaVA**] [ICLR2025] Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification (@ECNU, Xiaohongshu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.00876)|[[DynamicLLaVA]](https:\u002F\u002Fgithub.com\u002FOsilly\u002Fdynamic_llava) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002Fdynamic_llava.svg?style=social&label=Star)|⭐️⭐️|\n|2025.02|🔥[**CacheCraft**] Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation(@Adobe Research)|[[pdf]](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.15734)|⚠️|⭐️⭐️ |\n|2025.04|🔥[**KV Cache Prefetch**] Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching(@Alibaba)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.06319)|⚠️|⭐️⭐️ |\n|2025.05|🔥[**KVzip**] KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction (@SNU)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416)|[[KVzip]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FKVzip) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FKVzip.svg?style=social&label=Star)|⭐️⭐️|\n|2025.06|🔥🔥[**Inference-Time Hyper-Scaling**] Inference-Time Hyper-Scaling with KV Cache Compression (@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.05345)|⚠️|⭐️⭐️ |\n|2026.03|[**AVP**] Agent Vector Protocol: Cross-Model KV-Cache Transfer via Vocabulary-Mediated 
Projection (@VectorArc)|[[spec]](https:\u002F\u002Fgithub.com\u002FVectorArc\u002Favp-spec)|[[avp-python]](https:\u002F\u002Fgithub.com\u002FVectorArc\u002Favp-python) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVectorArc\u002Favp-python.svg?style=social)|⭐️⭐️ |\n\n### 📖Prompt\u002FContext\u002FKV Compression ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Context-Compression\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.04|🔥[**Selective-Context**] Compressing Context to Enhance Inference Efficiency of Large Language Models(@Surrey) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.06201.pdf)|[Selective-Context](https:\u002F\u002Fgithub.com\u002Fliyucheng09\u002FSelective_Context)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliyucheng09\u002FSelective_Context.svg?style=social)|⭐️⭐️ |\n|2023.05|[**AutoCompressor**] Adapting Language Models to Compress Contexts(@Princeton) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14788.pdf)|[AutoCompressor](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FAutoCompressors)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FAutoCompressors.svg?style=social)|⭐️ |\n|2023.10|🔥[**LLMLingua**] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05736.pdf)|[LLMLingua](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥🔥[**LongLLMLingua**] LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839","Awesome-LLM-Inference 
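is a carefully curated collection of papers and code on inference for large language models (LLMs) and vision-language models (VLMs). The project covers techniques such as Flash-Attention, Paged-Attention, WINT8\u002F4 quantization, and parallelism, and aims to give researchers and developers up-to-date methods and implementations for efficient inference. Its core coverage includes support for a range of optimization techniques, such as memory management and acceleration strategies, which significantly improve inference performance across hardware platforms. It suits scenarios that need to optimize the inference efficiency of large pretrained models, such as natural language processing services and image recognition systems.",2,"2026-06-11 03:36:25","high_star"]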