[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71152":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":37,"readmeContent":38,"aiSummary":39,"trendingCount":16,"starSnapshotCount":16,"syncStatus":40,"lastSyncTime":41,"discoverSource":42},71152,"flashinfer","flashinfer-ai\u002Fflashinfer","flashinfer-ai","FlashInfer: Kernel Library for LLM Serving","https:\u002F\u002Fflashinfer.ai",null,"Python",5774,1036,50,378,0,22,62,183,66,115.05,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36],"attention","cuda","distributed-inference","gpu","jit","large-large-models","llm-inference","moe","nvidia","pytorch","2026-06-12 04:00:59","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fweb-data\u002Fblob\u002Fmain\u002Flogo\u002FFlashInfer-black-background.png?raw=true\">\n    \u003Cimg alt=\"FlashInfer\" src=\"https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fweb-data\u002Fblob\u002Fmain\u002Flogo\u002FFlashInfer-white-background.png?raw=true\" width=55%>\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\u003Ch1 align=\"center\">\nHigh-Performance GPU Kernels for Inference\n\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n| \u003Ca href=\"https:\u002F\u002Fdocs.flashinfer.ai\">\u003Cb>Documentation\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002Freleases\u002Flatest\">\u003Cb>Latest 
Release\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fflashinfer.ai\">\u003Cb>Blog\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fflashinfer\u002Fshared_invite\u002Fzt-379wct3hc-D5jR~1ZKQcU00WHsXhgvtA\">\u003Cb>Slack\u003C\u002Fb>\u003C\u002Fa> |  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Forgs\u002Fflashinfer-ai\u002Fdiscussions\">\u003Cb>Discussion Forum\u003C\u002Fb>\u003C\u002Fa> |\n\u003C\u002Fp>\n\n[![Build Status](https:\u002F\u002Fci.tlcpack.ai\u002Fjob\u002Fflashinfer-ci\u002Fjob\u002Fmain\u002Fbadge\u002Ficon)](https:\u002F\u002Fci.tlcpack.ai\u002Fjob\u002Fflashinfer-ci\u002Fjob\u002Fmain\u002F)\n[![Documentation](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002Factions\u002Fworkflows\u002Fbuild-doc.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002Factions\u002Fworkflows\u002Fbuild-doc.yml)\n\n**FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. 
It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2\u002F3, cuDNN, CUTLASS, and TensorRT-LLM.\n\n## Why FlashInfer?\n\n- **State-of-the-art Performance**: Optimized kernels for prefill, decode, and mixed batching scenarios\n- **Multiple Backends**: Automatically selects the best backend for your hardware and workload\n- **Modern Architecture Support**: Support for SM75 (Turing) and later (through Blackwell)\n- **Low-Precision Compute**: FP8 and FP4 quantization for attention, GEMM, and MoE operations\n- **Production-Ready**: CUDAGraph and torch.compile compatible for low-latency serving\n\n## Core Features\n\n### Attention Kernels\n- **Paged and Ragged KV-Cache**: Efficient memory management for dynamic batch serving\n- **Decode, Prefill, and Append**: Optimized kernels for all attention phases\n- **MLA Attention**: Native support for DeepSeek's Multi-head Latent Attention\n- **Cascade Attention**: Memory-efficient hierarchical KV-Cache for shared prefixes\n- **Sparse Attention**: Block-sparse and variable block-sparse patterns\n- **POD-Attention**: Fused prefill+decode for mixed batching\n\n### GEMM & Linear Operations\n- **BF16 GEMM**: BF16 matrix multiplication for SM10.0+ GPUs.\n- **FP8 GEMM**: Per-tensor and groupwise scaling\n- **FP4 GEMM**: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs\n- **Grouped GEMM**: Efficient batched matrix operations for LoRA and multi-expert routing\n\n### Mixture of Experts (MoE)\n- **Fused MoE Kernels**\n- **Multiple Routing Methods**: DeepSeek-V3, Llama-4, and standard top-k routing\n- **Quantized MoE**: FP8 and FP4 expert weights with block-wise scaling\n\n### Sampling & Decoding\n- **Sorting-Free Sampling**: Efficient Top-K, Top-P, and Min-P without sorting\n- **Speculative Decoding**: Chain speculative sampling support\n\n### Communication\n- **AllReduce**: Custom implementations\n- **Multi-Node NVLink**: MNNVL support for multi-node 
inference\n- **NVSHMEM Integration**: For distributed memory operations\n\n### Other Operators\n- **RoPE**: LLaMA-style rotary position embeddings (including LLaMA 3.1)\n- **Normalization**: RMSNorm, LayerNorm, Gemma-style fused operations\n- **Activations**: SiLU, GELU with fused gating\n\n## GPU Support\n\n| Architecture | Compute Capability | Example GPUs |\n|--------------|-------------------|------|\n| Turing | SM 7.5 | T4, RTX 20 series |\n| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |\n| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |\n| Hopper | SM 9.0 | H100, H200 |\n| Blackwell | SM 10.0, 10.3 | B200, B300 |\n| Blackwell | SM 11.0 | Jetson Thor |\n| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark |\n\n> **Note:** Not all features are supported across all compute capabilities.\n\n## News\n\nLatest: [![GitHub Release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fflashinfer-ai\u002Fflashinfer)](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002Freleases\u002Flatest)\n\nNotable updates:\n- [2025-10-08] Blackwell support added in [v0.4.0](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002Freleases\u002Ftag\u002Fv0.4.0)\n- [2025-03-10] [Blog Post](https:\u002F\u002Fflashinfer.ai\u002F2025\u002F03\u002F10\u002Fsampling.html) Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.\n\n## Getting Started\n\n### Installation\n\n**Quickstart:**\n\n```bash\npip install flashinfer-python\n```\n\n**Package Options:**\n\n- **flashinfer-python**: Core package that compiles\u002Fdownloads kernels on first use\n- **flashinfer-cubin**: Pre-compiled kernel binaries for all supported GPU architectures\n- **flashinfer-jit-cache**: Pre-built kernel cache for specific CUDA versions\n\n**For faster initialization and offline usage**, install the optional packages to have most kernels pre-compiled:\n\n```bash\npip install flashinfer-python flashinfer-cubin\n# JIT 
cache (replace cu129 with your CUDA version)\npip install flashinfer-jit-cache --index-url https:\u002F\u002Fflashinfer.ai\u002Fwhl\u002Fcu129\n```\n\n**For Blackwell (SM100+) CuTe DSL kernels**, install with the CUDA 13 extra to enable Blackwell-optimized kernels:\n\n```bash\npip install flashinfer-python[cu13]\n```\n\n### Verify Installation\n\n```bash\nflashinfer show-config\n```\n\n### Basic Usage\n\n```python\nimport torch\nimport flashinfer\n\n# Single decode attention\nq = torch.randn(32, 128, device=\"cuda\", dtype=torch.float16)  # [num_qo_heads, head_dim]\nk = torch.randn(2048, 32, 128, device=\"cuda\", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]\nv = torch.randn(2048, 32, 128, device=\"cuda\", dtype=torch.float16)\n\noutput = flashinfer.single_decode_with_kv_cache(q, k, v)\n```\n\nSee [documentation](https:\u002F\u002Fdocs.flashinfer.ai\u002F) for comprehensive API reference and tutorials.\n\n### Install from Source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer.git --recursive\ncd flashinfer\npython -m pip install -v .\n```\n\n**For development**, install in editable mode:\n\n```bash\npython -m pip install --no-build-isolation -e . -v\n```\n\n> **Note:** When using `--no-build-isolation`, pip does not automatically install build dependencies. FlashInfer requires `setuptools>=77`. 
If you encounter an error like `AttributeError: module 'setuptools.build_meta' has no attribute 'prepare_metadata_for_build_editable'`, upgrade pip and setuptools first:\n> ```bash\n> python -m pip install --upgrade pip setuptools\n> ```\n\nBuild optional packages:\n\n```bash\n# flashinfer-cubin\ncd flashinfer-cubin\npython -m build --no-isolation --wheel\npython -m pip install dist\u002F*.whl\n```\n\n```bash\n# flashinfer-jit-cache (customize for your target GPUs)\nexport FLASHINFER_CUDA_ARCH_LIST=\"7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f\"\ncd flashinfer-jit-cache\npython -m build --no-isolation --wheel\npython -m pip install dist\u002F*.whl\n```\n\nFor more details, see the [Install from Source documentation](https:\u002F\u002Fdocs.flashinfer.ai\u002Finstallation.html#install-from-source).\n\n### Nightly Builds\n\n```bash\npip install -U --pre flashinfer-python --index-url https:\u002F\u002Fflashinfer.ai\u002Fwhl\u002Fnightly\u002F --no-deps\npip install flashinfer-python  # Install dependencies from PyPI\npip install -U --pre flashinfer-cubin --index-url https:\u002F\u002Fflashinfer.ai\u002Fwhl\u002Fnightly\u002F\n# JIT cache (replace cu129 with your CUDA version)\npip install -U --pre flashinfer-jit-cache --index-url https:\u002F\u002Fflashinfer.ai\u002Fwhl\u002Fnightly\u002Fcu129\n```\n\n### CLI Tools\n\nFlashInfer provides several CLI commands for configuration, module management, and development:\n\n```bash\n# Verify installation and view configuration\nflashinfer show-config\n\n# List and inspect modules\nflashinfer list-modules\nflashinfer module-status\n\n# Manage artifacts and cache\nflashinfer download-cubin\nflashinfer clear-cache\n\n# For developers: generate compile_commands.json for IDE integration\nflashinfer export-compile-commands [output_path]\n```\n\nFor complete documentation, see the [CLI reference](https:\u002F\u002Fdocs.flashinfer.ai\u002Fcli.html).\n\n## API Logging\n\nFlashInfer provides comprehensive API logging for debugging. 
Enable it using environment variables:\n\n```bash\n# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics)\nexport FLASHINFER_LOGLEVEL=3\n\n# Set log destination (stdout (default), stderr, or file path)\nexport FLASHINFER_LOGDEST=stdout\n```\n\nFor detailed information about logging levels, configuration, and advanced features, see [Logging](https:\u002F\u002Fdocs.flashinfer.ai\u002Flogging.html) in our documentation.\n\n## Custom Attention Variants\n\nUsers can customize their own attention variants with additional parameters. For more details, refer to our [JIT examples](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002Fblob\u002Fmain\u002Ftests\u002Futils\u002Ftest_jit_example.py).\n\n## CUDA Support\n\n**Supported CUDA Versions:** 12.6, 12.8, 13.0, 13.1\n\n> **Note:** FlashInfer strives to follow PyTorch's supported CUDA versions plus the latest CUDA release.\n\n## Adoption\n\nFlashInfer powers inference in:\n\n- [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang)\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM)\n- [TGI (Text Generation Inference)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference)\n- [MLC-LLM](https:\u002F\u002Fgithub.com\u002Fmlc-ai\u002Fmlc-llm)\n- [LightLLM](https:\u002F\u002Fgithub.com\u002FModelTC\u002Flightllm)\n- [lorax](https:\u002F\u002Fgithub.com\u002Fpredibase\u002Florax)\n- [ScaleLLM](https:\u002F\u002Fgithub.com\u002Fvectorch-ai\u002FScaleLLM)\n\n## Acknowledgement\n\nFlashInfer is inspired by [FlashAttention](https:\u002F\u002Fgithub.com\u002Fdao-AILab\u002Fflash-attention\u002F), [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), [stream-K](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.03598), [CUTLASS](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fcutlass), and 
[AITemplate](https:\u002F\u002Fgithub.com\u002Ffacebookincubator\u002FAITemplate).\n\n## Citation\n\nIf you find FlashInfer helpful in your project or research, please consider citing our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01005):\n\n```bibtex\n@article{ye2025flashinfer,\n    title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},\n    author = {\n      Ye, Zihao and\n      Chen, Lequn and\n      Lai, Ruihang and\n      Lin, Wuwei and\n      Zhang, Yineng and\n      Wang, Stephanie and\n      Chen, Tianqi and\n      Kasikci, Baris and\n      Grover, Vinod and\n      Krishnamurthy, Arvind and\n      Ceze, Luis\n    },\n    journal = {arXiv preprint arXiv:2501.01005},\n    year = {2025},\n    url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01005}\n}\n```\n","FlashInfer is a high-performance GPU kernel library designed for large language model (LLM) inference. It provides unified APIs for attention, GEMM, and MoE operations and supports multiple backend implementations, including FlashAttention-2\u002F3, cuDNN, CUTLASS, and TensorRT-LLM, to achieve the best performance across different hardware. The project emphasizes optimization for prefill, decode, and mixed-batching scenarios and supports low-precision compute (FP8\u002FFP4 quantization), making it suitable for applications that need high efficiency on modern GPU architectures. FlashInfer also offers production-ready features such as compatibility with CUDAGraph and torch.compile, which suits service deployments that require low latency.",2,"2026-06-11 03:36:09","high_star"]