[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-394":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":46,"readmeContent":47,"aiSummary":48,"trendingCount":16,"starSnapshotCount":16,"syncStatus":49,"lastSyncTime":50,"discoverSource":51},394,"vllm","vllm-project\u002Fvllm","vllm-project","A high-throughput and memory-efficient inference and serving engine for LLMs","https:\u002F\u002Fvllm.ai",null,"Python",83066,18122,567,2017,0,115,718,2805,521,120,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45],"amd","blackwell","cuda","deepseek","deepseek-v3","gpt","gpt-oss","inference","kimi","llama","llm","llm-serving","model-serving","moe","openai","pytorch","qwen","qwen3","tpu","transformer","2026-06-17 04:00:03","\u003C!-- markdownlint-disable MD001 MD041 -->\n\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fvllm-project\u002Fvllm\u002Fmain\u002Fdocs\u002Fassets\u002Flogos\u002Fvllm-logo-text-dark.png\">\n    \u003Cimg alt=\"vLLM\" src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fvllm-project\u002Fvllm\u002Fmain\u002Fdocs\u002Fassets\u002Flogos\u002Fvllm-logo-text-light.png\" width=55%>\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch3 align=\"center\">\nEasy, fast, and cheap LLM serving for everyone\n\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\n| \u003Ca 
href=\"https:\u002F\u002Fdocs.vllm.ai\">\u003Cb>Documentation\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fblog.vllm.ai\u002F\">\u003Cb>Blog\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06180\">\u003Cb>Paper\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fx.com\u002Fvllm_project\">\u003Cb>Twitter\u002FX\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fdiscuss.vllm.ai\">\u003Cb>User Forum\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fslack.vllm.ai\">\u003Cb>Developer Slack\u003C\u002Fb>\u003C\u002Fa> |\n\u003C\u002Fp>\n\n🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https:\u002F\u002Fvllm.ai) to learn more.\nFor events, please visit [vllm.ai\u002Fevents](https:\u002F\u002Fvllm.ai\u002Fevents) to join us.\n\n---\n\n## About\n\nvLLM is a fast and easy-to-use library for LLM inference and serving.\n\nOriginally developed in the [Sky Computing Lab](https:\u002F\u002Fsky.cs.berkeley.edu) at UC Berkeley, vLLM has grown into one of the most active open-source AI projects, built and maintained by a diverse community of more than 2000 contributors from many dozens of academic institutions and companies.\n\nvLLM is fast with:\n\n- State-of-the-art serving throughput\n- Efficient management of attention key and value memory with [**PagedAttention**](https:\u002F\u002Fblog.vllm.ai\u002F2023\u002F06\u002F20\u002Fvllm.html)\n- Continuous batching of incoming requests, chunked prefill, prefix caching\n- Fast and flexible model execution with piecewise and full CUDA\u002FHIP graphs\n- Quantization: FP8, MXFP8\u002FMXFP4, NVFP4, INT8, INT4, GPTQ\u002FAWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and [more](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Ffeatures\u002Fquantization\u002Findex.html)\n- Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton\n- Optimized GEMM\u002FMoE 
kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL\n- Speculative decoding including n-gram, suffix, EAGLE, DFlash\n- Automatic kernel generation and graph-level transformations using torch.compile\n- Disaggregated prefill, decode, and encode\n\nvLLM is flexible and easy to use with:\n\n- Seamless integration with popular Hugging Face models\n- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more\n- Tensor, pipeline, data, expert, and context parallelism for distributed inference\n- Streaming outputs\n- Generation of structured outputs using xgrammar or guidance\n- Tool calling and reasoning parsers\n- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support\n- Efficient multi-LoRA support for dense and MoE layers\n- Support for NVIDIA GPUs, AMD GPUs, and x86\u002FARM\u002FPowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.\n\nvLLM seamlessly supports 200+ model architectures on Hugging Face, including:\n\n- Decoder-only LLMs (e.g., Llama, Qwen, Gemma)\n- Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)\n- Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)\n- Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)\n- Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)\n- Reward and classification models (e.g., Qwen-Math)\n\nFind the full list of supported models [here](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fmodels\u002Fsupported_models.html).\n\n## Getting Started\n\nInstall vLLM with [`uv`](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) (recommended) or `pip`:\n\n```bash\nuv pip install vllm\n```\n\nOr [build from source](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fgetting_started\u002Finstallation\u002Fgpu\u002Findex.html#build-wheel-from-source) for development.\n\nVisit our 
[documentation](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002F) to learn more.\n\n- [Installation](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fgetting_started\u002Finstallation.html)\n- [Quickstart](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fgetting_started\u002Fquickstart.html)\n- [List of Supported Models](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fmodels\u002Fsupported_models.html)\n\n## Contributing\n\nWe welcome and value any contributions and collaborations.\nPlease check out [Contributing to vLLM](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fcontributing\u002Findex.html) for how to get involved.\n\n## Citation\n\nIf you use vLLM for your research, please cite our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06180):\n\n```bibtex\n@inproceedings{kwon2023efficient,\n  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},\n  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. 
Gonzalez and Hao Zhang and Ion Stoica},\n  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},\n  year={2023}\n}\n```\n\n## Contact Us\n\n\u003C!-- --8\u003C-- [start:contact-us] -->\n- For technical questions and feature requests, please use GitHub [Issues](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002Fissues)\n- For discussing with fellow users, please use the [vLLM Forum](https:\u002F\u002Fdiscuss.vllm.ai)\n- For coordinating contributions and development, please use [Slack](https:\u002F\u002Fslack.vllm.ai)\n- For security disclosures, please use GitHub's [Security Advisories](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002Fsecurity\u002Fadvisories) feature\n- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)\n\u003C!-- --8\u003C-- [end:contact-us] -->\n\n## Media Kit\n\n- If you wish to use vLLM's logo, please refer to [our media kit repo](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fmedia-kit)\n","vLLM is a high-performance, memory-efficient inference and serving engine for large language models. It uses PagedAttention to manage attention key and value memory efficiently, and supports continuous batching of requests, chunked prefill, and prefix caching to achieve state-of-the-art throughput. vLLM also offers fast, flexible model execution, supports multiple quantization methods to optimize performance, and integrates optimized attention kernels such as FlashAttention. The project suits LLM serving scenarios that must handle many concurrent requests efficiently, such as online chatbots and automatic text generation. Its easy integration with Hugging Face models and its OpenAI-compatible API make vLLM a strong choice for developers deploying large language models at scale.",2,"2026-06-17 02:35:22","top_all"]