[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72500":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":14,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":24,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},72500,"production-stack","vllm-project\u002Fproduction-stack","vllm-project","vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization","https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Fproduction-stack",null,"Python",2392,419,27,103,0,9,20,58,93.67,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:06","# vLLM Production Stack: reference stack for production vLLM deployment\n\n| [**Blog**](https:\u002F\u002Flmcache.github.io) | [**Docs**](https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Fproduction-stack) | [**Production-Stack Slack Channel**](https:\u002F\u002Fcommunityinviter.com\u002Fapps\u002Fvllm-dev\u002Fjoin-vllm-developers-slack) | [**LMCache Slack**](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Flmcacheworkspace\u002Fshared_invite\u002Fzt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ) | [**Interest Form**](https:\u002F\u002Fforms.gle\u002FmQfQDUXbKfp2St1z7) |\n\n## Latest News\n\n- 📄 [Official documentation](https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Fproduction-stack) released for production-stack!\n- ✨ [Cloud Deployment Tutorials](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials) for Lambda Labs, AWS EKS, Google GCP are 
out!\n- 🛤️ 2026 roadmap is released! [Join the discussion now](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fissues\u002F855)!\n- 🔥 vLLM Production Stack is released! Check out our [release blogs](https:\u002F\u002Fblog.lmcache.ai\u002F2025-01-21-stack-release) posted on January 22, 2025.\n\n## Community Events\n\nWe host **bi-weekly** community meetings at the following time slot:\n\n- Every other Tuesday at 5:30 PM PT – [Add to Calendar](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1D4SqQiqzdSx_xsEwS0QTd592zd3Xourh)\n\nAll are welcome to join!\n\n## Introduction\n\nThe **vLLM Production Stack** project provides a reference implementation of how to build an inference stack on top of vLLM, which allows you to:\n\n- 🚀 Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code\n- 💻 Monitor the metrics through a web dashboard\n- 😄 Enjoy the performance benefits brought by request routing and KV cache offloading\n\n## Step-By-Step Tutorials\n\n0. How to [*Install Kubernetes (kubectl, helm, minikube, etc.)*](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials\u002F00-install-kubernetes-env.md)?\n1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs, Azure)*](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials\u002Fcloud_deployments)?\n2. How to [*Set up a Minimal vLLM Production Stack*](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials\u002F01-minimal-helm-installation.md)?\n3. How to [*Customize vLLM Configs (optional)*](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials\u002F02-basic-vllm-config.md)?\n4. 
How to [*Load Your LLM Weights*](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials\u002F03-load-model-from-pv.md)?\n5. How to [*Launch Different LLMs in vLLM Production Stack*](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials\u002F04-launch-multiple-model.md)?\n6. How to [*Enable KV Cache Offloading with LMCache*](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Ftutorials\u002F05-offload-kv-cache.md)?\n\n## Architecture\n\nThe stack is set up using [Helm](https:\u002F\u002Fhelm.sh\u002Fdocs\u002F), and contains the following key parts:\n\n- **Serving engine**: The vLLM engines that run different LLMs.\n- **Request router**: Directs requests to appropriate backends based on routing keys or session IDs to maximize KV cache reuse.\n- **Observability stack**: monitors the metrics of the backends through [Prometheus](https:\u002F\u002Fgithub.com\u002Fprometheus\u002Fprometheus) + [Grafana](https:\u002F\u002Fgrafana.com\u002F)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F8f05e7b9-0513-40a9-9ba9-2d3acca77c0c\" alt=\"Architecture of the stack\" width=\"80%\"\u002F>\n\u003C\u002Fp>\n\n## Roadmap\n\nWe are actively working on this project and will release the following features soon. Please stay tuned!\n\n- **Autoscaling** based on vLLM-specific metrics\n- Support for **disaggregated prefill**\n- **Router improvements** (e.g., more performant router using non-python languages, KV-cache-aware routing algorithm, better fault tolerance, etc)\n\n## Deploying the stack via Helm\n\n### Prerequisites\n\n- A running Kubernetes (K8s) environment with GPUs\n  - Run `cd utils && bash install-minikube-cluster.sh`\n  - Or follow our [tutorial](tutorials\u002F00-install-kubernetes-env.md)\n\n### Deployment\n\nvLLM Production Stack can be deployed via helm charts. 
Clone the repository locally and run the following commands for a minimal deployment:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack.git\ncd production-stack\u002F\nhelm repo add vllm https:\u002F\u002Fvllm-project.github.io\u002Fproduction-stack\nhelm install vllm vllm\u002Fvllm-stack -f tutorials\u002Fassets\u002Fvalues-01-minimal-example.yaml\n```\n\nThe deployed stack provides the same [**OpenAI API interface**](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fserving\u002Fopenai_compatible_server.html?ref=blog.mozilla.ai#openai-compatible-server) as vLLM, and can be accessed through a Kubernetes service.\n\nTo validate the installation and send a query to the stack, refer to [this tutorial](tutorials\u002F01-minimal-helm-installation.md).\n\nFor more information about customizing the Helm chart, please refer to [values.yaml](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Fblob\u002Fmain\u002Fhelm\u002Fvalues.yaml) and our other [tutorials](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fproduction-stack\u002Ftree\u002Fmain\u002Ftutorials).\n\n### Uninstall\n\n```bash\nhelm uninstall vllm\n```\n\n## Grafana Dashboard\n\n### Features\n\nThe Grafana dashboard provides the following insights:\n\n1. **Available vLLM Instances**: Displays the number of healthy instances.\n2. **Request Latency Distribution**: Visualizes end-to-end request latency.\n3. **Time-to-First-Token (TTFT) Distribution**: Monitors response times for token generation.\n4. **Number of Running Requests**: Tracks the number of active requests per instance.\n5. **Number of Pending Requests**: Tracks requests waiting to be processed.\n6. **GPU KV Usage Percent**: Monitors GPU KV cache usage.\n7. 
**GPU KV Cache Hit Rate**: Displays the hit rate for the GPU KV cache.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F05766673-c449-4094-bdc8-dea6ac28cb79\" alt=\"Grafana dashboard to monitor the deployment\" width=\"80%\"\u002F>\n\u003C\u002Fp>\n\n### Configuration\n\nSee the details in [`helm\u002FREADME.md`](.\u002Fhelm\u002FREADME.md#Observability)\n\n## Router\n\nThe router ensures efficient request distribution among backends. It supports:\n\n- Routing to endpoints that run different models\n- Exporting observability metrics for each serving engine instance, including QPS, time-to-first-token (TTFT), number of pending\u002Frunning\u002Ffinished requests, and uptime\n- Automatic service discovery and fault tolerance via the Kubernetes API\n- Model aliases\n- Multiple routing algorithms:\n  - Round-robin routing\n  - Session-ID based routing\n  - Prefix-aware routing (WIP)\n\nPlease refer to the [router documentation](.\u002Fsrc\u002Fvllm_router\u002FREADME.md) for more details.\n\n## Contributing\n\nWe welcome and value any contributions and collaborations. Please check out [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.\n\n## License\n\nThis project is licensed under Apache License 2.0. 
See the `LICENSE` file for details.\n\n## Sponsors\n\nWe are grateful to our sponsors who support our development and benchmarking efforts:\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgmicloud.ai\">\n    \u003Cimg src=\"https:\u002F\u002Fcdn.prod.website-files.com\u002F6683d8c52e4e62685a8d90cf\u002F67a0a0064683945b0cf77f25_GMI%20Cloud%20Logo_Black.svg\" alt=\"GMI Cloud Logo\" width=\"200\"\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\nFor any issues or questions, feel free to open an issue or contact us ([@ruizhang0101](https:\u002F\u002Fgithub.com\u002Fruizhang0101), [@ApostaC](https:\u002F\u002Fgithub.com\u002FApostaC), [@YuhanLiu11](https:\u002F\u002Fgithub.com\u002FYuhanLiu11), [@Shaoting-Feng](https:\u002F\u002Fgithub.com\u002FShaoting-Feng)).\n","vLLM Production Stack is a reference system for Kubernetes-native cluster deployment that aims to support large language model (LLM) applications through community-driven performance optimization. Written in Python, the project provides a way to scale seamlessly from a single vLLM instance to a distributed deployment and supports monitoring key metrics through a web dashboard. It also introduces request routing and KV cache offloading to improve overall performance. It suits enterprises and developers who need to manage and deploy large language models efficiently on cloud platforms, particularly those looking to simplify deployment and improve resource utilization on mainstream clouds such as AWS and GCP.",2,"2026-06-11 03:42:19","high_star"]