[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73418":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":15,"starSnapshotCount":15,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},73418,"llm-d","llm-d\u002Fllm-d","Achieve state of the art inference performance with modern accelerators on Kubernetes","https:\u002F\u002Fwww.llm-d.ai",null,"Shell",3340,521,58,192,0,30,60,171,90,30.15,"Apache License 2.0",false,"main",true,[26,27,28,29,30,31,32,33,34],"ai","cncf","distributed-inference","gpu","inference","intelligent-routing","kubernetes","llm","model-server","2026-06-12 02:03:13","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\">\n    \u003Cimg alt=\"llm-d Logo\" src=\".\u002Fdocs\u002Fassets\u002Fimages\u002Fllm-d-logo.png\" width=37%>\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch2 align=\"center\">\nAchieve SOTA Inference Performance On Any Accelerator\n\u003C\u002Fh2>\n\n [![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-8A2BE2?logo=readthedocs&logoColor=white&color=1BC070)](https:\u002F\u002Fwww.llm-d.ai)\n [![Release Status](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVersion-0.7-yellow)](https:\u002F\u002Fgithub.com\u002Fllm-d\u002Fllm-d\u002Freleases)\n [![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](.\u002FLICENSE)\n [![Join 
Slack](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin_Slack-blue?logo=slack)](https:\u002F\u002Fllm-d.ai\u002Fslack)\n\nllm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. We help you achieve the fastest \"time to state-of-the-art (SOTA) performance\" for key OSS large language models across most hardware accelerators and infrastructure providers with well-tested guides and real-world benchmarks.\n\nllm-d is a [Cloud Native Computing Foundation (CNCF)](https:\u002F\u002Fwww.cncf.io\u002F) sandbox project, founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.\n\n## What does llm-d offer to production inference?\n\nModel servers like [vLLM](https:\u002F\u002Fdocs.vllm.ai) and [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) handle efficiently running large language models on accelerators. llm-d provides state-of-the-art orchestration and optimizations above model servers to serve high-scale real-world traffic efficiently and reliably. 
Our offerings are organized into five core themes:\n\n* **[Intelligent Routing:](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fguides#intelligent-routing)** Maximize performance with prefix-cache and load-aware balancing, including experimental predicted latency-based scheduling to decrease latency and increase throughput.\n* **[Advanced KV-Cache Management:](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fguides#advanced-kv-cache-management)** Increase the effective \"working set size\" for multi-turn requests with tiered offloading to CPU or disk and precise global indexing of the KV cache state.\n* **[Serving Large Models:](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fguides#serving-large-models)** Optimize massive models (e.g., DeepSeek-R1, GPT-OSS) using prefill\u002Fdecode disaggregation and wide expert-parallelism over fast accelerator interconnects.\n* **[Operational Excellence:](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fguides#operational-excellence)** Ensure production stability with intelligent flow control for multi-tenant serving and proactive, SLO-aware autoscaling based on real-time inference signals.\n* **[Batch Processing:](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fguides#experimental)** Efficiently manage large-scale offline inference with OpenAI-compatible Batch APIs and asynchronous processing to maximize hardware utilization.\n\nFor a complete list of tested recipes and architectural patterns, see our [well-lit path guides](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fguides). These guides provide benchmarked recipes and Helm charts to start serving quickly with best practices common to production deployments. 
Our intent is to eliminate the heavy lifting common in tuning and deploying generative AI inference on modern accelerators.\n\n## Performance Highlights\n\nValidated performance gains from production deployments and partner benchmarks:\n\n- **3x higher output throughput** and **2x faster TTFT** with prefix-cache-aware routing vs round-robin — Llama 3.1 70B on 4× AMD MI300X, Tesla \u002F Red Hat ([blog](https:\u002F\u002Fllm-d.ai\u002Fblog\u002Fproduction-grade-llm-inference-at-scale-kserve-llm-d-vllm))\n- **40% reduction in TTFT and ITL** with predicted-latency scheduling vs heuristics on NVIDIA GPUs, Google ([blog](https:\u002F\u002Fllm-d.ai\u002Fblog\u002Fpredicted-latency-based-scheduling-for-llms))\n- **Up to 70% higher tokens\u002Fsec** with prefill\u002Fdecode disaggregation vs standard vLLM — GPT-OSS on NVIDIA B200 (p6-b200), AWS ([blog](https:\u002F\u002Faws.amazon.com\u002Fblogs\u002Fmachine-learning\u002Fintroducing-disaggregated-inference-on-aws-powered-by-llm-d\u002F))\n- **10–30% throughput improvement** with disaggregated serving on identical infrastructure — GPT-OSS-120B and Llama 3.3 70B on AMD MI300X, Oracle ([blog](https:\u002F\u002Fblogs.oracle.com\u002Fai-and-datascience\u002Fllm-inference-at-scale-with-llm-d-on-oci))\n- **50k tokens\u002Fsec** cluster throughput with Wide Expert-Parallelism — 16×16 NVIDIA B200, ~3.1k tok\u002Fs per GPU ([blog](https:\u002F\u002Fllm-d.ai\u002Fblog\u002Fllm-d-v0.5-sustaining-performance-at-scale))\n- **13.9x throughput improvement** with hierarchical KV offloading at 250 concurrent users vs GPU-only — 4× NVIDIA H100 ([blog](https:\u002F\u002Fllm-d.ai\u002Fblog\u002Fllm-d-v0.5-sustaining-performance-at-scale))\n\nExplore detailed, reproducible benchmarks on [Prism](https:\u002F\u002Fprism.llm-d.ai).\n\n## Get Started Now\n\nReady to achieve SOTA performance? 
Follow our [Quickstart Guide](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fgetting-started\u002Fquickstart) to deploy your first optimized inference service on Kubernetes. You'll learn how to set up the llm-d stack, configure the intelligent router, and validate performance with production-ready benchmarks.\n\n> [!TIP]\n> Most users begin with our [Optimized Baseline](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Fguides\u002Foptimized-baseline), which provides a high-performance foundation for a wide range of LLM serving use cases.\n\n## Latest News 🔥\n\n- [2026-05] The v0.7 release renames and stabilizes the optimized baseline, migrates the guides to be kustomize-first, expands nightly CI (OpenShift, GKE, CoreWeave), brings predicted-latency scheduling to GA, adds an experimental batch gateway, and revamps the project-wide documentation.\n- [2026-03] llm-d [joins the CNCF as a Sandbox project](https:\u002F\u002Fwww.cncf.io\u002Fblog\u002F2026\u002F03\u002F24\u002Fwelcome-llm-d-to-the-cncf-evolving-kubernetes-into-sota-ai-infrastructure\u002F)! Founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA, with support from AMD, Cisco, Hugging Face, Intel, Lambda, Mistral AI, UC Berkeley, and University of Chicago. 
We're excited to collaborate openly on building flexible, future-proof AI infrastructure.\n- [2026-02] The [v0.5](https:\u002F\u002Fllm-d.ai\u002Fblog\u002Fllm-d-v0.5-sustaining-performance-at-scale) release introduces reproducible benchmark workflows, hierarchical KV offloading, cache-aware LoRA routing, active-active HA, UCCL-based transport resilience, and scale-to-zero autoscaling; validated ~3.1k tok\u002Fs per B200 decode GPU (wide-EP) and up to 50k output tok\u002Fs on a 16×16 B200 prefill\u002Fdecode topology with order-of-magnitude TTFT reduction vs round-robin baseline.\n- [2025-12] The [v0.4](https:\u002F\u002Fllm-d.ai\u002Fblog\u002Fllm-d-v0.4-achieve-sota-inference-across-accelerators) release demonstrates 40% reduction in per output token latency for DeepSeek V3.1 on H200 GPUs, Intel XPU and Google TPU disaggregation support for lower time to first token, a new well-lit path for prefix cache offload to vLLM-native CPU memory tiering, and a preview of the workload variant autoscaler improving model-as-a-service efficiency.\n\n\n\u003C!-- Previous News  -->\n\u003C!-- - [2025-08] Read more about the [optimized-baseline](https:\u002F\u002Fllm-d.ai\u002Fblog\u002Fintelligent-optimized-baseline-with-llm-d), including a deep dive on how different balancing techniques are composed to improve throughput without overloading replicas. -->\n\n## 🧱 Architecture\n\nllm-d accelerates distributed inference by integrating industry-standard open technologies like vLLM and Kubernetes. For more details, see our full [Architecture Documentation](https:\u002F\u002Fllm-d.ai\u002Fdocs\u002Farchitecture).\n\n\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\">\n    \u003Cimg alt=\"llm-d Arch\" src=\".\u002Fdocs\u002Fassets\u002Fimages\u002Fllm-d-arch.svg\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\n## 📦 Releases\n\nOur [guides](.\u002Fguides\u002FREADME.md) are living docs and kept current. 
For details about the Helm charts and component releases, visit our [GitHub Releases page](https:\u002F\u002Fgithub.com\u002Fllm-d\u002Fllm-d\u002Freleases) to review release notes.\n\nSee the [accelerator docs](.\u002Fdocs\u002Faccelerators\u002FREADME.md) for points of contact and more details about the accelerators, networks, and configurations tested.\n\n## Contribute\n\nWe adhere to the [CNCF Code of Conduct](https:\u002F\u002Fgithub.com\u002Fcncf\u002Ffoundation\u002Fblob\u002Fmain\u002Fcode-of-conduct.md).\n\n- See [our project overview](PROJECT.md) for more details on our development process and governance.\n- Review [our contributing guidelines](CONTRIBUTING.md) for detailed information on how to contribute to the project.\n- Join one of our [Special Interest Groups (SIGs)](SIGS.md) to contribute to specific areas of the project and collaborate with domain experts.\n- We use Slack to discuss development across organizations. Please join: [Slack](https:\u002F\u002Fllm-d.ai\u002Fslack)\n- We host a bi-weekly standup for contributors every other Wednesday at 12:30 PM ET, as well as meetings for various SIGs. You can find them in the [shared llm-d calendar](https:\u002F\u002Fred.ht\u002Fllm-d-public-calendar).\n- We use Google Groups to share architecture diagrams and other content. Please join: [Google Group](https:\u002F\u002Fgroups.google.com\u002Fg\u002Fllm-d-contributors)\n\n## License\n\nThis project is licensed under Apache License 2.0. See the [LICENSE file](LICENSE) for details.\n","llm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. Through intelligent routing, advanced KV-cache management, large-model serving, and operational excellence, it helps users achieve the best inference performance for key open-source large language models across a variety of hardware accelerators and infrastructure providers. Its core capabilities include predicted-latency-based scheduling to reduce latency and increase throughput, tiered offloading to CPU or disk to increase the effective working set size for multi-turn requests, prefill\u002Fdecode disaggregation and wide expert parallelism to optimize serving of very large models, and intelligent flow control and SLO-aware autoscaling based on real-time inference signals. The project suits large-scale online and offline inference scenarios that require high efficiency and high reliability.",2,"2026-06-11 03:45:26","high_star"]