[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-5603":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},5603,"dynamo","ai-dynamo\u002Fdynamo","ai-dynamo","A Datacenter Scale Distributed Inference Serving Framework","https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Flatest",null,"Rust",7241,1232,71,164,0,7,53,467,34,40.27,"Other",false,"main",true,[],"2026-06-12 02:01:12","\u003C!--\nSPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. 
All rights reserved.\nSPDX-License-Identifier: Apache-2.0\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\nhttp:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n-->\n\n![Dynamo banner](.\u002Fdocs\u002Fassets\u002Fimg\u002Fdynamo-frontpage-banner.png)\n\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![GitHub Release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fai-dynamo\u002Fdynamo)](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fdynamo\u002Freleases\u002Flatest)\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fai-dynamo)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fai-dynamo\u002F)\n[![Ask DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002Fai-dynamo\u002Fdynamo)\n[![Discord](https:\u002F\u002Fdcbadge.limes.pink\u002Fapi\u002Fserver\u002FD92uqZRjCZ?style=flat)](https:\u002F\u002Fdiscord.gg\u002FD92uqZRjCZ)\n![Community Contributors](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcommunity_contributors-70%2B-brightgreen)\n\n| **[Docs](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002F)** | **[Roadmap](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fdynamo\u002Fissues\u002F5506)** | **[Recipes](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fdynamo\u002Ftree\u002Fmain\u002Frecipes)** | **[Examples](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fdynamo\u002Ftree\u002Fmain\u002Fexamples)** | **[Prebuilt 
Containers](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fteams\u002Fai-dynamo\u002Fcollections\u002Fai-dynamo)** | **[Digest](docs\u002Fdigest\u002Findex.mdx)** | **[Design Proposals](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fenhancements)** | **[How to Contribute](#community-and-contributing)** |\n\n# Dynamo\n\n\u003C!-- TEMPORARY BANNER: remove once V4 recipes mature. -->\n> [!NOTE]\n> **Day-0 DeepSeek-V4 recipes available.** Tested Kubernetes deployment paths for [DeepSeek-V4-Pro](recipes\u002Fdeepseek-v4\u002Fdeepseek-v4-pro\u002F) and [DeepSeek-V4-Flash](recipes\u002Fdeepseek-v4\u002Fdeepseek-v4-flash\u002F) are merged to main on both **vLLM** and **SGLang**, with a prebuilt SGLang container image published on NGC.\n\n**The open-source, datacenter-scale inference stack.** Dynamo is the orchestration layer above inference engines — it doesn't replace SGLang, TensorRT-LLM, or vLLM, it turns them into a coordinated multi-node inference system. Disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling work together to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads.\n\nBuilt in Rust for performance, Python for extensibility.\n\n## When to use Dynamo\n\n- You're serving LLMs across **multiple GPUs or nodes** and need to coordinate them\n- You want **KV-aware routing** to avoid redundant prefill computation\n- You need to **independently scale prefill and decode** (disaggregated serving)\n- You want **automatic scaling** that meets latency SLAs at minimum total cost of ownership (TCO)\n- You need **fast cold-starts** when spinning up new replicas\n\nIf you're running a single model on a single GPU, your inference engine alone is probably sufficient.\n\n**Feature support at a glance:**\n\n| | [SGLang](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fbackends\u002Fsg-lang) | 
[TensorRT-LLM](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fbackends\u002Ftensor-rt-llm) | [vLLM](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fbackends\u002Fv-llm) |\n|---|:----:|:----------:|:--:|\n| [**Disaggregated Serving**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fdesign-docs\u002Fdisaggregated-serving) | ✅ | ✅ | ✅ |\n| [**KV-Aware Routing**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fcomponents\u002Frouter) | ✅ | ✅ | ✅ |\n| [**SLA-Based Planner**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fcomponents\u002Fplanner\u002Fplanner-guide) | ✅ | ✅ | ✅ |\n| [**KVBM**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fcomponents\u002Fkvbm) | 🚧 | ✅ | ✅ |\n| [**Multimodal**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fuser-guides\u002Fmultimodal) | ✅ | ✅ | ✅ |\n| [**Tool Calling**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fuser-guides\u002Ftool-calling) | ✅ | ✅ | ✅ |\n\n> **[Full Feature Matrix →](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fresources\u002Ffeature-matrix)** — LoRA, request migration, speculative decoding, and feature interactions.\n\n## Key Results\n\n| Result | Context |\n|--------|---------|\n| **7x** higher throughput per GPU | DeepSeek R1 on GB200 NVL72 w\u002F Dynamo vs B200 without ([InferenceX](https:\u002F\u002Finferencex.semianalysis.com\u002F)) |\n| **7x** faster model startup | ModelExpress weight streaming (DeepSeek-V3 on H200) |\n| **2x** faster time to first token | KV-aware routing, Qwen3-Coder 480B ([Baseten benchmark](https:\u002F\u002Fwww.baseten.co\u002Fblog\u002Fhow-baseten-achieved-2x-faster-inference-with-nvidia-dynamo\u002F)) |\n| **80%** fewer SLA breaches | Planner autoscaling at 5% lower TCO ([Alibaba APSARA 2025 @ 2:50:00](https:\u002F\u002Fyunqi.aliyun.com\u002F2025\u002Fsession?agendaId=6062)) |\n| **750x** higher throughput | DeepSeek-R1 on GB300 NVL72 ([InferenceXv2](https:\u002F\u002Finferencex.semianalysis.com\u002F)) |\n\n\n## What Dynamo Does\n\nMost 
inference engines optimize a single GPU or a single node. Dynamo is the **orchestration layer above them** — it turns a cluster of GPUs into a coordinated inference system.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fdocs\u002Fassets\u002Fimg\u002Fdynamo-readme-overview.svg\" alt=\"Dynamo architecture overview\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\n**[Architecture Deep Dive →](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fdesign-docs\u002Foverall-architecture)**\n\n### Core Capabilities\n\n| Capability | What it does | Why it matters |\n|------------|-------------|----------------|\n| [**Disaggregated Prefill\u002FDecode**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fdesign-docs\u002Fdisaggregated-serving) | Separates prefill and decode into independently scalable GPU pools | Maximizes GPU utilization; each phase runs on hardware tuned for its workload |\n| [**KV-Aware Routing**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fcomponents\u002Frouter) | Routes requests based on worker load and KV cache overlap | Eliminates redundant prefill computation — 2x faster TTFT |\n| [**KV Block Manager (KVBM)**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fcomponents\u002Fkvbm) | Offloads KV cache across GPU → CPU → SSD → remote storage | Extends effective context length beyond GPU memory |\n| [**ModelExpress**](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fmodelexpress) | Streams model weights GPU-to-GPU via NIXL\u002FNVLink | 7x faster cold-start for new replicas |\n| [**Planner**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fcomponents\u002Fplanner\u002Fplanner-guide) | SLA-driven autoscaler that profiles workloads and right-sizes pools | Meets latency targets at minimum total cost of ownership (TCO) |\n| [**Grove**](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fgrove) | K8s operator for topology-aware gang scheduling (NVL72) | Places workloads optimally across racks, hosts, and NUMA nodes |\n| 
[**AIConfigurator**](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Faiconfigurator) | Simulates 10K+ deployment configs in seconds | Finds optimal serving config without burning GPU-hours |\n| [**Fault Tolerance**](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fuser-guides\u002Ffault-tolerance\u002Frequest-migration) | Canary health checks + in-flight request migration | Workers fail; user requests don't |\n\n### New in 1.0\n\n- **Zero-config deploy ([DGDR](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fkubernetes-deployment\u002Fdeployment-guide\u002Fdeploying-your-first-model))** *(beta):* Specify model, HW, and SLA in one YAML — AIConfigurator auto-profiles the workload, Planner optimizes the topology, and Dynamo deploys\n- **Agentic inference:** Per-request hints for latency priority, expected output length, and cache pinning TTL. [LangChain](https:\u002F\u002Fdocs.langchain.com\u002Foss\u002Fpython\u002Fintegrations\u002Fchat\u002Fnvidia_ai_endpoints#use-with-nvidia-dynamo) + [NeMo Agent Toolkit](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Agent-Toolkit) integrations\n- **Multimodal E\u002FP\u002FD:** Disaggregated encode\u002Fprefill\u002Fdecode with embedding cache — 30% faster TTFT on image workloads\n- **Video generation:** Native [FastVideo](https:\u002F\u002Fgithub.com\u002Fhao-ai-lab\u002FFastVideo) + [SGLang Diffusion](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-02-16-sglang-diffusion-advanced-optimizations\u002F) support — real-time 1080p on single B200\n- **K8s Inference Gateway plugin:** KV-aware routing inside the standard Kubernetes gateway\n- **Storage-tier KV offload:** S3\u002FAzure blob support + global KV events for cluster-wide cache visibility\n\n## Quick Start\n\n### Option A: Container (fastest)\n\n```bash\n# Pull a prebuilt container (SGLang example)\ndocker run --gpus all --network host --rm -it nvcr.io\u002Fnvidia\u002Fai-dynamo\u002Fsglang-runtime:1.1.0\n\n# Inside the container — start frontend and 
worker\npython3 -m dynamo.frontend --http-port 8000 --discovery-backend file > \u002Fdev\u002Fnull 2>&1 &\npython3 -m dynamo.sglang --model-path Qwen\u002FQwen3-0.6B --discovery-backend file &\n\n# Send a request\ncurl -s localhost:8000\u002Fv1\u002Fchat\u002Fcompletions -H \"Content-Type: application\u002Fjson\" -d '{\n  \"model\": \"Qwen\u002FQwen3-0.6B\",\n  \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n  \"max_tokens\": 100\n}' | jq\n```\n\nAlso available: [`tensorrtllm-runtime:1.1.0`](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fresources\u002Frelease-artifacts) and [`vllm-runtime:1.1.0`](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fresources\u002Frelease-artifacts).\n\n### Option B: Install from PyPI\n\nInstall [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) (`curl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh`), then:\n\n```bash\nuv pip install --prerelease=allow \"ai-dynamo[sglang]\"   # or [vllm]\n```\n\n> **Note:** TensorRT-LLM requires `pip` with `--extra-index-url https:\u002F\u002Fpypi.nvidia.com`. See the [install guide](docs\u002Fgetting-started\u002Flocal-installation.md) for TRT-LLM-specific instructions.\n\nThen start the frontend and a worker as shown above. 
See the [full installation guide](docs\u002Fgetting-started\u002Flocal-installation.md) for system dependencies and backend-specific notes.\n\n### Option C: Kubernetes (recommended)\n\nFor production multi-node clusters, install the [Dynamo Platform](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fkubernetes-deployment\u002Fdeployment-guide) and deploy with a single manifest:\n\n```yaml\n# Zero-config deploy: specify model + SLA, Dynamo handles the rest\napiVersion: nvidia.com\u002Fv1beta1\nkind: DynamoGraphDeploymentRequest\nmetadata:\n  name: my-model\nspec:\n  model: Qwen\u002FQwen3-0.6B\n  backend: vllm\n  sla:\n    ttft: 200.0   # ms\n    itl: 20.0     # ms\n  autoApply: true\n```\n\nPre-built recipes for common models:\n\n| Model | Framework | Mode | Recipe |\n|-------|-----------|------|--------|\n| Llama-3-70B | vLLM | Aggregated | [View](recipes\u002Fllama-3-70b\u002Fvllm\u002F) |\n| DeepSeek-R1 | SGLang | Disaggregated | [View](recipes\u002Fdeepseek-r1\u002Fsglang\u002F) |\n| Qwen3-32B-FP8 | TensorRT-LLM | Aggregated | [View](recipes\u002Fqwen3-32b-fp8\u002Ftrtllm\u002F) |\n\nSee [recipes\u002F](recipes\u002FREADME.md) for the full list. Cloud-specific guides: [AWS EKS](examples\u002Fdeployments\u002FEKS\u002F) · [Google GKE](examples\u002Fdeployments\u002FGKE\u002F)\n\n## Building from Source\n\nFor contributors who want to build and develop locally. 
See the [full build guide](docs\u002Fgetting-started\u002Fbuilding-from-source.md) for details.\n\n```bash\n# Install system deps (Ubuntu 24.04)\nsudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake\n\n# Install Rust\ncurl --proto '=https' --tlsv1.2 -sSf https:\u002F\u002Fsh.rustup.rs | sh && source $HOME\u002F.cargo\u002Fenv\n\n# Create venv and build\nuv venv dynamo && source dynamo\u002Fbin\u002Factivate\nuv pip install pip maturin\ncd lib\u002Fbindings\u002Fpython && maturin develop --uv && cd $PROJECT_ROOT\nuv pip install -e lib\u002Fgpu_memory_service\nuv pip install -e .\n```\n\n> VSCode\u002FCursor users: see the [`.devcontainer`](.devcontainer\u002FREADME.md) for a pre-configured dev environment.\n\n## Community and Contributing\n\nDynamo is built in the open with an OSS-first development model. We welcome contributions of all kinds.\n\n- **[Contribution Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fgetting-started\u002Fcontribution-guide)** — How to contribute code, docs, and recipes\n- **[Design Proposals](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fenhancements)** — RFCs for major features\n- **[Office Hours](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)** — Biweekly calls\n- **[Community Meetings](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1uR8xD_hlYGwV6QspvSc36k1H-wo1BUcVmFbHH9xlXd8\u002Fview)** – Weekly (Fri 2:30 PM PT) development community meetings\n- **[Discord](https:\u002F\u002Fdiscord.gg\u002FD92uqZRjCZ)** — Chat with the team and community\n- **[Dynamo Day Recordings](https:\u002F\u002Fnvevents.nvidia.com\u002Fdynamoday)** — Deep dives from production users\n\n## Latest News\n\n- [03\u002F15] [Dynamo 1.0 is here — production-ready with strong community 
adoption](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fintroducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models\u002F)\n- [03\u002F15] [NVIDIA Blackwell Ultra sets new inference records in MLPerf](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fnvidia-blackwell-ultra-sets-new-inference-records-in-mlperf-debut\u002F)\n- [03\u002F15] [NVIDIA Blackwell leads on SemiAnalysis InferenceMax benchmarks](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fnvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks\u002F)\n- [12\u002F05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https:\u002F\u002Fquantumzeitgeist.com\u002Fkimi-k2-nvidia-ai-ai-breakthrough\u002F)\n- [12\u002F02] [Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo](https:\u002F\u002Fwww.marktechpost.com\u002F2025\u002F12\u002F02\u002Fnvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems\u002F)\n- [11\u002F20] [Dell integrates PowerScale with NIXL for 19x faster TTFT](https:\u002F\u002Fwww.dell.com\u002Fen-us\u002Fdt\u002Fcorporate\u002Fnewsroom\u002Fannouncements\u002Fdetailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)\n\n\u003Cdetails>\n\u003Csummary>Older news\u003C\u002Fsummary>\n\nDynamo provides comprehensive benchmarking tools:\n\n- **[Benchmarking Guide](docs\u002Fbenchmarks\u002Fbenchmarking.md)** – Compare deployment topologies using AIPerf\n- **[SLA-Driven Deployments](docs\u002Fcomponents\u002Fplanner\u002Fplanner-guide.md)** – Optimize deployments to meet SLA requirements\n\n## Frontend OpenAPI Specification\n\nThe OpenAI-compatible frontend exposes an OpenAPI 3 spec at `\u002Fopenapi.json`. 
To generate without running the server:\n\n```bash\ncargo run -p dynamo-llm --bin generate-frontend-openapi\n```\n\nThis writes to `docs\u002Freference\u002Fapi\u002Fopenapi.json`.\n\n## Service Discovery and Messaging\n\nDynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs\u002Fkubernetes\u002Fservice-discovery.md)) handle service discovery. External services are optional for most deployments:\n\n| Deployment | etcd | NATS | Notes |\n|------------|------|------|-------|\n| **Local Development** | ❌ Not required | ❌ Not required | Pass `--discovery-backend file`; vLLM also needs `--kv-events-config '{\"enable_kv_cache_events\": false}'` |\n| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |\n\n> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.\n\nFor Slurm or other distributed deployments (and KV-aware routing):\n\n- [etcd](https:\u002F\u002Fetcd.io\u002F) can be run directly as `.\u002Fetcd`.\n- [nats](https:\u002F\u002Fnats.io\u002F) needs JetStream enabled: `nats-server -js`.\n\nTo quickly setup both: `docker compose -f deploy\u002Fdocker-compose.yml up -d`\n\n## More News\n\n- [11\u002F20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https:\u002F\u002Fwww.dell.com\u002Fen-us\u002Fdt\u002Fcorporate\u002Fnewsroom\u002Fannouncements\u002Fdetailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)\n- [11\u002F20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https:\u002F\u002Fsiliconangle.com\u002F2025\u002F11\u002F20\u002Fnvidia-weka-kv-cache-solution-ai-inferencing-sc25\u002F)\n- [11\u002F13] [Dynamo Office Hours Playlist](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)\n- [10\u002F16] [How Baseten achieved 2x faster inference with NVIDIA 
Dynamo](https:\u002F\u002Fwww.baseten.co\u002Fblog\u002Fhow-baseten-achieved-2x-faster-inference-with-nvidia-dynamo\u002F)\n- [12\u002F01] [InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https:\u002F\u002Fwww.infoq.com\u002Fnews\u002F2025\u002F12\u002Fnvidia-dynamo-kubernetes\u002F)\n\n\u003C\u002Fdetails>\n\n## Reference\n\n- **[Support Matrix](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fresources\u002Fsupport-matrix)** — Hardware, OS, CUDA, and backend versions\n- **[Feature Matrix](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fresources\u002Ffeature-matrix)** — Detailed backend compatibility\n- **[Release Artifacts](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fresources\u002Frelease-artifacts)** — Containers, wheels, Helm charts\n- **[Service Discovery](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fkubernetes-deployment\u002Fdeployment-guide\u002Fservice-discovery)** — K8s-native vs etcd vs file-based discovery\n- **[Benchmarking Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fdynamo\u002Fuser-guides\u002Fbenchmarking)** — Compare deployment topologies with AIPerf\n\n\u003C!-- Reference links for Feature Compatibility Matrix -->\n[disagg]: docs\u002Fdesign-docs\u002Fdisagg-serving.md\n[kv-routing]: docs\u002Fcomponents\u002Frouter\u002FREADME.md\n[planner]: docs\u002Fcomponents\u002Fplanner\u002Fplanner-guide.md\n[kvbm]: docs\u002Fcomponents\u002Fkvbm\u002FREADME.md\n[migration]: docs\u002Ffault-tolerance\u002Frequest-migration.md\n[lora]: examples\u002Fbackends\u002Fvllm\u002Fdeploy\u002Flora\u002FREADME.md\n[tools]: docs\u002Fagents\u002Ftool-calling.md\n","Dynamo is a datacenter-scale distributed inference serving framework designed to schedule compute resources efficiently for large language model (LLM), reasoning, multimodal, and video generation workloads. The project is built in Rust for performance and uses Python for extensibility. Its core features include disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling, which together raise throughput and reduce latency. Dynamo suits deployments that serve large language models across multiple GPUs or nodes, especially when users want KV-aware routing to avoid redundant prefill computation or need to scale the prefill and decode phases independently.",2,"2026-06-11 03:04:19","top_language"]