[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75815":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":14,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},75815,"llm-d-router","llm-d\u002Fllm-d-router","llm-d","llm-d Router: The intelligent entry point for inference requests",null,"https:\u002F\u002Fgithub.com\u002Fllm-d\u002Fllm-d-router","Go",216,235,28,189,0,5,8,15,7.12,false,"main",[24,25,26,27,28],"inference","kubernetes","networking","ai","gateway-api","2026-06-12 02:03:36","[![Go Report Card](https:\u002F\u002Fgoreportcard.com\u002Fbadge\u002Fgithub.com\u002Fllm-d\u002Fllm-d-inference-scheduler)](https:\u002F\u002Fgoreportcard.com\u002Freport\u002Fgithub.com\u002Fllm-d\u002Fllm-d-inference-scheduler)\n[![Go Reference](https:\u002F\u002Fpkg.go.dev\u002Fbadge\u002Fgithub.com\u002Fllm-d\u002Fllm-d-inference-scheduler.svg)](https:\u002F\u002Fpkg.go.dev\u002Fgithub.com\u002Fllm-d\u002Fllm-d-inference-scheduler)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fllm-d\u002Fllm-d-inference-scheduler)](\u002FLICENSE)\n[![Join Slack](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin_Slack-blue?logo=slack)](https:\u002F\u002Fllm-d.slack.com\u002Farchives\u002FC08SBNRRSBD)\n\n# llm-d Router\n\n> [!IMPORTANT]\n> **Terminology Change**: The *Inference Scheduler* has been renamed to **llm-d Router**; see [Terminology](README.md#terminology).\n\n> [!IMPORTANT]\n> **API & Code 
Consolidation**: Core Endpoint Picker (EPP) code and the `InferenceObjective` and `InferenceModelRewrite` APIs have been merged into this repository from [Gateway API Inference Extension (GIE)]. The GIE repository now exclusively hosts the `InferencePool` API—an extension of the [Kubernetes Gateway API]—and defines the Endpoint Picker Protocol.\n\nThe **llm-d Router** is the intelligent entry point for inference traffic, delivering LLM load and prefix-cache aware routing, request prioritization, and advanced flow control across diverse request formats to fulfill complex serving objectives. It supports a flexible deployment model: it can run in **Standalone Mode** (where a self-managed Envoy proxy runs alongside the EPP in the same pod) or integrate with L7 load balancers—including self-managed instances (e.g., Istio, AgentGateway) and cloud-managed services (e.g., Google Cloud's Application Load Balancer)—via the Kubernetes Gateway API. \n\nThe router achieves its intelligence through an **Endpoint Picker (EPP)** that integrates with production-grade proxies (such as [Envoy]) via the [ext-proc] protocol, injecting real-time signals into the data plane to optimize request placement.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fllm-d-router.svg\" width=\"800\" alt=\"llm-d Router Architecture\">\n\u003C\u002Fp>\n\n## Core Components and APIs\n\nThis repository hosts the following core components:\n\n- **Endpoint Picker (EPP)**: The intelligent routing engine that serves as the \"brain\" of the router. It evaluates incoming requests against the current state of the [InferencePool], considering factors like KV-cache locality, current load, and priority to make optimal placement decisions. 
It integrates with L7 proxies via the `ext-proc` protocol.\n- **Request Management APIs**: These resources directly influence the EPP's request handling behavior:\n    - **InferenceObjective**: Configures the EPP's scheduling goals for specific requests, including priority levels and performance targets.\n    - **InferenceModelRewrite**: Directs the EPP to perform model name rewriting, enabling flexible traffic management for A\u002FB testing and canary rollouts.\n- **Disaggregation Sidecar**: A coordination component deployed alongside model servers (typically as a sidecar to the decode worker). It orchestrates complex multi-stage inference lifecycles, such as **P\u002FD (Prefill\u002FDecode)** and **E\u002FP\u002FD (Encode\u002FPrefill\u002FDecode)**, by communicating with specialized encode and prefill workers to manage KV-cache and embedding transfers. For more details, see the [Disaggregation Documentation].\n\n## Modes of Operation\n\nThe llm-d Router supports two primary deployment modes as specified in the [Kubernetes Gateway API Inference Extensions]:\n\n### 1. Standalone Mode\nA lightweight deployment where a self-managed Envoy proxy runs alongside the EPP in the same pod. This mode is ideal for clusters without Gateway API infrastructure or for basic testing and local evaluations.\n\n### 2. Gateway Mode (Inference Gateway)\nThe recommended mode for production environments, leveraging the official [Gateway API]. In this mode, the EPP acts as a backend for an `InferencePool`, which is referenced by an `HTTPRoute` on a shared `Gateway`. This enables advanced traffic management, multi-cluster load balancing, and shared infrastructure for both inference and traditional workloads.\n\nFor more details on the router architecture, routing logic, and different plugins (filters and scorers), see the [Architecture Documentation].\n\n---\n\n> [!NOTE]\n> The project provides tools for automatic Envoy installation. 
However, if you install or\n> configure it yourself, please note that the only supported [request_body_mode and response_body_mode](https:\u002F\u002Fwww.envoyproxy.io\u002Fdocs\u002Fenvoy\u002Flatest\u002Fapi-v3\u002Fservice\u002Fext_proc\u002Fv3\u002Fexternal_processor.proto)\n> is `FULL_DUPLEX_STREAMED`\n\n## Terminology\n\nTo ensure clarity across the project, we use the following standard terminology:\n\n- **llm-d Router**: The complete intelligent entry point, comprising both the **Proxy** (e.g., Envoy) and the **Endpoint Picker (EPP)**. This term replaces \"Inference Scheduler\" in all contexts.\n- **llm-d Endpoint Picker (EPP)**: The specific component that implements the routing intelligence and scoring logic. Use this term when referring to capabilities or configurations specific to the EPP itself, rather than the request routing system as a whole.\n- **Inference Gateway**: A synonym for the **llm-d Router** when operating in **Gateway Mode**.\n- **Request Scheduler**: A sub-component within the EPP responsible for the queuing and dispatching of requests.\n\n[Kubernetes]:https:\u002F\u002Fkubernetes.io\n[Kubernetes Gateway API]:https:\u002F\u002Fgateway-api.sigs.k8s.io\u002F\n[Architecture Documentation]:docs\u002Farchitecture.md\n[Disaggregation Documentation]:docs\u002Fdisaggregation.md\n[InferencePool]:https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fgateway-api-inference-extension\n[Gateway API Inference Extension (GIE)]:https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fgateway-api-inference-extension\n[Kubernetes Gateway API Inference Extensions]:https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fgateway-api-inference-extension\n[Gateway API]:https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fgateway-api\n[Envoy]:https:\u002F\u002Fgithub.com\u002Fenvoyproxy\u002Fenvoy\n[ext-proc]:https:\u002F\u002Fwww.envoyproxy.io\u002Fdocs\u002Fenvoy\u002Flatest\u002Fconfiguration\u002Fhttp\u002Fhttp_filters\u002Fext_proc_filter\n\n## 
Contributing\n\nStart with the [llm-d organization contributing guide][org-contributing] for project-wide guidelines, code of conduct, and community resources.\n\nOur community meeting is held bi-weekly on Wednesdays at 10 AM PDT ([Google Meet], [Meeting Notes]).\n\nWe currently use the [#sig-router] channel in the llm-d Slack workspace for communication.\n\nFor large changes, please [create an issue] first describing the change so the\nmaintainers can assess it and work through the details with you. See\n[DEVELOPMENT.md](DEVELOPMENT.md) for details on how to work with the codebase.\n\nContributions are welcome!\n\n[org-contributing]:https:\u002F\u002Fgithub.com\u002Fllm-d\u002Fllm-d\u002Fblob\u002Fmain\u002FCONTRIBUTING.md\n[create an issue]:https:\u002F\u002Fgithub.com\u002Fllm-d\u002Fllm-d-inference-scheduler\u002Fissues\u002Fnew\n[discussion]:https:\u002F\u002Fgithub.com\u002Fllm-d\u002Fllm-d-inference-scheduler\u002Fdiscussions\u002Fnew?category=q-a\n[Slack]:https:\u002F\u002Fllm-d.slack.com\u002F\n[Google Meet]:https:\u002F\u002Fmeet.google.com\u002Fzij-zekm-jvt\n[Meeting Notes]:https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1Pf3x7ZM8nNpU56nt6CzePAOmFZ24NXDeXyaYb565Wq4\n[#sig-router]:https:\u002F\u002Fllm-d.slack.com\u002F?redir=%2Fmessages%2Fsig-router\n","llm-d Router is an intelligent entry point for inference requests, designed to optimize load balancing, prefix-cache-aware routing, request prioritization, and flow control for large language models (LLMs). The project is written in Go and integrates with production-grade proxies such as Envoy through its Endpoint Picker (EPP), using the ext-proc protocol to inject real-time signals that optimize request placement. It supports deployment in standalone mode or integration with L7 load balancers, including self-managed instances and cloud services, via the Kubernetes Gateway API. It suits scenarios that require efficient management and scheduling of complex inference workloads, particularly with diverse request formats and strict quality-of-service requirements.",2,"2026-06-11 03:53:25","trending"]