[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80752":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":13,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},80752,"vibe-serve","uw-syfi\u002Fvibe-serve","uw-syfi","Can AI Agents Build Bespoke LLM Serving Systems?","",null,"Python",64,12,3,35,0,15,50.84,"MIT License",false,"main",true,[24,25],"agent","llm-serving","2026-06-11 04:07:14","# VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.06068-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06068)\n\n**An agentic loop that synthesizes bespoke LLM serving systems — one per (model, hardware, workload) target — instead of forcing every deployment through a single general-purpose runtime.**\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Ffigures\u002Fidea.png\" width=\"85%\" alt=\"Generic serving today vs. 
VibeServe's per-target bespoke systems\">\n\u003C\u002Fp>\n\n## Updates\n\n- **2026-05** — Blog post: [Let AI Agents Write Your Serving Stack with VibeServe](https:\u002F\u002Fsyfi.cs.washington.edu\u002Fblog\u002F2026-05-12-introducing-vibeserve\u002F).\n- **2026-05** — Paper released on arXiv: [2605.06068](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06068).\n\n## Introduction\n\nVibeServe explores a new approach to LLM serving: instead of relying on one general-purpose runtime to support every model, workload, and hardware target, we use AI agents to generate bespoke serving systems for each deployment scenario. The project asks whether long-horizon coding agents can synthesize complete LLM serving stacks end-to-end, including scheduling, caching, runtime logic, correctness checks, and performance optimizations tailored to a specific target.\n\nThe system is organized as a multi-agent optimization loop. An outer loop plans the search over system designs using persistent state such as issues, memory, and git history, while an inner loop implements candidate systems, validates correctness against a reference implementation, and measures performance on the target benchmark. Across standard and non-standard serving scenarios, VibeServe matches highly optimized systems like vLLM in mainstream deployments and achieves substantial gains in specialized settings involving predicted-output decoding, hybrid prompt caching, streaming ASR, constrained JSON decoding, multimodal inference, and Apple Silicon deployment.\n\n## Architecture\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Ffigures\u002Farchitecture.png\" width=\"90%\" alt=\"VibeServe architecture: outer loop dispatches per-round tasks to an inner loop of Implementer \u002F Accuracy Judge \u002F Performance Evaluator agents\">\n\u003C\u002Fp>\n\nThe framework factors the work along two axes:\n\n- **Outer loop** — a search policy operating over a git-recorded history of validated checkpoints. 
It picks the next optimization, dispatches one concrete task to the inner loop, and updates persistent planning state (issues, long-term memory file, commit graph). \n- **Inner loop** — three role-specialized coding-agent invocations on a shared workspace:\n  - *Implementer* writes\u002Fedits the candidate serving system.\n  - *Accuracy Judge* runs the user-supplied checker against the reference and inspects diffs\u002Fruntime behavior for reward-hacking patterns; only correct candidates exit the inner loop.\n  - *Performance Evaluator* profiles the implementation (Nsight Systems, PyTorch profiler) and feeds bottleneck hints back to the policy.\n- **Skills library** — Agent Skills entries distilled from existing serving engines and research literature (continuous batching, paged-KV, FlashInfer\u002FFlashAttention, MLX, hybrid-cache management, …). New model families, hardware platforms, and optimization techniques are added by writing a skill, not by modifying the framework.\n- **Execution environment** — an isolated workspace that mounts the user-provided artifacts read-only (so the Implementer cannot edit the checker or reference) and exposes the target hardware (local CUDA, Modal, Docker, or Apple Silicon) plus profilers.\n\nEach candidate is a git commit; the outer loop only advances on Judge-validated implementations, so incorrect candidates can never derail subsequent rounds.\n\n## Installation\n\nRequires Python 3.11+.\n\n```bash\nuv sync\ncp .env.example .env       # provider keys (Anthropic \u002F OpenAI \u002F Vertex \u002F …)\ncp agent.toml.example agent.toml\n```\n\n## Quickstart\n\n```bash\n# Issue-tracker outer loop, Codex CLI, Docker on local CUDA, 4 rounds\nvibe-serve \\\n  --ref examples\u002Fmoonshine-streaming\u002Freference \\\n  --acc-checker examples\u002Fmoonshine-streaming\u002Faccuracy_checker \\\n  --bench examples\u002Fmoonshine-streaming\u002Fbenchmark \\\n  --exp-name my-experiment \\\n  --docker \\\n  --agent-backend cli --cli-provider 
codex \\\n  --max-rounds 4 \\\n  --modality speech_to_text\n```\n\n`--outer-loop` defaults to `agent`.  Pass `--outer-loop plain` or `--outer-loop evolve` to switch.  See `vibe-serve --outer-loop \u003Ckind> --help` for loop-specific flags.\n\nA separate entry point exposes the issue MCP server used by the plain loop:\n\n```bash\nvibe-serve-issue-mcp                         # serves issues.json over MCP\n```\n\n## Per-target inputs\n\nEach evaluation target lives under `examples\u002F\u003Cname>\u002F`:\n\n```\nexamples\u002F\u003Cname>\u002F\n├── OBJECTIVE.md          # free-form deployment goal (model + hardware + workload + interface)\n├── reference\u002F            # reference HuggingFace Transformers implementation\n│   ├── reference.py\n│   ├── config.json\n│   └── meta.json         # model id + revision\n├── accuracy_checker\u002F     # checker.py + tests\u002Fdata — the correctness gate\n├── benchmark\u002F            # benchmark.py + load levels — emits the metric to optimize\n└── README.md             # human-readable description\n```\n\n`OBJECTIVE.md` is read at the start of every run and must live next to `--ref` (sibling, not inside). 
See `examples\u002FLlama-3-8B\u002F`, `examples\u002Fmoonshine-streaming\u002F`, `examples\u002Fqwen3-32b-code-edit\u002F`, `examples\u002Folmo-hybrid-prefix-caching\u002F`, `examples\u002FLlama-3.1-8B-Instruct-MLX-8bit\u002F`, `examples\u002Fshow-o2-1.5B-HQ-h100\u002F`, and `examples\u002Fshow-o2-1.5B-HQ-macbook\u002F` for the paper scenarios.\n\nFor multi-objective evolutionary runs, drop an `objectives.toml` next to `OBJECTIVE.md` (or pass `--objective name:max|min` flags) — see `vibe-serve --outer-loop evolve --help`.\n\n## Configuration (`agent.toml`)\n\n```toml\n[model]\nname = \"claude-sonnet-4-6\"   # auto-detected provider for claude-* \u002F gpt-* \u002F gemini-*\n# provider = \"anthropic\"     # optional override\n\n[backend]\nname = \"cuda\"                 # or \"metal\" for Apple Silicon (local exec only)\n\n[agent]\nbackend = \"cli\"               # \"cli\" (codex\u002Fclaude\u002Fgemini\u002Fopencode) or \"deepagents\"\ncli_provider = \"codex\"        # which coding-agent harness to drive\n```\n\nProvider credentials live in `.env` — see `.env.example`. The CLI flags `--agent-backend` \u002F `--cli-provider` \u002F `--backend` override these.\n\n## Skills library\n\n`resources\u002Fskills\u002Fserving-systems\u002F` contains the Agent Skills entries the inner loop's agents read at runtime: model architectures, serving algorithms, programming frameworks, backend libraries, hardware platforms, and reference engines. 
New optimization techniques and model families enter as new skill entries; the framework itself is target-agnostic.\n\n## Outputs\n\nEvery run creates `exp_env\u002F\u003Ctimestamp>-\u003Cname>\u002F`:\n\n```\nexp_env\u002F\u003Crun>\u002F\n├── workspace\u002F                # the unified, git-tracked workspace (each round = one commit)\n├── logs\u002F\n│   ├── run-*.log             # top-level run log\n│   ├── run-*-roundNNN.log    # per-round agent log (agent loop)\n│   ├── progress.md           # long-term memory file the Orchestrator reads\u002Fedits\n│   ├── rounds.json           # per-round audit\n│   ├── state.json            # cursor (plain loop)\n│   ├── issues.json           # IssueBoard (plain loop)\n│   ├── population.json       # Individual list (evolve loop)\n│   └── docker.log\n└── reference\u002F                # snapshot of --ref at start\n```\n\nResume any run with `--resume` (defaults to \"latest\"):\n\n```bash\nvibe-serve --resume                  # newest run\nvibe-serve --resume 20260507-...     
# specific dir\n```\n\n## Repository layout\n\n```\nsrc\u002Fvibe_serve\u002F\n├── cli.py                        # single entry point: `vibe-serve`\n├── context.py                    # _RunContext: lifecycle + ctx.invoke()\n├── agent_runner.py               # invoke wrappers + structured-response extraction\n├── prompts.py                    # Jinja + backend-fragment renderer\n├── schemas.py                    # Pydantic response schemas\n├── llm_client.py                 # LLM client factory\n├── config.py \u002F constants.py\n│\n├── loops\u002F                        # the three outer-loop search policies\n│   ├── agent\u002F                    # issue-tracker (Orchestrator-driven)\n│   ├── plain\u002F                    # Ralph-style queue-drain\n│   ├── evolve\u002F                   # population-based\n│   └── profiler.py               # shared Performance Evaluator helper\n│\n├── sandbox\u002F                      # execution-environment policy\n│   ├── docker_sandbox.py\n│   ├── modal_sandbox.py\n│   ├── modal_model_setup.py\n│   └── run_environment.py\n│\n├── agents\u002F                       # coding-agent harness abstraction\n│   └── callbacks.py              # LangChain logger (deepagents path)\n└── backends\u002F                     # cuda \u002F metal compute backends\n\nexamples\u002F                         # six paper scenarios + nsys\u002Ftorch profiler skills\nresources\u002Fskills\u002Fserving-systems\u002F # Agent Skills library\n```\n\n- **agent**: pre-round → profiler → orchestrator plan → implementer\u002Fjudge\n  retry up to `--max-retries-per-round` (default 3).  
Always exhausts\n  `--max-rounds`; supports `revert_to_round` mid-loop.\n- **plain**: drain `IssueBoard` (one impl + one judge per issue, BLOCK\n  after `--max-attempts-per-issue`) → `perf_eval` (may file new issues).\n  Early-exits when queue is empty and `perf_eval` files nothing.\n- **evolve**: per generation × child: select parent (Pareto frontier with\n  `--frontier-bias`, scalar softmax otherwise) + inspirations →\n  `git checkout` parent tree → mutator → judge → profiler → commit.\n  No early stop; runs the full `--max-generations × --children-per-generation`.\n\n## Development\n\n```bash\nuv run pytest                                       # full suite\nuv run pytest tests\u002Floops\u002Fplain\u002Ftest_plain_loop.py  # one file\nuv run pytest -k orchestrator                       # by keyword\n```\n\n## Citation\n\nIf you use VibeServe in your research, please cite:\n\n```bibtex\n@misc{kamahori2026vibeserveaiagentsbuild,\n      title={VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?},\n      author={Keisuke Kamahori and Shihang Li and Simon Peter and Baris Kasikci},\n      year={2026},\n      eprint={2605.06068},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06068},\n}\n```\n","VibeServe is a project that uses AI agents to build bespoke large language model (LLM) serving systems. Its core function is a multi-agent optimization loop that generates a dedicated serving system for each specific model, hardware, and workload target instead of relying on a single general-purpose runtime. Technically, VibeServe has an outer loop that plans the search over system designs and an inner loop that implements candidate systems, validates correctness, and evaluates performance. This architecture lets VibeServe perform on par with highly optimized systems such as vLLM across standard and non-standard serving scenarios, and deliver significant gains in certain specialized settings such as predicted-output decoding and hybrid prompt caching. The project suits scenarios that need efficient, highly targeted LLM deployment solutions.",2,"2026-06-11 04:01:53","CREATED_QUERY"]