[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1334":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},1334,"vllm-swift","TheTom\u002Fvllm-swift","TheTom","vLLM Metal plugin powered by mlx-swift — high-performance LLM inference on Apple Silicon",null,"Python",267,17,4,7,0,1,5,12,3,49.47,"Apache License 2.0",false,"main",[],"2026-06-12 04:00:08","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.png\" alt=\"vllm-swift\" width=\"400\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  A native Swift\u002FMetal backend for \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\">vLLM\u003C\u002Fa> on Apple Silicon.\u003Cbr>\n  \u003Cb>No Python in the inference hot path.\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  Run vLLM workloads on Apple Silicon with a native Swift\u002FMetal hot path.\u003Cbr>\n  OpenAI-compatible API. Up to 2.6× faster short-context decode.\n\u003C\u002Fp>\n\n## Quick Start\n\n### 1. Install\n\n**Homebrew** (recommended for Mac power users):\n\n```bash\nbrew tap TheTom\u002Ftap && brew install vllm-swift\n```\n\n**pip** (everyone else, including dev containers and non-brew Macs):\n\n```bash\npip install vllm-swift\n```\n\nThe pip wheel bundles the prebuilt Swift bridge dylib + Metal kernel library, so no compile or brew step is required. 
Apple Silicon, Python 3.10+, macOS 11+.\n\n**From source:**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTheTom\u002Fvllm-swift.git && cd vllm-swift\n.\u002Fscripts\u002Finstall.sh       # builds Swift bridge, installs plugin, creates activate.sh\nsource activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)\n```\n\n### 2. Run\n\n```bash\nvllm-swift download mlx-community\u002FQwen3-4B-4bit\nvllm-swift serve ~\u002Fmodels\u002FQwen3-4B-4bit --max-model-len 4096  # increase as needed, max 40960\n```\n\n> Homebrew users don't need `activate.sh` — `vllm-swift serve` handles everything.\n\nServer running at `http:\u002F\u002Flocalhost:8000` (OpenAI-compatible API).\n\n> Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.\n\n## Performance (M5 Max 128GB)\n\nDecode throughput, tok\u002Fs. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). **vllm-swift** uses the Swift\u002FMetal engine via ctypes. 
**vllm-metal** uses the Python\u002FMLX engine via vLLM's offline `LLM` API.\n\n### Qwen3-0.6B\n\n| | Single | 8 concurrent | 32 concurrent | 64 concurrent |\n|---|:---:|:---:|:---:|:---:|\n| **vllm-swift** | **364** | **1,527** | **2,859** | **3,425** |\n| vllm-metal (Python\u002FMLX) | 111 | 652 | 2,047 | 2,620 |\n\n### Qwen3-4B\n\n| | Single | 8 concurrent | 32 concurrent | 64 concurrent |\n|---|:---:|:---:|:---:|:---:|\n| **vllm-swift** | **147** | **477** | **1,194** | **1,518** |\n| vllm-metal (Python\u002FMLX) | 104 | 396 | 1,065 | 1,375 |\n\n> Full matrix, methodology, and long-context cells in [docs\u002FPERFORMANCE.md](docs\u002FPERFORMANCE.md).\n\n### [TurboQuant+](https:\u002F\u002Fgithub.com\u002FTheTom\u002Fturboquant_plus) KV Cache Compression\n\n[TurboQuant+](https:\u002F\u002Fgithub.com\u002FTheTom\u002Fturboquant_plus) compresses KV cache to fit longer context with modest throughput cost.\n\n**Qwen3.5 2B (4-bit weights)**\n\n| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |\n|----------|:-----------:|:----------:|:----------:|:----------:|:----------:|\n| FP16 | 1.0× | 1,252 tok\u002Fs | 259 tok\u002Fs | 1,215 tok\u002Fs | 249 tok\u002Fs |\n| turbo4v2 | 3.0× | 1,331 tok\u002Fs | 245 tok\u002Fs | 1,245 tok\u002Fs | 240 tok\u002Fs |\n| turbo3 | 4.6× | 1,346 tok\u002Fs | 174 tok\u002Fs | 1,276 tok\u002Fs | 241 tok\u002Fs |\n\n## Architecture\n\nThe entire forward pass runs in Swift\u002FMetal. 
Python is used only for orchestration.\n\n```\nPython (vLLM API, tokenization, scheduling)  ← github.com\u002Fvllm-project\u002Fvllm\n  ↓ ctypes FFI\nC bridge (bridge.h)\n  ↓ @_cdecl\nSwift (mlx-swift-lm, BatchedKVCache, batched decode)\n  ↓\nMetal GPU\n```\n\n## Features\n\n- OpenAI-compatible API (`\u002Fv1\u002Fcompletions`, `\u002Fv1\u002Fchat\u002Fcompletions`)\n- Streaming (SSE) responses\n- Chat templates (applied by vLLM, model-specific)\n- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)\n- Per-request temperature sampling in batched path\n- Auto model download from HuggingFace Hub\n- [TurboQuant+](https:\u002F\u002Fgithub.com\u002FTheTom\u002Fturboquant_plus) KV cache compression (`turbo3`, `turbo4v2`) via mlx-swift-lm\n- Decode and prompt logprobs\n- Greedy and temperature sampling\n- EOS \u002F stop token detection (vLLM scheduler)\n- VLM (vision-language model) support (experimental)\n- Works with [Hermes](https:\u002F\u002Fgithub.com\u002Fnousresearch\u002Fhermes-agent), [OpenCode](https:\u002F\u002Fgithub.com\u002Fanomalyco\u002Fopencode), and any OpenAI-compatible client\n\n## Use with AI tools\n\n```bash\n# Start server with tool calling enabled\nvllm-swift serve ~\u002Fmodels\u002FQwen3-4B-4bit --max-model-len 40960 \\\n  --served-model-name qwen3-4b \\\n  --enable-auto-tool-choice --tool-call-parser hermes\n```\n\nThen point your tool at it:\n\n```bash\n# Hermes — set in ~\u002F.hermes\u002Fconfig.yaml:\n#   base_url: http:\u002F\u002Flocalhost:8000\u002Fv1\n#   model: qwen3-4b\n\n# OpenCode\nOPENAI_API_BASE=http:\u002F\u002Flocalhost:8000\u002Fv1 OPENAI_API_KEY=local opencode\n\n# Any OpenAI-compatible client\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"model\":\"qwen3-4b\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}'\n```\n\n## Configuration\n\n`vllm-swift serve` is a thin wrapper around `vllm serve` — 
all standard vLLM flags work. Here are the common setups:\n\n### Basic serving\n\n```bash\nvllm-swift serve ~\u002Fmodels\u002FQwen3-4B-4bit \\\n  --served-model-name qwen3-4b \\\n  --max-model-len 40960\n```\n\n### Agent \u002F tool calling (Hermes, OpenCode, etc.)\n\n```bash\nvllm-swift serve ~\u002Fmodels\u002FQwen3-4B-4bit \\\n  --served-model-name qwen3-4b \\\n  --max-model-len 40960 \\\n  --enable-auto-tool-choice --tool-call-parser hermes\n```\n\n### Chain-of-thought models (strip `\u003Cthink>` tags)\n\n```bash\nvllm-swift serve ~\u002Fmodels\u002FQwen3-4B-4bit \\\n  --served-model-name qwen3-4b \\\n  --max-model-len 40960 \\\n  --enable-reasoning --reasoning-parser deepseek_r1\n```\n\n### Long context with [TurboQuant+](https:\u002F\u002Fgithub.com\u002FTheTom\u002Fturboquant_plus)\n\nCompress KV cache 3-5× to fit longer context with modest throughput cost:\n\n```bash\nvllm-swift serve ~\u002Fmodels\u002FQwen3-4B-4bit \\\n  --served-model-name qwen3-4b \\\n  --max-model-len 40960 \\\n  --additional-config '{\"kv_scheme\": \"turbo4v2\", \"kv_bits\": 4}'\n```\n\n| Scheme | Compression | Best for |\n|--------|:-----------:|----------|\n| `turbo4v2` | ~3× | Recommended — best quality\u002Fcompression balance |\n| `turbo3` | ~4.6× | Maximum compression, higher PPL trade-off |\n\n### Full setup (agent + reasoning + TurboQuant+)\n\n```bash\nvllm-swift serve ~\u002Fmodels\u002FQwen3-4B-4bit \\\n  --served-model-name qwen3-4b \\\n  --max-model-len 40960 \\\n  --enable-auto-tool-choice --tool-call-parser hermes \\\n  --enable-reasoning --reasoning-parser deepseek_r1 \\\n  --additional-config '{\"kv_scheme\": \"turbo4v2\", \"kv_bits\": 4}'\n```\n\n### All flags\n\n```bash\nvllm-swift serve \u003Cmodel> [options]\n\n  --served-model-name NAME   Clean model name for API clients (recommended)\n  --max-model-len N          Max sequence length (default: model config)\n  --port PORT                API server port (default: 8000)\n  --gpu-memory-utilization F Memory 
fraction 0.0-1.0 (default: 0.9)\n  --dtype float16            Model dtype (default: float16)\n  --enable-auto-tool-choice  Enable tool\u002Ffunction calling\n  --tool-call-parser NAME    Tool call format (hermes, llama3, mistral, etc.)\n  --enable-reasoning         Enable chain-of-thought parsing\n  --reasoning-parser NAME    Reasoning format (deepseek_r1, etc.)\n  --additional-config JSON   Extra config (kv_scheme, kv_bits)\n```\n\nAll standard [vLLM flags](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fserving\u002Fopenai_compatible_server.html) work — these are just the most common ones.\n\n## Documentation\n\n| Doc | What's in it |\n|---|---|\n| [docs\u002FPERFORMANCE.md](docs\u002FPERFORMANCE.md) | Full perf matrix vs vllm-metal, methodology, long-context cells |\n| [docs\u002FMODEL_COMPATIBILITY.md](docs\u002FMODEL_COMPATIBILITY.md) | Empirical pass \u002F soft-fail \u002F hard-fail across local MLX models with root-cause classification (model intrinsic, vLLM upstream, env-missing) |\n| [docs\u002FTROUBLESHOOTING.md](docs\u002FTROUBLESHOOTING.md) | Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.) |\n| [CHANGELOG.md](CHANGELOG.md) | Release history |\n\n## Changelog\n\nSee [CHANGELOG.md](CHANGELOG.md) for release history.\n\n## Known Limitations (early development)\n\n- **LoRA** not supported (Swift engine limitation)\n- **Chunked prefill** disabled (Swift engine handles full sequences)\n- **top_p sampling** not supported in batched decode path (temperature works)\n- Only **Qwen3** models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)\n- Requires macOS on Apple Silicon (no Linux\u002FCUDA)\n\n## Install\n\n### Homebrew\n\n```bash\nbrew tap TheTom\u002Ftap && brew install vllm-swift\n```\n\nPrebuilt bottle — no Swift toolchain needed. 
First run of `vllm-swift serve` sets up a managed Python environment automatically.\n\nTo update to the latest version:\n\n```bash\nvllm-swift update\n\n# Or via standard Homebrew (works from any version):\nbrew update && brew upgrade vllm-swift\n```\n\n### From source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTheTom\u002Fvllm-swift.git\ncd vllm-swift\n.\u002Fscripts\u002Finstall.sh       # builds Swift, installs plugin, creates activate.sh\nsource activate.sh         # sets DYLD_LIBRARY_PATH\nvllm serve ~\u002Fmodels\u002FQwen3-4B-4bit --max-model-len 4096\n```\n\n### Manual (full control)\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTheTom\u002Fvllm-swift.git && cd vllm-swift\ncd swift && swift build -c release && cd ..\npip install -e .\nDYLD_LIBRARY_PATH=swift\u002F.build\u002Farm64-apple-macosx\u002Frelease \\\n  vllm serve ~\u002Fmodels\u002FQwen3-4B-4bit --max-model-len 4096\n```\n\n### Troubleshooting\n\n**Homebrew checksum error on reinstall:**\n```bash\nbrew uninstall vllm-swift && brew untap TheTom\u002Ftap\nrm -rf $(brew --cache)\u002Fdownloads\u002F*vllm*\nbrew tap TheTom\u002Ftap && brew install vllm-swift\n```\n\n**\"No module named vllm\" or plugin not loading after brew install:**\n```bash\nbrew uninstall vllm-swift && brew untap TheTom\u002Ftap\nrm -rf $(brew --cache)\u002Fdownloads\u002F*vllm* ~\u002F.vllm-swift\nbrew tap TheTom\u002Ftap && brew install vllm-swift\nvllm-swift setup\n```\n\n**vLLM build error (Apple Clang parentheses):** Our install script and brew wrapper handle this automatically. 
If you're on an older bottle or installing vLLM manually:\n```bash\n# Brew users: get the latest bottle first\nbrew uninstall vllm-swift && brew untap TheTom\u002Ftap\nrm -rf $(brew --cache)\u002Fdownloads\u002F*vllm* ~\u002F.vllm-swift\u002Fvenv\nbrew tap TheTom\u002Ftap && brew install vllm-swift && vllm-swift setup\n\n# Or install vLLM manually with the fix\nCFLAGS=\"-Wno-parentheses\" pip install vllm\n```\n\n**activate.sh not found:** Make sure you run `.\u002Finstall.sh` (or `.\u002Fscripts\u002Finstall.sh`) first — it generates `activate.sh` in the project root.\n\n**Metal kernel not found (GDN\u002FTurboFlash models):** The `mlx.metallib` file must be in the same directory as `libVLLMBridge.dylib`. For manual installs, copy it:\n```bash\ncp swift\u002F.build\u002Farm64-apple-macosx\u002Frelease\u002Fmlx.metallib \\\n   $(dirname $(echo $DYLD_LIBRARY_PATH | cut -d: -f1))\u002F\n```\n\n### Download a model\n\n```bash\nvllm-swift download mlx-community\u002FQwen3-4B-4bit\n\n# Or manually:\nhuggingface-cli download mlx-community\u002FQwen3-4B-4bit --local-dir ~\u002Fmodels\u002FQwen3-4B-4bit\n\n# Already have models in HuggingFace cache? 
Point directly at them:\nvllm-swift serve ~\u002F.cache\u002Fhuggingface\u002Fhub\u002Fmodels--mlx-community--Qwen3-4B-4bit\u002Fsnapshots\u002Flatest\n```\n\n## Project Structure\n\n```\nvllm_swift\u002F           Python plugin (vLLM WorkerBase)\nswift\u002F\n  Sources\u002FVLLMBridge\u002F       C bridge (@_cdecl exports)\n  bridge.h                  C API (prefill, decode, batched decode)\nscripts\u002F\n  install.sh                One-step build + install\n  build_bottle.sh           Build + upload Homebrew bottle\n  integration_test.sh       End-to-end smoke test\nhomebrew\u002F\n  vllm-swift.rb             Homebrew formula\ntests\u002F                      84 tests, 97% coverage\n```\n\n## Requirements\n\n- macOS 14+ on Apple Silicon\n- Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)\n- Python 3.10+\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) 0.19+\n- [mlx-swift-lm](https:\u002F\u002Fgithub.com\u002FTheTom\u002Fmlx-swift-lm\u002Ftree\u002Fvllm-swift-stable) (pulled automatically by Swift Package Manager)\n\n## License\n\nApache-2.0\n","vLLM-Swift is a high-performance large language model inference engine designed for Apple Silicon, with native support through Swift and Metal. Its core features include an inference hot path with no Python, an OpenAI-compatible API, and up to 2.6× faster short-context decode. The project takes advantage of hardware acceleration on Apple devices and is particularly suited to running language model workloads, such as Qwen3-4B-4bit, on M-series Macs for faster response times and higher throughput. It also supports TurboQuant+ KV cache compression, which extends usable context length while maintaining performance.",2,"2026-06-11 02:43:08","CREATED_QUERY"]