[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-87":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":44,"readmeContent":45,"aiSummary":46,"trendingCount":16,"starSnapshotCount":16,"syncStatus":47,"lastSyncTime":48,"discoverSource":49},87,"Rapid-MLX","raullenchai\u002FRapid-MLX","raullenchai","The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider.",null,"https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX","Python",2728,338,57,27,0,51,96,592,153,29.59,false,"main",[25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43],"apple-silicon","fastapi","inference","llm","local-llm","macos","mlx","openai-api","python","tool-calling","hacktoberfest","ollama-alternative","m1","m2","m3","qwen","deepseek","claude-code","cursor","2026-06-12 02:00:07","\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fraullenchai\u002FRapid-MLX\u002Fmain\u002Fdocs\u002Fassets\u002Flogo.png\" alt=\"Rapid-MLX\" width=\"200\">\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">Rapid-MLX\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>Run AI on your Mac. 
Faster than anything else.\u003C\u002Fstrong>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"LICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg\" alt=\"License\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg\" alt=\"Python 3.10+\">\u003C\u002Fa>\n  \u003Ca href=\"tests\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftests-2100%2B-brightgreen.svg\" alt=\"Tests\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fsupport.apple.com\u002Fen-us\u002FHT211814\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FApple_Silicon-M1%20|%20M2%20|%20M3%20|%20M4-black.svg?logo=apple\" alt=\"Apple Silicon\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX\u002Fstargazers\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fraullenchai\u002FRapid-MLX?style=social\" alt=\"GitHub stars\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  Run local AI models on your Mac — no cloud, no API costs. 
Works with Cursor, Claude Code, and any OpenAI-compatible app.\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fraullenchai\u002FRapid-MLX\u002Fmain\u002Fdocs\u002Fassets\u002Fdemo.gif\" alt=\"Rapid-MLX demo — install, serve Gemma 4, chat, tool calling\" width=\"700\">\n  \u003Cbr>\n  \u003Cem>pip install → serve Gemma 4 26B → chat + tool calling → works with PydanticAI, LangChain, Aider, and more.\u003C\u002Fem>\n\u003C\u002Fp>\n\n| Your Mac | Model | Speed (tok\u002Fs = words\u002Fsec) | What works |\n|:---|:---:|:---:|:---:|\n| **16 GB** MacBook Air | Qwen3.5-4B | 160 tok\u002Fs | Chat, coding, tools |\n| **32+ GB** Mac Mini \u002F Studio | Nemotron-Nano 30B | 141 tok\u002Fs | 🆕 Fastest 30B, 100% tools |\n| **32+ GB** Mac Mini \u002F Studio | Qwen3.6-35B | 95 tok\u002Fs | 256 experts, 262K context |\n| **64 GB** Mac Mini \u002F Studio | Qwen3.5-35B | 83 tok\u002Fs | Best balance of smart + fast |\n| **96+ GB** Mac Studio \u002F Pro | Qwen3.5-122B | 57 tok\u002Fs | Frontier-level intelligence |\n| **128+ GB** Mac Studio Ultra | 🆕 DeepSeek V4 Flash 158B-A13B | 31-56 tok\u002Fs | Day-0 frontier MoE, 1M context |\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>New to local AI? Quick glossary\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **tok\u002Fs** (tokens per second) — roughly how many words the AI generates per second. Higher = faster.\n- **4bit \u002F 8bit** — compression levels for models. 4bit uses less memory (recommended); 8bit is higher quality.\n- **TTFT** (Time To First Token) — how long before the AI starts responding.\n- **Tool calling** — the AI can call functions in your code. Used by Cursor, Claude Code, and coding assistants.\n- **OpenAI API compatible** — Rapid-MLX speaks the same language as ChatGPT's API, so any app that works with ChatGPT can work with Rapid-MLX by just changing the server address.\n- **Ollama \u002F llama.cpp** — other popular tools for running local AI. 
Rapid-MLX is 2-4x faster on Apple Silicon.\n\n\u003C\u002Fdetails>\n\n---\n\n## Quick Start\n\n**Step 1 — Install** (pick one):\n\n```bash\n# Homebrew (recommended — just works, no Python version issues)\nbrew install raullenchai\u002Frapid-mlx\u002Frapid-mlx\n\n# pip (requires Python 3.10+ — macOS ships 3.9, so install Python first if needed)\npip install rapid-mlx\n\n# Or one-liner with auto-setup (installs Python if needed)\ncurl -fsSL https:\u002F\u002Fraullenchai.github.io\u002FRapid-MLX\u002Finstall.sh | bash\n```\n\n> **Vision\u002Fmultimodal models** (Gemma 4, Qwen-VL, etc.) need extras: `pip install 'rapid-mlx[vision]'`. Text-only install is ~460 MB; vision adds ~322 MB. See [Optional Extras](#optional-extras) for the full list.\n\n> **\"No matching distribution\" error?** Your Python is too old. Run `python3 --version` — if it says 3.9, install a newer Python: `brew install python@3.12` then `python3.12 -m pip install rapid-mlx`\n\n**Step 2 — Serve a model:**\n```bash\nrapid-mlx serve qwen3.5-4b\n```\nFirst run downloads the model (~2.5 GB) — you'll see a progress bar. Wait for `Ready: http:\u002F\u002Flocalhost:8000\u002Fv1`.\n\n> Want vision? `pip install 'rapid-mlx[vision]'` then `rapid-mlx serve gemma-4-26b` (~14 GB).\n\n**Step 3 — Chat** (open a **second** terminal tab):\n```bash\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"model\":\"default\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hello\"}]}'\n```\n\nThat's it — you now have an OpenAI-compatible AI server on `localhost:8000`. Point any app at `http:\u002F\u002Flocalhost:8000\u002Fv1` and it just works.\n\n> **Tip:** Run `rapid-mlx models` to see all available model aliases. 
For a larger, more capable model, try `rapid-mlx serve qwen3.5-9b` (~5 GB).\n\n\u003Cdetails>\n\u003Csummary>More install options\u003C\u002Fsummary>\n\n**From source** (for development):\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX.git\ncd Rapid-MLX && pip install -e .\n```\n\n**Vision models** (adds torch + torchvision, ~2.5 GB extra):\n```bash\npip install 'rapid-mlx[vision]'\n```\n\n**Audio** (TTS\u002FSTT via mlx-audio):\n```bash\npip install 'rapid-mlx[audio]'\n```\n\u003C\u002Fdetails>\n\n**Try it with Python** (make sure the server is running, then `pip install openai`):\n\n```python\nfrom openai import OpenAI\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\", api_key=\"not-needed\")  # any value works, no real key needed\n\nresponse = client.chat.completions.create(\n    model=\"default\",\n    messages=[{\"role\": \"user\", \"content\": \"Say hello\"}],\n)\nprint(response.choices[0].message.content)\n```\n\n---\n\n## Works With\n\n### Agent Harnesses (MHI-tested)\n\n| Harness | Type | Notes |\n|---------|------|-------|\n| [Hermes Agent](https:\u002F\u002Fgithub.com\u002FNousResearch\u002Fhermes-agent) | Agent | 62 tools, multi-turn ([test](tests\u002Fintegrations\u002Ftest_hermes.py)) |\n| [PydanticAI](https:\u002F\u002Fai.pydantic.dev) | Framework | Typed agents, structured output ([test](tests\u002Fintegrations\u002Ftest_pydantic_ai_full.py)) |\n| [LangChain](https:\u002F\u002Flangchain.com) | Framework | `ChatOpenAI`, tools, streaming ([test](tests\u002Fintegrations\u002Ftest_langchain.py)) |\n| [smolagents](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fsmolagents) | Framework | CodeAgent + ToolCallingAgent ([test](tests\u002Fintegrations\u002Ftest_smolagents_full.py)) |\n| [OpenClaude](https:\u002F\u002Fgithub.com\u002FGitlawb\u002Fopenclaude) (Anthropic SDK) | Agent | `CLAUDE_CODE_USE_OPENAI=1` ([test](tests\u002Fintegrations\u002Ftest_anthropic_sdk.py)) |\n| 
[Aider](https:\u002F\u002Faider.chat) | Agent | CLI edit-and-commit, architect mode ([test](tests\u002Fintegrations\u002Ftest_aider.sh)) |\n| [Goose](https:\u002F\u002Fgithub.com\u002Fblock\u002Fgoose) | Agent | Ollama provider via `OLLAMA_HOST` |\n| [Claw Code](https:\u002F\u002Fgithub.com\u002Fultraworkers\u002Fclaw-code) | Agent | OpenAI & Anthropic endpoints |\n\n### UI \u002F IDE Clients\n\n| Client | Status | Setup |\n|--------|--------|-------|\n| [Cursor](https:\u002F\u002Fcursor.com) | Compatible | Settings → OpenAI Base URL |\n| [Continue.dev](https:\u002F\u002Fcontinue.dev) | Compatible | VS Code \u002F JetBrains extension |\n| [LibreChat](https:\u002F\u002Flibrechat.ai) | Tested | Docker ([test](tests\u002Fintegrations\u002Ftest_librechat_docker.py)) |\n| [Open WebUI](https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui) | Tested | Docker ([test](tests\u002Fintegrations\u002Ftest_openwebui.py)) |\n| Any OpenAI-compatible app | Compatible | Point at `http:\u002F\u002Flocalhost:8000\u002Fv1` |\n\n### Model-Harness Index (MHI)\n\nMHI measures how well a model works with a specific agent harness. It combines three dimensions:\n\n| Dimension | Weight | What it measures | Source |\n|---|---|---|---|\n| **Tool Calling** | 50% | Can the model+harness execute function calls correctly? | `rapid-mlx agents --test` |\n| **HumanEval** | 30% | Can the model generate correct code? | [HumanEval](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval) (10 tasks) |\n| **MMLU** | 20% | Does the harness degrade base knowledge? 
| [tinyMMLU](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FtinyBenchmarks\u002FtinyMMLU) (10 tasks) |\n\n**MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU** (scale 0-100)\n\n| Model | Best MHI | Best Harness | Tool Calling |\n|---|---|---|---|\n| **Qwopus 27B** | **92** | All (Hermes, PydanticAI, LangChain, smolagents) | 100% |\n| **Qwen3.5 27B** | **82** | Hermes \u002F PydanticAI \u002F LangChain | 100% |\n| **Llama 3.3 70B** | **83** | smolagents (text-based) | 100% |\n| **Nemotron Nano 30B** | **59** | PydanticAI \u002F LangChain | 91-93% |\n| **Gemma 4 26B** | **62** | Hermes \u002F smolagents | 100% |\n\n\u003Cdetails>\n\u003Csummary>Full MHI table (25 model-harness combinations) + methodology\u003C\u002Fsummary>\n\n**MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU** (scale 0-100)\n\nRun `rapid-mlx agents` to see all supported agents and `python3 scripts\u002Fmhi_eval.py` to compute MHI on your own setup.\n\n| Model + Harness | Tool Calling | HumanEval | MMLU | **MHI** |\n|---|---|---|---|---|\n| **Qwopus 27B** + Hermes | 100% | 80% | 90% | **92** |\n| **Qwopus 27B** + PydanticAI | 100% | 80% | 90% | **92** |\n| **Qwen3.5 27B** + Hermes | 100% | 40% | 100% | **82** |\n| **Llama 3.3 70B** + smolagents | 100% | 50% | 90% | **83** |\n| **DeepSeek-R1 32B** + smolagents | 100% | 30% | 100% | **79** |\n| **Gemma 4 26B** + Hermes | 100% | 0% | 60% | **62** |\n| **Nemotron Nano 30B** + PydanticAI | 93% | 0% | 60% | **59** |\n\n\u003C\u002Fdetails>\n\n**Quick setup for popular apps:**\n\n**Cursor:** Settings → Models → Add Model:\n```\nOpenAI API Base:  http:\u002F\u002Flocalhost:8000\u002Fv1\nAPI Key:          not-needed\nModel name:       default          (or qwen3.5-9b — either works)\n```\nCursor's agent\u002Fcomposer mode uses tool calls automatically — Rapid-MLX handles them natively with Qwen3.5 models, no extra flags needed.\n\n**Claw Code:**\n```bash\nexport OPENAI_BASE_URL=http:\u002F\u002Flocalhost:8000\u002Fv1\nexport 
OPENAI_API_KEY=not-needed\nclaw --model \"openai\u002Fdefault\" prompt \"summarize this repo\"\n```\n\n**OpenClaude:**\n```bash\nCLAUDE_CODE_USE_OPENAI=1 OPENAI_BASE_URL=http:\u002F\u002Flocalhost:8000\u002Fv1 \\\nOPENAI_API_KEY=not-needed OPENAI_MODEL=default openclaude -p \"hello\"\n```\n\n**Hermes Agent** (`~\u002F.hermes\u002Fconfig.yaml`):\n```yaml\nmodel:\n  provider: \"custom\"\n  default: \"default\"\n  base_url: \"http:\u002F\u002Flocalhost:8000\u002Fv1\"\n  context_length: 32768\n```\n\n**Goose:**\n```bash\nGOOSE_PROVIDER=ollama OLLAMA_HOST=http:\u002F\u002Flocalhost:8000 \\\nGOOSE_MODEL=default goose run --text \"hello\"\n```\n\n**Claude Code:**\n```bash\nOPENAI_BASE_URL=http:\u002F\u002Flocalhost:8000\u002Fv1 claude\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>More client setup instructions\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n**Continue.dev** (`~\u002F.continue\u002Fconfig.yaml`):\n```yaml\nmodels:\n  - name: rapid-mlx\n    provider: openai\n    model: default\n    apiBase: http:\u002F\u002Flocalhost:8000\u002Fv1\n    apiKey: not-needed\n```\n\n**Aider:**\n```bash\naider --openai-api-base http:\u002F\u002Flocalhost:8000\u002Fv1 --openai-api-key not-needed\n```\n\n**Swival** (`~\u002F.swival\u002Fconfig.toml`):\n```toml\n[profiles.rapidmlx]\nprovider = \"generic\"\nbase_url = \"http:\u002F\u002F127.0.0.1:8000\"\nmodel = \"default\"\n```\n\nRun with:\n```bash\nswival --profile rapidmlx \"summarize this repo\"\n```\n\n**Open WebUI** (Docker one-liner):\n```bash\ndocker run -d -p 3000:8080 \\\n  --add-host=host.docker.internal:host-gateway \\\n  -e ENABLE_OLLAMA_API=False \\\n  -e OPENAI_API_BASE_URL=http:\u002F\u002Fhost.docker.internal:8000\u002Fv1 \\\n  -e OPENAI_API_KEY=not-needed \\\n  -v open-webui:\u002Fapp\u002Fbackend\u002Fdata \\\n  --name open-webui \\\n  ghcr.io\u002Fopen-webui\u002Fopen-webui:main\n```\n\n**OpenCode** (`opencode.json` in your project root):\n```json\n{\n  \"provider\": {\n    \"openai\": {\n      \"api\": 
\"http:\u002F\u002Flocalhost:8000\u002Fv1\",\n      \"models\": {\n        \"default\": {\n          \"name\": \"rapid-mlx local\",\n          \"limit\": { \"context\": 32768, \"output\": 8192 }\n        }\n      },\n      \"options\": { \"apiKey\": \"not-needed\" }\n    }\n  }\n}\n```\n\n**PydanticAI** (`pip install pydantic-ai`):\n```python\nfrom pydantic_ai import Agent\nfrom pydantic_ai.models.openai import OpenAIChatModel\nfrom pydantic_ai.providers.openai import OpenAIProvider\n\nmodel = OpenAIChatModel(\n    model_name=\"default\",\n    provider=OpenAIProvider(\n        base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\",\n        api_key=\"not-needed\",\n    ),\n)\nagent = Agent(model)\nprint(agent.run_sync(\"What is 2+2?\").output)\n```\n\n**smolagents** (`pip install smolagents`):\n```python\nfrom smolagents import CodeAgent, OpenAIServerModel\n\nmodel = OpenAIServerModel(\n    model_id=\"default\",\n    api_base=\"http:\u002F\u002Flocalhost:8000\u002Fv1\",\n    api_key=\"not-needed\",\n)\nagent = CodeAgent(tools=[], model=model)\nagent.run(\"What is 5 multiplied by 7?\")\n```\n\n**LibreChat** (`librechat.yaml`, under `endpoints.custom`):\n```yaml\n- name: \"Rapid-MLX\"\n  apiKey: \"rapid-mlx\"\n  baseURL: \"http:\u002F\u002Flocalhost:8000\u002Fv1\u002F\"\n  models:\n    default: [\"default\"]\n    fetch: true\n  titleConvo: true\n  titleModel: \"current_model\"\n  modelDisplayLabel: \"Rapid-MLX\"\n```\n\n**Anthropic SDK** (`pip install anthropic`):\n```python\nfrom anthropic import Anthropic\nclient = Anthropic(base_url=\"http:\u002F\u002Flocalhost:8000\", api_key=\"not-needed\")\n\nmessage = client.messages.create(\n    model=\"default\",\n    max_tokens=1024,\n    messages=[{\"role\": \"user\", \"content\": \"Say hello\"}],\n)\nprint(message.content[0].text)\n```\n\n\u003C\u002Fdetails>\n\n---\n\n## Choose Your Model\n\n### What fits my Mac?\n\nThe model has to fit in your Mac's RAM. 
If your Mac slows down or Activity Monitor shows red memory pressure, pick a smaller model from the table below.\n\n| Your Mac | Best Model | RAM Used | Speed | Quality |\n|----------|-----------|---------|-------|---------|\n| **16 GB** MacBook Air\u002FPro | [Qwen3.5-4B 4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-4B-MLX-4bit) | 2.4 GB | 160 tok\u002Fs | Good for chat and simple tasks |\n| **24 GB** MacBook Pro | [Qwen3.5-9B 4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-9B-4bit) | 5.1 GB | 108 tok\u002Fs | Great all-rounder |\n| **32 GB** Mac Mini \u002F Studio | [Qwen3.5-27B 4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-27B-4bit) | 15.3 GB | 39 tok\u002Fs | Solid coding model |\n| **32 GB** Mac Mini \u002F Studio | 🆕 [Nemotron-Nano 30B 4bit](https:\u002F\u002Fhuggingface.co\u002Flmstudio-community\u002FNVIDIA-Nemotron-3-Nano-30B-A3B-MLX-4bit) | 18 GB | 141 tok\u002Fs | Fastest 30B, 100% tool calling |\n| **32 GB** Mac Mini \u002F Studio | [Qwen3.6-35B-A3B 4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.6-35B-A3B-4bit) | 20 GB | 95 tok\u002Fs | 256 MoE experts, 262K context |\n| **36 GB** MacBook Pro M3\u002FM4 Pro | [Qwen3.5-27B 4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-27B-4bit) | 15.3 GB | 39 tok\u002Fs | Same as 32 GB — extra headroom for long contexts |\n| **48 GB** Mac Mini \u002F Studio | [Qwen3.5-35B-A3B 8bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-35B-A3B-8bit) | 37 GB | 83 tok\u002Fs | **Sweet spot** — smart + fast |\n| **64 GB** Mac Mini \u002F Studio | [Qwen3.5-35B-A3B 8bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-35B-A3B-8bit) | 37 GB | 83 tok\u002Fs | Same model, more room for KV cache |\n| **96 GB** Mac Studio \u002F Pro | [Qwen3.5-122B mxfp4](https:\u002F\u002Fhuggingface.co\u002Fnightmedia\u002FQwen3.5-122B-A10B-Text-mxfp4-mlx) | 65 GB | 57 tok\u002Fs | Best model, fits 
comfortably |\n| **128 GB** Mac Studio \u002F Pro | 🆕 [DeepSeek V4 Flash 2-bit DQ](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FDeepSeek-V4-Flash-2bit-DQ) | 91 GB | 56 tok\u002Fs | 158B-A13B frontier MoE, day-0 (chat only) |\n| **192 GB** Mac Studio \u002F Pro | [Qwen3.5-122B 8bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-122B-A10B-8bit) | 130 GB | 44 tok\u002Fs | Maximum quality |\n| **256 GB** Mac Studio Ultra | 🆕 [DeepSeek V4 Flash 8-bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FDeepSeek-V4-Flash-8bit) | 136 GB | 31 tok\u002Fs | 158B-A13B frontier MoE, 1M context (chat only) |\n\n> **4bit vs 8bit:** 4bit models are compressed to use less memory (recommended for most users). 8bit models are higher quality but need more RAM. \"mxfp4\" is a high-quality 4bit format.\n\n### Copy-paste commands\n\nPick the one that matches your Mac. Short aliases work — run `rapid-mlx models` to see all available models.\n\n```bash\n# 16 GB — lightweight, fast\nrapid-mlx serve qwen3.5-4b --port 8000\n\n# 24 GB — best small model\nrapid-mlx serve qwen3.5-9b --port 8000\n\n# 32 GB — solid coding model\nrapid-mlx serve qwen3.5-27b --port 8000\n\n# 32 GB — Nemotron Nano (fastest 30B, 141 tok\u002Fs, NVIDIA MoE)\nrapid-mlx serve nemotron-30b --port 8000\n\n# 32+ GB — Qwen 3.6 (256 experts, 262K context)\nrapid-mlx serve qwen3.6-35b --port 8000\n\n# 64 GB — sweet spot\nrapid-mlx serve qwen3.5-35b --prefill-step-size 8192 --port 8000  # faster first response\n\n# 96+ GB — best model\nrapid-mlx serve qwen3.5-122b --prefill-step-size 8192 --port 8000\n\n# Coding agent — fast MoE, great for Claude Code \u002F Cursor\nrapid-mlx serve qwen3-coder --prefill-step-size 8192 --port 8000  # MoE = only uses part of the model, so it's fast\n\n# Vision — image understanding (see note below)\nrapid-mlx serve qwen3-vl-4b --mllm --port 8000\n```\n\n> **Vision deps:** Install into the same environment where rapid-mlx lives:\n> - `install.sh` users: 
`~\u002F.rapid-mlx\u002Fbin\u002Fpip install 'rapid-mlx[vision]'`\n> - `pip` users: `pip install 'rapid-mlx[vision]'` (in the same venv)\n> - `brew` users: `$(brew --prefix)\u002Fopt\u002Frapid-mlx\u002Flibexec\u002Fbin\u002Fpip install 'rapid-mlx[vision]'`\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Parser auto-detection & manual overrides\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nParsers are **auto-detected from the model name** — you don't need to specify `--tool-call-parser` or `--reasoning-parser` for supported families. Explicit flags always override auto-detection.\n\n| Model Family | Auto-detected `--tool-call-parser` | Auto-detected `--reasoning-parser` | Notes |\n|-------------|---------------------|---------------------|-------|\n| Qwen3.5 (all sizes) | `hermes` | `qwen3` | **Recommended** — 100% tool calling |\n| 🆕 Qwen3.6 | `qwen3_coder_xml` | `qwen3` | XML tool format, 262K context |\n| Qwen3-Coder-Next | `hermes` | *(none)* | Fast coding, non-thinking mode |\n| DeepSeek R1-0528 \u002F V3.1 | `deepseek_v31` | `deepseek_r1` | Dedicated V3.1 parser |\n| DeepSeek R1 (older) | `deepseek` | `deepseek_r1` | With reasoning |\n| DeepSeek V3 \u002F V2.5 | `deepseek` | *(none)* | No reasoning parser |\n| GLM-4.7 | `glm47` | *(none)* | 100% tool calling |\n| MiniMax-M2.5 | `minimax` | `minimax` | XML tool format |\n| GPT-OSS | `harmony` | `harmony` | Native format |\n| Kimi-Linear | `kimi` | *(none)* | Kimi tool format |\n| Llama 3.x | `llama` | *(none)* | JSON tool format |\n| Mistral \u002F Devstral | `hermes` | *(none)* | Hermes-compatible |\n| Gemma | `hermes` | *(none)* | Hermes-compatible |\n| Phi-3\u002F4 | `hermes` | *(none)* | Hermes-compatible |\n\nAll 17 parsers include automatic recovery — if a quantized model outputs broken tool calls as text, they're auto-converted back to structured format.\n\n\u003C\u002Fdetails>\n\n---\n\n## Benchmarks\n\nTested on **Mac Studio M3 Ultra (256GB)**. 
Rapid-MLX uses Apple's [MLX framework](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx) — purpose-built for unified memory with native Metal compute kernels — which is why it beats C++-based engines (Ollama, llama.cpp) on most models. Ollama numbers tested with **v0.20.4** (latest, with MLX backend).\n\n| Model | Rapid-MLX | Best Alternative | Speedup |\n|-------|----------|-----------------|---------|\n| **Phi-4 Mini 14B** | **180** tok\u002Fs | 77 (mlx-lm) \u002F 56 (Ollama) | **2.3x** \u002F **3.2x** |\n| **Qwen3.5-4B** | **160** tok\u002Fs | 155 (mlx-lm serve) | **1.0x** |\n| **Nemotron-Nano 30B** | **141** tok\u002Fs · 100% tools | — | — |\n| 🆕 **DeepSeek V4 Flash 158B-A13B** (2-bit DQ) | **56** tok\u002Fs | — (only MLX engine, day-0) | — |\n| 🆕 **DeepSeek V4 Flash 158B-A13B** (8-bit) | **31** tok\u002Fs | — (only MLX engine, day-0) | — |\n| **GPT-OSS 20B** | **127** tok\u002Fs · 100% tools | 79 (mlx-lm serve) | **1.6x** |\n| **Qwen3.5-9B** | **108** tok\u002Fs | 41 (Ollama) | **2.6x** |\n| **Qwen3.6-35B-A3B** | **95** tok\u002Fs · 100% tools | — | — |\n| **Kimi-Linear-48B** | **94** tok\u002Fs · 100% tools | — (only engine) | — |\n| **Gemma 4 26B-A4B** | **85** tok\u002Fs | 68 (Ollama) | **1.3x** |\n| **Gemma 4 E4B** | **83** tok\u002Fs | — | — |\n| **Qwen3.5-35B-A3B** | **83** tok\u002Fs · 100% tools | 75 (oMLX) | **1.1x** |\n| **Qwen3-Coder 80B** | **74** tok\u002Fs · 100% tools | 69 (mlx-lm serve) | **1.1x** |\n| **Qwen3.5-122B** | **44** tok\u002Fs · 100% tools | 43 (mlx-lm serve) | ~1.0x |\n| **Gemma 4 31B** | **31** tok\u002Fs | — | — |\n\n*Full benchmark data with all models, TTFT tables, DeltaNet snapshots, and engine comparison below.*\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>TTFT — Prompt Cache Advantage\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nPrompt cache keeps multi-turn conversations fast. For standard transformers, KV cache trimming gives sub-100ms TTFT. 
For hybrid RNN models (Qwen3.5 DeltaNet), we use state snapshots — the first technique to bring prompt cache to non-trimmable architectures on MLX.\n\n**Pure KV cache (transformers):**\n\n| Model | Rapid-MLX (cached) | mlx-lm serve | Speedup |\n|-------|-------------------|-------------------|---------|\n| Kimi-Linear-48B | **0.08s** | — | — |\n| Llama 3.2 3B | **0.10s** | — | — |\n| Hermes-3-Llama 8B | **0.10s** | 0.18s | 1.8x |\n| Phi-4 Mini 14B | **0.13s** | 0.15s | 1.2x |\n| Devstral-Small-2 24B | **0.13s** | 0.38s | 2.9x |\n| Mistral Small 24B | **0.13s** | 0.38s | 2.9x |\n| GLM-4.7-Flash 9B | **0.13s** | 0.23s | 1.8x |\n| GLM-4.5-Air | **0.14s** | 0.47s | 3.4x |\n| Qwen3-Coder-Next 80B | **0.16s** | 0.27s | 1.7x |\n| GPT-OSS 20B | **0.16s** | 0.27s | 1.7x |\n| Qwen3.5-9B | **0.22s** | 0.26s | 1.2x |\n| Gemma 4 E4B | **0.25s** | — (day-0) | — |\n| Gemma 4 26B-A4B | **0.25s** | — (day-0) | — |\n| Gemma 4 31B | **0.34s** | 0.57s (mlx-vlm bf16) | **1.7x** |\n\n**DeltaNet state snapshots (hybrid RNN + attention):**\n\nQwen3.5 uses Gated DeltaNet (75% RNN) + full attention (25% KV). 
Other engines recreate the entire cache from scratch every request — we snapshot the RNN state at the system prompt boundary, restoring in ~0.1ms instead of re-running hundreds of tokens through the recurrent layers.\n\n| Model | Cold TTFT | Snapshot TTFT | Speedup |\n|-------|-----------|---------------|---------|\n| Qwen3-Coder-Next 6bit (48L) | 0.66s | **0.16s** | **4.3x** |\n| Qwen3.5-35B-A3B 8bit (40L) | 0.49s | **0.19s** | **2.6x** |\n| Qwen3.5-27B 4bit (40L) | 0.58s | **0.27s** | **2.1x** |\n| Qwen3.5-9B 4bit (40L) | 0.27s | **0.22s** | **1.2x** |\n| Qwen3.5-4B 4bit (32L) | 0.24s | **0.16s** | **1.5x** |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Capability Comparison\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Feature | Rapid-MLX | oMLX | Ollama | llama.cpp | mlx-lm serve |\n|---------|-----------|------|--------|-----------|-------------|\n| **Tool calling** | 100% (Qwen\u002FGLM\u002FGPT-OSS\u002FKimi) | N\u002FA | 100% (Qwen) | 80% (Phi-4) | N\u002FA |\n| **Tool call recovery** | 100% | N\u002FA | 100% | 100% | N\u002FA |\n| **Tool injection fallback** | Yes | No | No | No | No |\n| **Think-tag leak** | 0% | N\u002FA | 0% | 0% | N\u002FA |\n| **Prompt cache** | KV + DeltaNet | No | No | No | No |\n| **Vision** | Yes | Yes | Yes | No | No |\n| **Audio (STT\u002FTTS)** | Yes | No | No | No | No |\n| **17 tool parsers** | Yes | No | No | No | No |\n| **Cloud routing** | Yes | No | No | No | No |\n| **Streaming** | Yes | Yes | Yes | Yes | Yes |\n| **OpenAI API** | Yes | Yes | Yes | Yes | Yes |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Optimization Techniques Per Model\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Technique | What it does | Models |\n|-----------|-------------|--------|\n| **KV prompt cache** | Trim KV cache to common prefix, skip re-prefill | All transformer models |\n| **DeltaNet state snapshots** | Deep-copy RNN state at prefix boundary, restore in ~0.1ms | Qwen3.5 (4B, 9B, 27B, 35B, 
122B), Qwen3-Coder-Next |\n| **Hybrid cache sync** | Keep trimmable KV + non-trimmable RNN layers in sync | Qwen3.5 (Gated DeltaNet + attention) |\n| **Tool logits bias** | Jump-forward decoding — bias logits toward structured tokens | All models with `--enable-tool-logits-bias` |\n| **Auto tool recovery** | Detect broken text-format tool calls, convert to structured | All 18 parser formats (incl. Gemma 4) |\n| **TurboQuant V-cache** | Rotate + Lloyd-Max compress V cache (86% savings on dense models) | All models with `--kv-cache-turboquant` |\n| **KV cache quantization** | Quantize prefix cache entries to reduce memory | All models with `--kv-cache-quantization` |\n| **Prefill chunking** | Configurable step size for large-prompt throughput | All models |\n| **Cloud routing** | Offload high-token requests to cloud LLM when local is slow | All models with `--cloud-model` |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Eval benchmarks (20 models, 4 suites)\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nTool calling (30 scenarios), coding (HumanEval+), reasoning (MATH-500), general knowledge (MMLU-Pro). Top models:\n\n| Model | Decode | Tools | Code | Reason | General | Avg |\n|-------|--------|-------|------|--------|---------|-----|\n| Qwen3.5-122B 8bit | 44 t\u002Fs | 87% | 90% | 90% | 90% | **89%** |\n| Qwen3.5-35B 8bit | 83 t\u002Fs | 90% | 90% | 80% | 80% | **85%** |\n| Qwen3-Coder-Next 4bit | 74 t\u002Fs | 90% | 90% | 70% | 70% | **80%** |\n| Qwen3.5-27B 4bit | 39 t\u002Fs | 83% | 90% | 50% | 80% | **76%** |\n| Qwen3.5-9B 4bit | 108 t\u002Fs | 83% | 70% | 60% | 70% | **71%** |\n\nRun your own: `python scripts\u002Fbenchmark_engines.py --engine rapid-mlx ollama --runs 3`\n\n\u003C\u002Fdetails>\n\n---\n\n## Features\n\n### Tool Calling\n\nFull OpenAI-compatible tool calling with 17 parser formats and **automatic recovery when quantized models break**. 
4-bit quantized models degrade after multiple tool rounds; Rapid-MLX auto-detects broken output and converts it back to structured `tool_calls`.\n\n### Reasoning Separation\n\nModels with chain-of-thought (Qwen3, DeepSeek-R1) output reasoning in a separate `reasoning_content` field — cleanly separated from `content` in streaming mode. Works with Qwen3, DeepSeek-R1, MiniMax, and GPT-OSS reasoning formats.\n\n### Prompt Cache\n\nPersistent cache across requests — only new tokens are prefilled on each turn. For standard transformers, the KV cache is trimmed to the common prefix. For hybrid models (Qwen3.5 DeltaNet), RNN state snapshots restore non-trimmable layers from memory instead of re-computing them. 2-5x faster TTFT on all architectures. Always on, no flags needed.\n\n### Smart Cloud Routing\n\nLarge-context requests auto-route to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be slow. The routing decision is based on the number of new tokens left to prefill after a cache hit. Example flags: `--cloud-model openai\u002Fgpt-5 --cloud-threshold 20000`\n\n### Multimodal\n\nVision, audio (STT\u002FTTS), video understanding, and text embeddings — all through the same OpenAI-compatible API.\n\nAlso: logprobs API, structured JSON output (`response_format`), continuous batching, KV cache quantization (`--kv-cache-quantization`), and [2100+ tests](tests\u002F).\n\n---\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Server Flags Reference\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n> You don't need any flags to get started — the defaults work for most setups. 
These are for advanced tuning.\n\n### Core\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `\u003Cmodel>` | HuggingFace model name, local path, or alias (positional arg) | *(required)* |\n| `--host` | Host to bind to | `0.0.0.0` |\n| `--port` | Port to bind to | `8000` |\n| `--max-tokens` | Default max tokens for generation | `32768` |\n\n### Tool Calling & Reasoning\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--tool-call-parser` | Parser: `hermes`, `minimax`, `qwen`, `llama`, `deepseek`, etc. | *(auto-detected)* |\n| `--reasoning-parser` | Parser: `qwen3`, `deepseek_r1`, `minimax`, `gpt_oss` | *(auto-detected)* |\n| `--enable-tool-logits-bias` | Jump-forward decoding for faster tool calls | off |\n\n### Performance\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--prefill-step-size` | Tokens per prefill chunk | `2048` |\n| `--kv-cache-turboquant` | TurboQuant V-cache compression (3-4 bit, 86% savings on dense models) | off |\n| `--kv-cache-quantization` | Quantize prefix cache entries for memory savings | off |\n| `--enable-prefix-cache` | Cache common prefixes across requests | off |\n| `--gpu-memory-utilization` | Fraction of device memory to use (0.0-1.0) | `0.90` |\n\n### Cloud Routing\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--cloud-model` | litellm model string (e.g. 
`openai\u002Fgpt-5`) | *(disabled)* |\n| `--cloud-threshold` | New token threshold to trigger cloud routing | `20000` |\n\n### Security & Other\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--api-key` | API key for authentication | *(no auth)* |\n| `--rate-limit` | Requests per minute per client | *(unlimited)* |\n| `--timeout` | Request timeout in seconds | `300` |\n| `--mllm` | Force multimodal (vision) mode | auto-detect |\n| `--mcp-config` | MCP configuration file for tool integration | *(none)* |\n| `--embedding-model` | Pre-load embedding model at startup | *(none)* |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Common Issues\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n**\"parameters not found in model\" warnings at startup** — Normal for VLMs. Vision weights are auto-skipped.\n\n**Out of memory \u002F very slow (\u003C5 tok\u002Fs)** — Model too big. Check [What fits my Mac?](#what-fits-my-mac) Try a smaller quantization (4bit) or smaller model.\n\n**Empty responses** — Remove `--reasoning-parser` for non-thinking models.\n\n**Tool calls as plain text** — Set the correct `--tool-call-parser` for your model. Even without it, Rapid-MLX auto-recovers most cases.\n\n**Other issues?** Run `rapid-mlx doctor` for self-diagnostics.\n\n**Slow first response** — Two different causes: (1) Qwen3.5 models reason before answering — add `--no-thinking` to skip reasoning for faster responses, or (2) cold start on long prompts — add `--prefill-step-size 8192` to speed up processing. Subsequent turns hit prompt cache and are 10-30x faster.\n\n**Server hangs after client disconnect** — Fixed in v0.3.0+. Upgrade to latest.\n\n\u003C\u002Fdetails>\n\n---\n\n## Optional Extras\n\nThe base `pip install rapid-mlx` is ~460 MB and covers all text-only models. 
Vision, audio, and other features ship as opt-in extras:\n\n| Extra | Install | Adds | What it unlocks |\n|---|---|---|---|\n| `vision` | `pip install 'rapid-mlx[vision]'` | ~322 MB | Gemma 4, Qwen-VL, video understanding (mlx-vlm + opencv + torch) |\n| `audio` | `pip install 'rapid-mlx[audio]'` | ~600 MB | TTS \u002F STT (mlx-audio + spacy + scipy) |\n| `embeddings` | `pip install 'rapid-mlx[embeddings]'` | ~50 MB | `\u002Fv1\u002Fembeddings` endpoint (mlx-embeddings) |\n| `chat` | `pip install 'rapid-mlx[chat]'` | ~150 MB | Built-in Gradio chat UI |\n| `guided` | `pip install 'rapid-mlx[guided]'` | ~80 MB | Schema-constrained JSON generation (outlines) |\n| `all` | `pip install 'rapid-mlx[all]'` | ~1.1 GB | Vision + audio + chat + embeddings |\n\nIf you installed via Homebrew and want vision\u002Faudio support, use `pip install 'rapid-mlx[vision]'` (or `[audio]`) inside your own Python 3.10+ venv — that gives you the full feature set without rebuilding the brew formula.\n\n---\n\n## Troubleshooting\n\nRun the built-in self-diagnostic (works from `pip install`, no dev tools needed):\n\n```bash\nrapid-mlx doctor\n```\n\n```\nRapid-MLX Doctor\n============================================================\n  [metal] OK        # Apple Silicon Metal GPU available\n  [imports] OK      # Core modules import cleanly\n  [cli] OK          # CLI commands respond\n  [model_load] OK   # Inference pipeline works\nResult: PASS\n```\n\n---\n\n## Development\n\n### Quick start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX.git\ncd Rapid-MLX\npip install -e \".[dev]\"\n```\n\n### Testing\n\nTwo layers: **user-facing doctor** (ships with pip) and **dev test suite** (source checkout only).\n\n#### Dev test commands\n\n| Command | What | Time | Needs server? 
|\n|---------|------|------|---------------|\n| `make lint` | ruff lint | ~10s | No |\n| `make test` | pytest unit suite (2000+ tests) | ~30s | No |\n| `make smoke` | lint + unit | ~1 min | No |\n| `make stress` | 8-scenario stress test | ~5 min | Yes |\n| `make soak` | 10-min agent soak test | 10 min | Yes |\n\nFor stress\u002Fsoak, start a server first:\n```bash\nrapid-mlx serve mlx-community\u002FQwen3.5-4B-MLX-4bit --enable-auto-tool-choice --tool-call-parser hermes\n# In another terminal:\nmake stress\n```\n\nOr use the script directly for more options:\n```bash\npython scripts\u002Fdev_test.py smoke              # lint + unit\npython scripts\u002Fdev_test.py stress --port 8000 # custom port\npython scripts\u002Fdev_test.py full               # everything\n```\n\n#### Regression harness (multi-model)\n\n```bash\nmake check              # 1 model (~10 min, auto starts server)\nmake full               # 3 models + 11 agent profiles (~1 hr)\nmake benchmark          # all local models (overnight)\n```\n\n### Architecture\n\n```\nvllm_mlx\u002F\n  server.py              # App factory + model loading + CLI (1047 lines)\n  config\u002F                # ServerConfig singleton\n  service\u002F\n    helpers.py           # Shared request helpers\n    postprocessor.py     # Streaming pipeline (100% test coverage)\n  routes\u002F\n    chat.py              # \u002Fv1\u002Fchat\u002Fcompletions\n    completions.py       # \u002Fv1\u002Fcompletions\n    anthropic.py         # \u002Fv1\u002Fmessages (Anthropic API)\n    health.py, models.py, embeddings.py, audio.py, mcp_routes.py\n  engine\u002F                # BatchedEngine (continuous batching)\n  reasoning\u002F             # 7 reasoning parsers (Qwen3, DeepSeek, MiniMax, ...)\n  tool_parsers\u002F          # 20+ tool call parsers\n  agents\u002F                # 11 agent profiles (YAML)\n  runtime\u002F               # Model registry, cache persistence\n  doctor\u002F                # User self-diagnostic\nscripts\u002F   
              # Dev-only (NOT shipped with pip)\n  dev_test.py            # Unified test entry point\n  stress_test.py         # 8-scenario stress test\n  agent_soak_test.py     # 10-min agent soak test\n  cross_model_stress.py  # Multi-model validation\ntests\u002F                   # pytest unit tests (2000+)\nharness\u002F                 # Regression baselines + thresholds\n```\n\n---\n\n## Roadmap\n\n| Technique | Expected Gain | Status |\n|-----------|---------------|--------|\n| [Standard Speculative Decode](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.01318) — draft model acceleration | 1.5-2.3x decode | Not started |\n| [EAGLE-3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01840) — feature-level draft on Metal | 3-6.5x decode | Not started |\n| [ReDrafter](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09919) — Apple's RNN draft head | 1.4-1.5x decode | Not started |\n\n---\n\n## Contributing\n\nWe welcome contributions of all sizes! See [CONTRIBUTING.md](CONTRIBUTING.md) for setup and guidelines.\n\n**Easy first contributions** (no model download needed):\n- [Add a model alias](https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX\u002Fissues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) — map a short name to a HuggingFace model ID\n- [Request model support](https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX\u002Fissues\u002Fnew?template=model_support.yml) — tell us which model you want\n\n**Testing contributions** (needs a Mac with Apple Silicon):\n- Benchmark a model and share results\n- Test with your favorite AI client (Cursor, Aider, LangChain, etc.)\n- [Report a bug](https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX\u002Fissues\u002Fnew?template=bug_report.yml)\n\n### Contributors\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fraullenchai\u002FRapid-MLX\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Fcontrib.rocks\u002Fimage?repo=raullenchai\u002FRapid-MLX\" 
\u002F>\n\u003C\u002Fa>\n\n## Star History\n\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#raullenchai\u002FRapid-MLX&Date\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=raullenchai\u002FRapid-MLX&type=Date&theme=dark\" \u002F>\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=raullenchai\u002FRapid-MLX&type=Date\" \u002F>\n    \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=raullenchai\u002FRapid-MLX&type=Date\" \u002F>\n  \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n## License\n\nApache 2.0. See [LICENSE](LICENSE).\n","Rapid-MLX is a local AI engine built for Apple Silicon that aims to deliver faster inference than other solutions. The project is written in Python, supports a range of large language models such as Qwen and Nemotron-Nano, and is optimized to run efficiently on Mac hardware, reaching several hundred tokens per second. It provides 17 tool parsers, prompt caching, and reasoning separation, and can serve as a full replacement for the OpenAI API, which suits applications that need fast responses and are cost-sensitive, such as coding assistance and chatbots. For users who want to run AI models offline on a Mac without relying on cloud services or paying API fees, Rapid-MLX is an ideal choice.",2,"2026-06-11 02:30:50","trending"]