[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-70508":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},70508,"omlx","jundot\u002Fomlx","jundot","LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar",null,"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx","Python",16375,1394,87,428,0,300,815,1400,900,44.43,false,"main",[25,26,27,28,29,30],"apple-silicon","inference-server","llm","macos","mlx","openai-api","2026-06-12 02:02:34","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"docs\u002Fimages\u002Ficon-rounded-dark.svg\" width=\"140\">\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"docs\u002Fimages\u002Ficon-rounded-light.svg\" width=\"140\">\n    \u003Cimg alt=\"oMLX\" src=\"docs\u002Fimages\u002Ficon-rounded-light.svg\" width=\"140\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">oMLX\u003C\u002Fh1>\n\u003Cp align=\"center\">\u003Cb>LLM inference, optimized for your Mac\u003C\u002Fb>\u003Cbr>Continuous batching and tiered KV caching, managed directly from your menu bar.\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fwww.buymeacoffee.com\u002Fjundot\">\u003Cimg src=\"https:\u002F\u002Fcdn.buymeacoffee.com\u002Fbuttons\u002Fv2\u002Fdefault-yellow.png\" alt=\"Buy 
Me A Coffee\" height=\"40\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue\" alt=\"License\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-green\" alt=\"Python 3.10+\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fplatform-Apple%20Silicon-black?logo=apple\" alt=\"Apple Silicon\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"mailto:junkim.dot@gmail.com\">junkim.dot@gmail.com\u003C\u002Fa> · \u003Ca href=\"https:\u002F\u002Fomlx.ai\u002Fme\">https:\u002F\u002Fomlx.ai\u002Fme\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#install\">Install\u003C\u002Fa> ·\n  \u003Ca href=\"#quickstart\">Quickstart\u003C\u002Fa> ·\n  \u003Ca href=\"#features\">Features\u003C\u002Fa> ·\n  \u003Ca href=\"#models\">Models\u003C\u002Fa> ·\n  \u003Ca href=\"#cli-configuration\">CLI Configuration\u003C\u002Fa> ·\n  \u003Ca href=\"https:\u002F\u002Fomlx.ai\u002Fbenchmarks\">Benchmarks\u003C\u002Fa> ·\n  \u003Ca href=\"https:\u002F\u002Fomlx.ai\">oMLX.ai\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>English\u003C\u002Fb> ·\n  \u003Ca href=\"README.zh.md\">中文\u003C\u002Fa> ·\n  \u003Ca href=\"README.ko.md\">한국어\u003C\u002Fa> ·\n  \u003Ca href=\"README.ja.md\">日本語\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fomlx_dashboard.png\" alt=\"oMLX Admin Dashboard\" width=\"800\">\n\u003C\u002Fp>\n\n> *Every LLM server I tried made me choose between convenience and control. 
I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.*\n>\n> *oMLX persists KV cache across a hot in-memory tier and cold SSD tier - even when context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.*\n\n## Install\n\n### macOS App\n\nDownload the `.dmg` from [Releases](https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\u002Freleases), drag to Applications, done. The app includes in-app auto-update, so future upgrades are just one click. Note that the macOS app does not install the `omlx` CLI command. For terminal usage, install via Homebrew or from source.\n\n### Homebrew\n\n```bash\nbrew tap jundot\u002Fomlx https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\nbrew install omlx\n\n# Upgrade to the latest version\nbrew update && brew upgrade omlx\n\n# Run as a background service (auto-restarts on crash)\nbrew services start omlx\n\n# Optional: MCP (Model Context Protocol) support\n\u002Fopt\u002Fhomebrew\u002Fopt\u002Fomlx\u002Flibexec\u002Fbin\u002Fpip install mcp\n```\n\n### From Source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx.git\ncd omlx\npip install -e .          # Core only\npip install -e \".[mcp]\"   # With MCP (Model Context Protocol) support\n```\n\nRequires macOS 15.0+ (Sequoia), Python 3.10+, and Apple Silicon (M1\u002FM2\u002FM3\u002FM4).\n\n## Quickstart\n\n### macOS App\n\nLaunch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it. 
To connect OpenClaw, OpenCode, Codex, Hermes Agent, or Copilot, see [Integrations](#integrations).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002FScreenshot 2026-02-10 at 00.36.32.png\" alt=\"oMLX Welcome Screen\" width=\"360\">\n  \u003Cimg src=\"docs\u002Fimages\u002FScreenshot 2026-02-10 at 00.34.30.png\" alt=\"oMLX Menubar\" width=\"240\">\n\u003C\u002Fp>\n\n### CLI\n\n```bash\nomlx serve --model-dir ~\u002Fmodels\n```\n\nThe server discovers LLMs, VLMs, embedding models, and rerankers from subdirectories automatically. Any OpenAI-compatible client can connect to `http:\u002F\u002Flocalhost:8000\u002Fv1`. A built-in chat UI is also available at `http:\u002F\u002Flocalhost:8000\u002Fadmin\u002Fchat`.\n\n### Homebrew Service\n\nIf you installed via Homebrew, you can run oMLX as a managed background service:\n\n```bash\nbrew services start omlx    # Start (auto-restarts on crash)\nbrew services stop omlx     # Stop\nbrew services restart omlx  # Restart\nbrew services info omlx     # Check status\n```\n\nThe service runs `omlx serve` with zero-config defaults (`~\u002F.omlx\u002Fmodels`, port 8000). To customize, either set environment variables (`OMLX_MODEL_DIR`, `OMLX_PORT`, etc.) or run `omlx serve --model-dir \u002Fyour\u002Fpath` once to persist settings to `~\u002F.omlx\u002Fsettings.json`.\n\nLogs are written to two locations:\n- **Service log**: `$(brew --prefix)\u002Fvar\u002Flog\u002Fomlx.log` (stdout\u002Fstderr)\n- **Server log**: `~\u002F.omlx\u002Flogs\u002Fserver.log` (structured application log)\n\n## Features\n\nSupports text LLMs, vision-language models (VLM), OCR models, embeddings, and rerankers on Apple Silicon.\n\n### Admin Dashboard\n\nWeb UI at `\u002Fadmin` for real-time monitoring, model management, chat, benchmark, and per-model settings. Supports English, Korean, Japanese, Chinese, French, and Russian. 
All CDN dependencies are vendored for fully offline operation.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002FScreenshot 2026-02-10 at 00.45.34.png\" alt=\"oMLX Admin Dashboard\" width=\"720\">\n\u003C\u002Fp>\n\n### Vision-Language Models\n\nRun VLMs with the same continuous batching and tiered KV cache stack as text LLMs. Supports multi-image chat, base64\u002FURL\u002Ffile image inputs, and tool calling with vision context. OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) are auto-detected with optimized prompts.\n\n### Tiered KV Cache (Hot + Cold)\n\nBlock-based KV cache management inspired by vLLM, with prefix sharing and Copy-on-Write. The cache operates across two tiers:\n\n- **Hot tier (RAM)**: Frequently accessed blocks stay in memory for fast access.\n- **Cold tier (SSD)**: When the hot cache fills up, blocks are offloaded to SSD in safetensors format. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fomlx_hot_cold_cache.png\" alt=\"oMLX Hot & Cold Cache\" width=\"720\">\n\u003C\u002Fp>\n\n### Continuous Batching\n\nHandles concurrent requests through mlx-lm's BatchGenerator. The maximum number of concurrent requests is configurable via the CLI or the admin panel.\n\n### Claude Code Optimization\n\nContext scaling support for running smaller-context models with Claude Code. Reported token counts are scaled so that auto-compact triggers at the right time, and SSE keep-alive prevents read timeouts during long prefill.\n\n### Multi-Model Serving\n\nLoad LLMs, VLMs, embedding models, and rerankers within the same server. 
Models are managed through a combination of automatic and manual controls:\n\n- **LRU eviction**: Least-recently-used models are evicted automatically when memory runs low.\n- **Manual load\u002Funload**: Interactive status badges in the admin panel let you load or unload models on demand.\n- **Model pinning**: Pin frequently used models to keep them always loaded.\n- **Per-model TTL**: Set an idle timeout per model to auto-unload after a period of inactivity.\n- **Process memory enforcement**: Total memory limit (default: system RAM - 8GB) prevents system-wide OOM.\n\n### Per-Model Settings\n\nConfigure sampling parameters, chat template kwargs, TTL, model alias, model type override, and more per model directly from the admin panel. Changes apply immediately without server restart.\n\n- **Model alias**: set a custom API-visible name. `\u002Fv1\u002Fmodels` returns the alias, and requests accept both the alias and directory name.\n- **Model type override**: manually set a model as LLM or VLM regardless of auto-detection.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fomlx_ChatTemplateKwargs.png\" alt=\"oMLX Chat Template Kwargs\" width=\"480\">\n\u003C\u002Fp>\n\n### Built-in Chat\n\nChat directly with any loaded model from the admin panel. Supports conversation history, model switching, dark mode, reasoning model output, and image upload for VLM\u002FOCR models.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002FScreenShot_2026-03-14_104350_610.png\" alt=\"oMLX Chat\" width=\"720\">\n\u003C\u002Fp>\n\n\n### Model Downloader\n\nSearch and download MLX models from HuggingFace directly in the admin dashboard. 
Browse model cards, check file sizes, and download with one click.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fdownloader_omlx.png\" alt=\"oMLX Model Downloader\" width=\"720\">\n\u003C\u002Fp>\n\n### Integrations\n\nSet up OpenClaw, OpenCode, Codex, Hermes Agent, Copilot, and Pi directly from the admin dashboard with a single click. No manual config editing required.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fomlx_integrations.png\" alt=\"oMLX Integrations\" width=\"720\">\n\u003C\u002Fp>\n\n### Performance Benchmark\n\nOne-click benchmarking from the admin panel. Measures prefill (PP) and text generation (TG) tokens per second, with partial prefix cache hit testing for realistic performance numbers.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fbenchmark_omlx.png\" alt=\"oMLX Benchmark Tool\" width=\"720\">\n\u003C\u002Fp>\n\n### macOS Menubar App\n\nNative PyObjC menubar app (not Electron). Start, stop, and monitor the server without opening a terminal. Includes persistent serving stats (survives restarts), auto-restart on crash, and in-app auto-update.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002FScreenshot 2026-02-10 at 00.51.54.png\" alt=\"oMLX Menubar Stats\" width=\"400\">\n\u003C\u002Fp>\n\n### API Compatibility\n\nDrop-in replacement for OpenAI and Anthropic APIs. 
Supports streaming usage stats (`stream_options.include_usage`), Anthropic adaptive thinking, and vision inputs (base64, URL).\n\n| Endpoint | Description |\n|----------|-------------|\n| `POST \u002Fv1\u002Fchat\u002Fcompletions` | Chat completions (streaming) |\n| `POST \u002Fv1\u002Fcompletions` | Text completions (streaming) |\n| `POST \u002Fv1\u002Fmessages` | Anthropic Messages API |\n| `POST \u002Fv1\u002Fembeddings` | Text embeddings |\n| `POST \u002Fv1\u002Frerank` | Document reranking |\n| `GET \u002Fv1\u002Fmodels` | List available models |\n\n### Tool Calling & Structured Output\n\nSupports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the `tools` parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:\n\n| Model Family | Format |\n|---|---|\n| Llama, Qwen, DeepSeek, etc. | JSON `\u003Ctool_call>` |\n| Qwen3.5 Series | XML `\u003Cfunction=...>` |\n| Gemma | `\u003Cstart_function_call>` |\n| GLM (4.7, 5) | `\u003Carg_key>\u002F\u003Carg_value>` XML |\n| MiniMax | Namespaced `\u003Cminimax:tool_call>` |\n| Mistral | `[TOOL_CALLS]` |\n| Kimi K2 | `\u003C\\|tool_calls_section_begin\\|>` |\n| Longcat | `\u003Clongcat_tool_call>` |\n\nModels not listed above may still work if their chat template accepts `tools` and their output uses a recognized `\u003Ctool_call>` XML format. For tool-enabled streaming, assistant text is emitted incrementally while known tool-call control markup is suppressed from visible content; structured tool calls are emitted after parsing the completed turn.\n\n## Models\n\nPoint `--model-dir` at a directory containing MLX-format model subdirectories. 
Two-level organization folders (e.g., `mlx-community\u002Fmodel-name\u002F`) are also supported.\n\n```\n~\u002Fmodels\u002F\n├── Step-3.5-Flash-8bit\u002F\n├── Qwen3-Coder-Next-8bit\u002F\n├── gpt-oss-120b-MXFP4-Q8\u002F\n├── Qwen3.5-122B-A10B-4bit\u002F\n└── bge-m3\u002F\n```\n\nModels are auto-detected by type. You can also download models directly from the admin dashboard.\n\n| Type | Models |\n|------|--------|\n| LLM | Any model supported by [mlx-lm](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-lm) |\n| VLM | Qwen3.5 Series, GLM-4V, Pixtral, and other [mlx-vlm](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm) models |\n| OCR | DeepSeek-OCR, DOTS-OCR, GLM-OCR |\n| Embedding | BERT, BGE-M3, ModernBERT |\n| Reranker | ModernBERT, XLM-RoBERTa |\n\n## CLI Configuration\n\n```bash\n# Memory limit for loaded models\nomlx serve --model-dir ~\u002Fmodels --max-model-memory 32GB\n\n# Process-level memory limit (default: auto = RAM - 8GB)\nomlx serve --model-dir ~\u002Fmodels --max-process-memory 80%\n\n# Enable SSD cache for KV blocks\nomlx serve --model-dir ~\u002Fmodels --paged-ssd-cache-dir ~\u002F.omlx\u002Fcache\n\n# Set in-memory hot cache size\nomlx serve --model-dir ~\u002Fmodels --hot-cache-max-size 20%\n\n# Adjust max concurrent requests (default: 8)\nomlx serve --model-dir ~\u002Fmodels --max-concurrent-requests 16\n\n# With MCP tools\nomlx serve --model-dir ~\u002Fmodels --mcp-config mcp.json\n\n# HuggingFace mirror endpoint (for restricted regions)\nomlx serve --model-dir ~\u002Fmodels --hf-endpoint https:\u002F\u002Fhf-mirror.com\n\n# API key authentication\nomlx serve --model-dir ~\u002Fmodels --api-key your-secret-key\n# Localhost-only: skip verification via admin panel global settings\n```\n\nAll settings can also be configured from the web admin panel at `\u002Fadmin`. 
Settings are persisted to `~\u002F.omlx\u002Fsettings.json`, and CLI flags take precedence.\n\n\u003Cdetails>\n\u003Csummary>Architecture\u003C\u002Fsummary>\n\n```\nFastAPI Server (OpenAI \u002F Anthropic API)\n    │\n    ├── EnginePool (multi-model, LRU eviction, TTL, manual load\u002Funload)\n    │   ├── BatchedEngine (LLMs, continuous batching)\n    │   ├── VLMEngine (vision-language models)\n    │   ├── EmbeddingEngine\n    │   └── RerankerEngine\n    │\n    ├── ProcessMemoryEnforcer (total memory limit, TTL checks)\n    │\n    ├── Scheduler (FCFS, configurable concurrency)\n    │   └── mlx-lm BatchGenerator\n    │\n    └── Cache Stack\n        ├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)\n        ├── Hot Cache (in-memory tier, write-back)\n        └── PagedSSDCacheManager (SSD cold tier, safetensors format)\n```\n\n\u003C\u002Fdetails>\n\n## Development\n\n### CLI Server\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx.git\ncd omlx\npip install -e \".[dev]\"\npytest -m \"not slow\"\n```\n\n### macOS App\n\nRequires Python 3.11+ and [venvstacks](https:\u002F\u002Fvenvstacks.lmstudio.ai) (`pip install venvstacks`).\n\n```bash\ncd packaging\n\n# Full build (venvstacks + app bundle + DMG)\npython build.py\n\n# Skip venvstacks (code changes only)\npython build.py --skip-venv\n\n# DMG only\npython build.py --dmg-only\n```\n\nSee [packaging\u002FREADME.md](packaging\u002FREADME.md) for details on the app bundle structure and layer configuration.\n\n## Contributing\n\nContributions are welcome! 
See [Contributing Guide](docs\u002FCONTRIBUTING.md) for details.\n\n- Bug fixes and improvements\n- Performance optimizations\n- Documentation improvements\n\n## License\n\n[Apache 2.0](LICENSE)\n\n## Acknowledgments\n\n- [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx) and [mlx-lm](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-lm) by Apple\n- [mlx-vlm](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm) - Vision-language model inference on Apple Silicon\n- [vllm-mlx](https:\u002F\u002Fgithub.com\u002Fwaybarrios\u002Fvllm-mlx) - oMLX started from vllm-mlx v0.1.0 and evolved significantly with multi-model serving, tiered KV caching, VLM with full paged cache support, an admin panel, and a macOS menu bar app\n- [venvstacks](https:\u002F\u002Fvenvstacks.lmstudio.ai) - Portable Python environment layering for the macOS app bundle\n- [mlx-embeddings](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-embeddings) - Embedding model support for Apple Silicon\n- [dflash-mlx](https:\u002F\u002Fgithub.com\u002Fbstnxbt\u002Fdflash-mlx) - Block diffusion speculative decoding on Apple Silicon\n","oMLX is a large language model inference server optimized for Apple Silicon. It supports continuous batching and SSD caching and can be managed directly from the macOS menu bar. Its core features are continuous batching, tiered KV caching, and intuitive menu bar management, which together make running large language models locally more efficient and convenient. oMLX is written in Python, targets the Apple Silicon architecture, and suits workloads that need high-performance LLM inference, such as the coding assistant Claude Code. For developers who want fast-responding LLM applications with controlled resource usage on a Mac, oMLX is an ideal choice.",2,"2026-06-11 03:32:32","trending"]