[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2119":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":16,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},2119,"CoreML-LLM","john-rocky\u002FCoreML-LLM","john-rocky","Run LLMs on Apple devices with CoreML, optimized for Apple Neural Engine + GPU",null,"Python",147,18,2,0,3,9,21,3.84,"MIT License",false,"main",true,[],"2026-06-12 02:00:37","# CoreML-LLM\n\n**On-device LLMs on the Apple Neural Engine.** Run Gemma 4, Qwen3.5, Qwen3-VL, FunctionGemma, EmbeddingGemma, and Liquid AI's LFM2.5 on iPhone with CoreML — ANE-first, battery-friendly, no server.\n\nWhere [MLX Swift](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-swift) is the right call when you want maximum GPU throughput, CoreML-LLM is what you use when the LLM should live on the **ANE** so the GPU stays free for the rest of the app.\n\n[![App Store](https:\u002F\u002Ftoolbox.marketingtools.apple.com\u002Fapi\u002Fv2\u002Fbadges\u002Fdownload-on-the-app-store\u002Fblack\u002Fen-us?releaseDate=1735689600)](https:\u002F\u002Fapps.apple.com\u002Fjp\u002Fapp\u002Fmodels-zoo\u002Fid6762083207)\n\n## Use in your app\n\nAdd the package, name a model, generate.\n\n```swift\n\u002F\u002F Package.swift\n.package(url: \"https:\u002F\u002Fgithub.com\u002Fjohn-rocky\u002FCoreML-LLM\", from: \"1.9.0\")\n```\n\n```swift\nimport CoreMLLLM\n\nlet llm = try await CoreMLLLM.load(repo: \"lfm2.5-350m\")\nlet answer 
= try await llm.generate(\"What is the capital of France?\")\n```\n\n`repo:` accepts a registered model id (`\"gemma4-e2b\"`, `\"qwen3.5-0.8b\"`, `\"lfm2.5-350m\"`, …) or a full HuggingFace path — first call downloads, later calls reuse the on-device bundle. Streaming, multi-turn chat, image \u002F video \u002F audio, FunctionGemma, EmbeddingGemma → [package docs (Quick Start → Swift Package)](#swift-package).\n\n## Models\n\n| Model | Size | Task | iPhone 17 Pro decode | HuggingFace |\n|---|---:|---|---:|---|\n| **Gemma 4 E2B** | 5.4 GB (4.4 GB text-only) | Text + image + video + audio | **34.2 tok\u002Fs** | [mlboydaisuke\u002Fgemma-4-E2B-coreml](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fgemma-4-E2B-coreml) |\n| **Gemma 4 E4B** | 8.16 GB multimodal \u002F 5.5 GB text-only | Text + image + video + audio | **15.7 tok\u002Fs** | [multimodal](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fgemma-4-E4B-multimodal-coreml) · [text-only](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fgemma-4-E4B-coreml) |\n| **Qwen3.5 2B** | 2.8 GB | Text | **~27 tok\u002Fs** | [mlboydaisuke\u002Fqwen3.5-2B-CoreML](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fqwen3.5-2B-CoreML) |\n| **Qwen3.5 0.8B** | 1.2 GB | Text | **~48 tok\u002Fs** | [mlboydaisuke\u002Fqwen3.5-0.8B-CoreML](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fqwen3.5-0.8B-CoreML) |\n| **Qwen3-VL 2B (stateful)** | 2.3 GB | Text + image | **~24 tok\u002Fs** | [mlboydaisuke\u002Fqwen3-vl-2b-stateful-coreml](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fqwen3-vl-2b-stateful-coreml) |\n| **LFM2.5 350M** [†](#lfm2-license) | 810 MB | Text | **52 tok\u002Fs** | [mlboydaisuke\u002Flfm2.5-350m-coreml](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Flfm2.5-350m-coreml) |\n| **FunctionGemma-270M** | 850 MB | Function calling | (specialist) | 
[mlboydaisuke\u002Ffunctiongemma-270m-coreml](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Ffunctiongemma-270m-coreml) |\n| **EmbeddingGemma-300M** | 295 MB | Sentence embeddings | (specialist) | [mlboydaisuke\u002Fembeddinggemma-300m-coreml](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fembeddinggemma-300m-coreml) |\n| Qwen3-VL 2B (legacy) | 2.9 GB | Text + image | ~7.5 tok\u002Fs | [mlboydaisuke\u002Fqwen3-vl-2b-coreml](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fqwen3-vl-2b-coreml) |\n| Qwen2.5 0.5B | 302 MB | Text | — | [mlboydaisuke\u002Fqwen2.5-0.5b-coreml](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fqwen2.5-0.5b-coreml) |\n\nAll numbers are iPhone 17 Pro A19 Pro, 2048-token context, ANE-only (no GPU fallback at runtime unless noted). Methodology: [docs\u002FBENCHMARKING.md](docs\u002FBENCHMARKING.md).\n\n**Which one should I pick?**\n- Multimodal (image \u002F video \u002F audio), fastest → **Gemma 4 E2B** (34 tok\u002Fs)\n- Multimodal, highest quality → **Gemma 4 E4B (multimodal)** (15.7 tok\u002Fs)\n- Image + text chat, lowest memory + fastest follow-up → **Qwen3-VL 2B (stateful)**\n- Text-only, maximum quality under ≤3 GB → **Qwen3.5 2B**\n- Text-only, maximum quality → **Gemma 4 E4B (text-only)**\n- Text-only, fast + chat-strong → **Qwen3.5 0.8B** (48 tok\u002Fs)\n- Text-only, smallest at high tok\u002Fs on iPhone → **LFM2.5 350M** (52 tok\u002Fs, 810 MB) [†](#lfm2-license)\n- Tool \u002F function calling → **FunctionGemma-270M**\n- Sentence embeddings \u002F RAG → **EmbeddingGemma-300M**\n\n## Demos\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"50%\">\u003Cb>Text\u003C\u002Fb>\u003Cbr>\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F67584300-ce34-4aa5-b3bd-5521cfe8855a\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"50%\">\u003Cb>Image\u003C\u002Fb>\u003Cbr>\u003Cimg 
src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2a869bf5-8315-422d-8b06-a4a7edecd173\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\u003Cb>Video\u003C\u002Fb>\u003Cbr>\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F1d2a9ff3-2912-40e9-895d-fbaa3c73ee3a\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd align=\"center\">\u003Cb>Audio\u003C\u002Fb>\u003Cbr>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fe8deb6d0-d8b0-4210-885c-5d7a7ddc7ad3\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## Quick Start\n\n### Try it — App Store\n\n**[Models Zoo](https:\u002F\u002Fapps.apple.com\u002Fjp\u002Fapp\u002Fmodels-zoo\u002Fid6762083207)** is a pre-built app shipping CoreML-LLM. Open it, pick a model, download, chat.\n\n### Build from source\n\n```bash\nopen Examples\u002FCoreMLLLMChat\u002FCoreMLLLMChat.xcodeproj\n```\n\nSet your development team → build to an iOS 18+ device → **Get Model** → download → chat. 
Compute units default to `.cpuAndNeuralEngine` (ANE).\n\n### Swift Package\n\n```swift\ndependencies: [\n    .package(url: \"https:\u002F\u002Fgithub.com\u002Fjohn-rocky\u002FCoreML-LLM\", from: \"1.9.0\"),\n]\n```\n\n```swift\nimport CoreMLLLM\n\n\u002F\u002F Download + load in one call\nlet llm = try await CoreMLLLM.load(model: .gemma4e2b) { print($0) }\n\n\u002F\u002F Simple \u002F streaming \u002F multi-turn\nlet answer = try await llm.generate(\"What is the capital of France?\")\nfor await tok in try await llm.stream(\"Tell me a story\") { print(tok, terminator: \"\") }\n\nlet messages: [CoreMLLLM.Message] = [\n    .init(role: .user, content: \"Hi!\"),\n    .init(role: .assistant, content: \"Hello!\"),\n    .init(role: .user, content: \"What is 2+2?\"),\n]\nfor await tok in try await llm.stream(messages) { print(tok, terminator: \"\") }\n\n\u002F\u002F Multimodal (Gemma 4)\nlet caption   = try await llm.generate(\"Describe this image\", image: cgImage)\nlet transcript = try await llm.generate(\"What did they say?\", audio: pcmSamples)\nlet analysis   = try await llm.generate(\n    \"Describe this video frame by frame.\",\n    videoURL: URL(fileURLWithPath: \"\u002Fpath\u002Fto\u002Fclip.mp4\"),\n    videoOptions: .init(fps: 1.0, maxFrames: 6))\n\n\u002F\u002F Fastest decode on iPhone 17 Pro A19 Pro: opt into the 3-chunk path.\n\u002F\u002F Set in the Xcode scheme: Environment Variables → LLM_3CHUNK = 1.\n\u002F\u002F +8.2 % tok\u002Fs, bit-equivalent to the default 4-chunk decode.\n```\n\nDownloads run in the background via `URLSessionConfiguration.background` with pause\u002Fresume support:\n\n```swift\nlet url = try await ModelDownloader.shared.download(.gemma4e2b)\nModelDownloader.shared.pause()\nModelDownloader.shared.resumeDownload()\n```\n\n### FunctionGemma + EmbeddingGemma\n\nTwo specialists with their own narrow Swift APIs. 
Ship them alongside a chat model (Gemma 4, Qwen3.5) for tool calling + RAG.\n\n```swift\nimport CoreMLLLM\n\nlet dir = FileManager.default\n    .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]\n\n\u002F\u002F Function calling (850 MB, ≥ 92% ANE, batched prefill T=32)\nlet fg = try await FunctionGemma.downloadAndLoad(modelsDir: dir)\nlet (text, call) = try fg.generateFunctionCall(\n    userPrompt: \"Turn on the flashlight\",\n    tools: [[\n        \"type\": \"function\",\n        \"function\": [\n            \"name\": \"toggle_flashlight\",\n            \"description\": \"Turn the phone flashlight on or off.\",\n            \"parameters\": [\"type\": \"object\", \"properties\": [:], \"required\": []],\n        ],\n    ]])\n\u002F\u002F call = \"call:toggle_flashlight{}\"\n\n\u002F\u002F Embeddings (295 MB, 99.80% ANE, Matryoshka 768\u002F512\u002F256\u002F128)\nlet eg = try await EmbeddingGemma.downloadAndLoad(modelsDir: dir)\nlet vec = try eg.encode(text: \"How do cats behave?\",\n                        task: .retrievalQuery, dim: 768)\n```\n\nStandalone sample at `Examples\u002FGemma3Demo\u002F` imports `CoreMLLLM` and exercises both without pulling the Gemma 4 chat stack. 
Full I\u002FO contracts in [docs\u002FFUNCTIONGEMMA.md](docs\u002FFUNCTIONGEMMA.md) + [docs\u002FEMBEDDINGGEMMA.md](docs\u002FEMBEDDINGGEMMA.md).\n\n## Convert your own\n\n```bash\ncd conversion\npython3 -m venv .venv && source .venv\u002Fbin\u002Factivate\npip install -r requirements.txt\n\n# Qwen2.5 0.5B (~2 min)\npython convert.py --model qwen2.5-0.5b --output .\u002Foutput\u002Fqwen2.5-0.5b\n\n# Gemma 4 — one-shot bundle builder (chunks + embeds + PLE + RoPE +\n# tokenizer + model_config.json, ready for USB sideload or HF upload)\npython build_gemma4_bundle.py --model gemma4-e2b --ctx 2048\npython build_gemma4_bundle.py --model gemma4-e4b --ctx 2048\n\n# Gemma 4 E2B 3-chunk decode (default since v1.7, +8.2 % tok\u002Fs on iPhone A19 Pro)\npython build_gemma4_3way.py --model gemma4-e2b --ctx 2048\npython install_3way_bundle.py\n\n# Specialists\npython build_functiongemma_bundle.py --ctx 2048 --quantize int8 --prefill-t 32\npython build_embeddinggemma_bundle.py --max-seq-len 128 --quantize int8\n\n# LFM2.5 350M (Liquid AI hybrid attn + short-conv) — sideload-ready bundle\npython build_lfm2_bundle.py --model lfm2.5-350m --l-pad 3\n```\n\nStep-by-step: [docs\u002FADDING_MODELS.md](docs\u002FADDING_MODELS.md). Full reference (quant, `.mlpackage` → `.mlmodelc`, iPhone deployment): [docs\u002FCONVERSION.md](docs\u002FCONVERSION.md). LFM2-specific deep-dive (ChatML template, dual-state ANE blocker, fp16 short-conv drift): [docs\u002FLFM2_CONVERSION_FINDINGS.md](docs\u002FLFM2_CONVERSION_FINDINGS.md).\n\n## Documentation\n\nDesign docs, benchmarks, and per-model conversion notes live in [docs\u002F](docs\u002FREADME.md). 
Start with [docs\u002FARCHITECTURE.md](docs\u002FARCHITECTURE.md) for the chunked decode design, ANE optimizations, MLX comparison, and project layout.\n\n## What's new\n\nCurrent release: **v1.9.0** ([release notes](https:\u002F\u002Fgithub.com\u002Fjohn-rocky\u002FCoreML-LLM\u002Freleases\u002Ftag\u002Fv1.9.0)).\n\n- **v1.9.0** — Gemma 4 E4B multimodal (text + image + video + audio) on iPhone 17 Pro at **15.7 tok\u002Fs** decode. Topology II 3-chunk decode (`chunk1` + `chunk2_3way` + `chunk3_3way`) + legacy 4-chunk `prefill_b8` multifunction with vision-aware bidirectional mask. E4B-built `vision.ane.mlmodelc` (output `[1, 256, 2560]`) + Conformer audio + Swift two-stage projection (1024 → 1536 → 2560, non-square `embed_proj`). New picker entry \"Gemma 4 E4B (multimodal)\" auto-downloads from [`mlboydaisuke\u002Fgemma-4-E4B-multimodal-coreml`](https:\u002F\u002Fhuggingface.co\u002Fmlboydaisuke\u002Fgemma-4-E4B-multimodal-coreml) (~8.16 GB); text-only entry kept at the existing HF repo. Build + sideload guide: [docs\u002FE4B_MULTIMODAL_BUILD.md](docs\u002FE4B_MULTIMODAL_BUILD.md).\n- **v1.8.0** — Qwen3.5 0.8B \u002F 2B full-vocab rep_penalty masks iPhone A18 fp16 ANE reduction bias. 0.8B: 48 tok\u002Fs, 2B: 27 tok\u002Fs on iPhone 17 Pro, all clean output across English + Japanese. +45 % over the prior v1.x ceiling. See [docs\u002FQWEN35_FULL_VOCAB_REP_PENALTY.md](docs\u002FQWEN35_FULL_VOCAB_REP_PENALTY.md).\n- **v1.7.0** — Gemma 4 E2B 3-chunk decode is the picker default + multimodal opt-out toggle. The new `gemma4e2b3way` ModelInfo ships `chunk2_3way` (L8-24 merged) + `chunk3_3way` (L25-34 + lm_head) and re-uses legacy `chunk1` + 4-chunk prefill graphs (vision-aware bidirectional mask preserved). Decode `c1+c2+c4` (chunk3 nil) — 3 ANE dispatches\u002Fstep, **34.2 tok\u002Fs** on iPhone 17 Pro A19 Pro. The 4-chunk legacy entry stays as `Gemma 4 E2B (4-chunk legacy)`. 
ModelPickerView's \"Download Options → Include multimodal\" toggle drops vision\u002Fvideo\u002Faudio encoders + sidecars when off (~1 GB savings, text-only install). finishDownload now hardlinks shared decode↔prefill weights instead of copying (`chunk1↔prefill_chunk1` and `chunk3_3way↔prefill_chunk4`, **−682 MB on disk**).\n- **v1.6.0** — Qwen3-VL 2B stateful Phase 2: cross-turn KV reuse + ANE prewarm. Same-prompt 2nd TTFT **4 s → 125 ms** (~32×), vision-chat 2nd-turn TTFT 125 ms (target was \u003C500 ms). LCP-matched MLState resume + image-pinned-to-first-user-turn prompt builder + per-chunk dummy predict at load (231 ms total).\n- **v1.5.0** — Qwen3-VL 2B stateful Phase 1: MLState + slice_update KV cache + multifunction prefill_b8. **24 tok\u002Fs decode at 256 MB phys_footprint** on iPhone 17 Pro (vs 7.5 tok\u002Fs \u002F 1.7 GB on the v1.3 recurrent build — 3.2× decode, 6.4× memory drop). 4-chunk INT8 + fp16 embed sidecar.\n- **v1.4.0** — Gemma 4 E2B 3-chunk decode (opt-in, `LLM_3CHUNK=1`): 31.6 → **34.2 tok\u002Fs** on iPhone 17 Pro A19 Pro (+8.2 %). Bit-equivalent to 4-chunk by construction. Closes the ANE-ceiling sweep for E2B; five additional lossless probes (SDPA fusion, K=V alias, Topology I boundary search, blockwise palettization, native softmax) all landed as negative results — see [docs\u002FEXPERIMENTS.md](docs\u002FEXPERIMENTS.md).\n- **v1.3.0** — Qwen3-VL 2B (text + vision on ANE, 196 image tokens, DeepStack injection at L0\u002F1\u002F2, interleaved mRoPE for image tokens). 28-layer GQA, 2.9 GB bundle, ~7.5 tok\u002Fs text decode. (Recurrent KV — superseded by v1.5.0 stateful build; kept for backward compatibility.)\n- **v1.2.0** — FunctionGemma-270M (function calling, batched prefill T=32) and EmbeddingGemma-300M (99.80 % ANE, Matryoshka 768\u002F512\u002F256\u002F128). 
Standalone `Gemma3Demo` sample.\n- **v1.1.0** — Qwen3.5 2B (4 INT8 chunks + mmap fp16 embed sidecar, ~200 MB phys_footprint for a 2B-param model).\n- **v1.0.0** — Qwen3.5 0.8B (first hybrid SSM+attention LLM on CoreML, 99.9 % ANE).\n- **v0.8.0** — Gemma 4 E4B (42-layer text decoder, 100 % ANE).\n- **v0.7.0** — Video multimodal (native 384×384 vision encoder, 64 tokens\u002Fframe).\n- **v0.6.2** — Audio multimodal (12-layer Conformer encoder).\n\nFull history: [GitHub Releases](https:\u002F\u002Fgithub.com\u002Fjohn-rocky\u002FCoreML-LLM\u002Freleases).\n\n## Requirements\n\n- **Inference**: iOS 18+ \u002F macOS 15+\n- **Conversion**: Python 3.10–3.12, coremltools 8+, PyTorch 2.2+\n- **Sample apps**: Xcode 16+\n\n## License\n\nMIT for the CoreML-LLM code. Model weights inherit the original licenses (Gemma weights: [Gemma Terms of Use](https:\u002F\u002Fai.google.dev\u002Fgemma\u002Fterms); Qwen weights: Apache 2.0; Qwen3-VL vision weights: Apache 2.0).\n\n\u003Ca id=\"lfm2-license\">\u003C\u002Fa>\n**† LFM2.5 350M** weights are under [LFM Open License v1.0](https:\u002F\u002Fhuggingface.co\u002FLiquidAI\u002FLFM2.5-350M\u002Fblob\u002Fmain\u002FLICENSE) (Liquid AI). Free for non-commercial use, research, and commercial use **up to a US $10M annual revenue threshold**. Above that threshold, see [Liquid AI](https:\u002F\u002Fwww.liquid.ai\u002F) for a separate commercial license.\n","CoreML-LLM is a project for running large language models (LLMs) on Apple devices, optimized specifically for the Apple Neural Engine (ANE) and GPU. Its core features include support for multiple pretrained models such as Gemma 4 and Qwen3.5, and it runs text generation, multimodal processing, and function calling directly on iPhone through CoreML, with no server dependency and low power consumption. It can also be integrated into an app as a Swift package, needing only a few lines of code to load a model and start generating. The project is well suited to applications that need to run complex AI models offline on mobile devices, such as chatbots, image recognition assistants, or voice interaction systems.","2026-06-11 02:48:14","CREATED_QUERY"]