[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-132":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},132,"MTPLX","youssofal\u002FMTPLX","youssofal","2.24x decode TPS increase On Qwen 3.6 27B @ temp 0.6 | Native MTP Speculative Decoding On Apple Silicon With No External Drafter.","https:\u002F\u002Fmtplx.com",null,"Python",703,37,13,17,0,22,63,375,66,98.74,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37,38,39],"anthropic-compatible","apple-silicon","inference-engine","local-ai","metal","mlx","mtp","mtplx","native-mtp","openai-compatible","qwen","qwen3-next","speculative-decoding","speculative-sampling","2026-06-12 04:00:02","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"docs\u002Fassets\u002Freadme\u002Fhero.svg\" alt=\"MTPLX\" width=\"100%\" \u002F>\n\n# **Native MTP speculative decoding on Apple Silicon**\n\n**~2.24× over no-MTP AR at `temp=0.6`** on Qwen3.6-27B · math-correct rejection sampling · MLX-native · zero external drafter\n\n\u003Csub>Multiplier is measured in paired same-machine runs. 
Absolute tok\u002Fs scales with memory bandwidth — current public record on M5 Max: **63.056 \u002F 62.886 tok\u002Fs** D3, [`Youssofal\u002FQwen3.6-27B-MTPLX-Optimized-Speed`](https:\u002F\u002Fhuggingface.co\u002FYoussofal\u002FQwen3.6-27B-MTPLX-Optimized-Speed).\u003C\u002Fsub>\n\n[![CI](https:\u002F\u002Fgithub.com\u002Fyoussofal\u002Fmtplx\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fyoussofal\u002Fmtplx\u002Factions\u002Fworkflows\u002Fci.yml)\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fmtplx?label=PyPI)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmtplx\u002F)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.11%2B-blue)](https:\u002F\u002Fwww.python.org\u002F)\n[![macOS Apple Silicon](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FmacOS-Apple%20Silicon-black?logo=apple)](https:\u002F\u002Fdeveloper.apple.com\u002Fmetal\u002F)\n[![Status](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fstatus-v0.3.7-blue)](CHANGELOG.md)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-green)](LICENSE)\n\n\u003C\u002Fdiv>\n\n---\n\n**Permissive open source.** MTPLX is Apache-2.0: you can use it, modify it, and ship it commercially. If you redistribute MTPLX, keep the license and NOTICE attribution; for public projects and research, visible credit or citation is strongly appreciated.\n\nMTPLX runs **the model's own built-in MTP heads** as a speculative drafter, with **exact probability-ratio acceptance + residual correction** — not the greedy-argmax trick most fast-decode tools use at T>0. That means real coding settings (`temperature=0.6`, `top_p=0.95`, `top_k=20`) actually get the speculative speedup *and* keep the target model's distribution.\n\nThis is **not** DFlash, DDTree, llama-spec, or an external-drafter system. 
It's a native-MTP runtime built around MLX, Apple Silicon, and a real OpenAI\u002FAnthropic-compatible serving surface.\n\nInstall:\n\n```bash\nbrew install youssofal\u002Fmtplx\u002Fmtplx\nmtplx start            # interactive: pick model → mode → web\u002FCLI\u002FPi\u002FOpenCode\u002FSwival, then chat\n```\n\nThe Homebrew installer sets up the `mtplx` command in `\u002Fopt\u002Fhomebrew\u002Fbin` and bootstraps the Python runtime under `\u002Fopt\u002Fhomebrew\u002Fvar\u002Fmtplx`. Python users can also run `python3 -m pip install -U mtplx`.\n\nThat's it. The wizard handles the default speed model (`Youssofal\u002FQwen3.6-27B-MTPLX-Optimized-Speed`), runtime mode (Sustained \u002F Sustained Max \u002F Burst), and surface (browser chat at `127.0.0.1:8000\u002F`, terminal chat, Pi, OpenCode, or Swival) on first run. On every subsequent run it asks \"same as last time?\" so you're one keypress from chatting.\n\nFor a measured depth choice on your own Mac, run:\n\n```bash\nmtplx tune --model \u002Fpath\u002Fto\u002Fmodel --retune\n```\n\nTune compares AR, D1, D2, and D3 with thinking disabled, keeps AR as the `1.00x`\nbaseline, and only saves a recommendation when an MTP depth is actually faster.\nFor hardware diagnosis, `mtplx bench tune` prints and saves per-candidate\npower, frequency, temperature, utilization, fan, and thermal-pressure telemetry\nwhen `thermalforge` and macOS `powermetrics` are available. Telemetry is labeled\nas `scope=generation` when samples land inside the actual generation window;\notherwise it is reported as broader candidate-process telemetry.\n\n---\n\n## What you get\n\n- **Native MTP speculative decoding.** Built-in MTP heads, no external drafter, no RAM hit for a second model.\n- **Math-correct sampling at T=0.6.** Probability-ratio acceptance with residual correction. 
Verified `max_diff = 0.0` against reference single-token AR on the verified Qwen3.6-27B path.\n- **~2.24× over no-MTP AR at `temp=0.6`.** This is a paired same-machine multiplier, which the CLI reports as `mean_speedup_vs_ar`. Verified contract on the public default `Youssofal\u002FQwen3.6-27B-MTPLX-Optimized-Speed`: `63.056 \u002F 62.886 tok\u002Fs` MTP-D3 vs `28.156 tok\u002Fs` no-MTP AR, on Apple Silicon M5 Max with `--max` fans, target sampler `temp=0.6 top_p=0.95 top_k=20`, draft sampler `temp=0.70`. Absolute tok\u002Fs scales with memory bandwidth.\n- **Real serving surface.** OpenAI-compatible `\u002Fv1\u002Fchat\u002Fcompletions` + `\u002Fv1\u002Fcompletions` + `\u002Fv1\u002Fmodels`, Anthropic-compatible `\u002Fv1\u002Fmessages` (streaming SSE), `\u002Fhealth`, `\u002Fmetrics`. Plug it into Pi, Open WebUI, OpenClaw, Claude Code, Cline, Continue, or anything that speaks OpenAI.\n- **Agent tool calls.** OpenAI-style `tools` \u002F `tool_choice`, structured `message.tool_calls`, streaming `delta.tool_calls`, and tool-result history are supported so agent clients execute tools instead of printing Qwen tool markup.\n- **In-browser chat UI** with auto-detected model context (256k for Qwen3.6), live tokens-per-second, markdown rendering, code-block copy buttons, a stop button, an MTP on\u002Foff toggle, and a settings sidebar that persists per-machine.\n- **Interactive start wizard.** Pick model, mode, and surface in three numbered prompts. Returning users get \"same as last time?\". No flag-soup required.\n- **Local-folder model picker.** Point the wizard at any parent directory — your `~\u002Fmodels\u002F`, the LM Studio cache, the HuggingFace cache — and it walks the tree, classifies each model into the four-tier compatibility contract, and presents a numbered picker. 
Config-only classification, never mmaps a tensor file, so a single APFS-dataless or partial download in the tree can't crash the picker.\n- **One-line live download progress.** Single rich-rendered line with bar \u002F percent \u002F GB \u002F speed \u002F ETA, streamed at 8 fps. HuggingFace's tqdm bars are suppressed during the download so they don't fight the MTPLX UI for terminal real estate.\n- **Honest mode names that tell you what they do.**\n  - `Sustained` — default long-context native-MTP path (`--profile sustained`) with chunked prefill, final-token logits, request-sized paged KV, and the normal Apple fan controller.\n  - `Sustained Max` — Sustained plus ThermalForge fans pinned at 100% (`--profile sustained --max`) for explicit fan-backed long-context runs.\n  - `Burst` — old max-fan performance-cold lane (`--profile performance-cold --max`), **not recommended**, max 8K context. This is the recorded headline lane (`63.056 \u002F 62.886 tok\u002Fs` MTP vs `28.156 tok\u002Fs` AR on M5 Max), loud by design.\n  - `Stable` — hidden compatibility flag (`--profile stable` \u002F `--profile safe`) for the exact\u002Fstaged long-reply path.\n- **Crash-safe fan control.** When Sustained Max or Burst is on, MTPLX spawns a detached watchdog that restores fans to auto if the parent dies for any reason — including `kill -9` and \"I closed the terminal\". Verified live on hardware.\n- **Idle-aware fan-backed modes.** Server tracks request activity; after 15 minutes of no chat, fans drop to auto, then ramp back up on the next message.\n- **Per-Mac Tune.** `mtplx tune`, `mtplx-tune`, and `mtplx bench tune` compare AR\u002FD1\u002FD2\u002FD3 in isolated subprocesses and persist the winning depth per model, hardware, software, and settings. If no MTP depth beats AR, nothing worse is saved. 
`mtplx bench tune` also records MX Power Gadget-style power, frequency, temperature, utilization, fan, and thermal-pressure telemetry per candidate for chip-level diagnosis, with generation-window scope when available.\n- **Four-tier model compatibility contract.** `mtplx inspect \u003Cmodel>` reports: verified \u002F arch-compatible-unverified \u002F incompatible-architecture \u002F no-MTP. No silent garbage runs.\n- **Lazy imports.** `mtplx --help`, `doctor`, `inspect`, `init`, `setup` work on a fresh venv *without MLX installed*. Generation and serving pull in MLX only when needed.\n- **v0.3.7 default-model, Tune, and Claude Code fixes.** The verified default now resolves cleanly from `start`, `quickstart`, and Tune paths; Tune reports decode speed honestly and surfaces child failures clearly; Claude Code can run real client tools through the Anthropic Messages API.\n\n> **Release honesty.** Burst is the old fan-backed headline lane and is capped in the UI at short contexts only. Sustained is the explicit long-context memory-safety lane; it is not an AR downgrade. Long no-fan decode decay remains a future runtime track; see [Roadmap](#roadmap).\n\n---\n\n## Quick start (full)\n\n```bash\n# 1. Install on macOS\nbrew install youssofal\u002Fmtplx\u002Fmtplx\n\n# 2. Verify the install\nmtplx help\nmtplx doctor --json\n\n# 3. 
Chat (the wizard does everything)\nmtplx start\n```\n\nPower-user shortcuts (any of these skip the wizard):\n\n```bash\nmtplx start --fresh                         # re-run the wizard from scratch\nmtplx start cli                             # terminal chat directly\nmtplx start pi                              # configure Pi and start MTPLX for Pi\nmtplx start opencode                        # configure OpenCode Desktop\nmtplx start swival                          # print the Swival generic-provider handoff\nmtplx tune --model \u002Fpath\u002Fto\u002Fmodel --retune  # compare AR\u002FD1\u002FD2\u002FD3 and save a real winner\nmtplx start cli --no-mtp                    # target-only AR generation\nmtplx start --profile sustained             # long-context native-MTP mode\nmtplx start --max                           # Sustained Max browser chat with fan boost\nmtplx start --model \u002Fpath\u002Fto\u002Fmodel          # use a specific local or HF model\nmtplx pull Youssofal\u002FQwen3.6-27B-MTPLX-Optimized-Speed\nmtplx quickstart --profile sustained --port 8000  # API server only, no chat\n```\n\nIf `mtplx start pi` cannot find the `pi` command, it stops before loading the\nmodel, prints `npm install -g @earendil-works\u002Fpi-coding-agent`, and offers to\ninstall Pi from the wizard path.\n\nOpenAI-compatible smoke test:\n\n```bash\ncurl http:\u002F\u002F127.0.0.1:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H 'Content-Type: application\u002Fjson' \\\n  -d '{\"model\":\"mtplx\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}],\"stream\":true}'\n```\n\nSet `\"generation_mode\":\"ar\"` per request to compare target-only AR against the default native-MTP path without unloading the model.\n\nHomebrew writes a durable launcher so `mtplx` works from a normal new Terminal tab. 
The older script installer remains available if you prefer a user-local venv:\n\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fyoussofal\u002FMTPLX\u002Fmain\u002Fscripts\u002Finstall_macos.sh | bash\n```\n\nFor Python-only installs, PyPI is also available:\n\n```bash\npython3 -m pip install -U mtplx\n```\n\n---\n\n## How it actually works\n\nMost \"fast decode on Apple Silicon\" projects fall into one of three buckets:\n\n| Approach | What they do at T>0 | What MTPLX does |\n|---|---|---|\n| llama.cpp \u002F mlx-lm AR | No speculation, target model only | Speculative with a built-in drafter |\n| DFlash, prefix-match speculation | Greedy-argmax equality (silently breaks at T>0) | Probability-ratio acceptance + residual correction |\n| External-drafter speculation | Loads a second model into RAM | Uses the target's own MTP heads — zero extra RAM |\n\nThe math-correctness wedge is real. At `temperature=0.6`, the difference between \"rejected because the draft argmax disagrees\" and \"rejected via the Leviathan\u002FChen rejection-sampling theorem\" is the difference between a benchmark trick and a runtime your code editor can trust. MTPLX does the latter, including residual correction `(p − q)+` for the cases where the draft was rejected.\n\n**Verified evidence (current public default `Youssofal\u002FQwen3.6-27B-MTPLX-Optimized-Speed`):**\n- **~2.24× over matched no-MTP AR at `temp=0.6`** on Apple Silicon M5 Max: `63.056 \u002F 62.886 tok\u002Fs` MTP-D3 paired runs vs `28.156 tok\u002Fs` no-MTP AR, same machine, same target sampler (`temp=0.6 top_p=0.95 top_k=20`), draft sampler `temp=0.70 top_p=0.95 top_k=20`, performance-cold profile, fans pinned by `--max`, thinking mode off. Recorded in `mtplx_runtime.json` under the model.\n- The multiplier is what the CLI reports as `mean_speedup_vs_ar` for a paired same-machine run. 
The absolute tok\u002Fs figure above is specific to the M5 Max and its 614 GB\u002Fs memory bandwidth; slower memory bandwidth lowers the absolute number, and users should run `mtplx tune` \u002F `mtplx bench tune` to measure their own Mac.\n- Per-position acceptance on the recorded prompt: `[100%, 97.96%, 93.88%]` at D3 (corrections=3 over 49 verify calls).\n- Distribution exactness vs reference single-token AR: `max_diff = 0.0`. Greedy diagnostic on the same cleaned window: `60.108 tok\u002Fs`.\n\n\u003Cimg src=\"docs\u002Fassets\u002Freadme\u002Fflow.svg\" alt=\"MTPLX end-to-end flow\" width=\"100%\" \u002F>\n\nNo second model, no greedy hack, no external drafter, no silent distribution drift.\n\n---\n\n## Modes\n\nPicked by `mtplx start`, or set explicitly via `--profile`. Every mode preserves exactness; the difference is the runtime path and whether MTPLX touches your fans.\n\n| Mode | Profile | Mechanics | Speed lane | Best for |\n|---|---|---|---|---|\n| **Sustained** | `sustained` | Native-MTP long-context path with chunked prefill, final-token logits, request-sized paged KV, normal Apple fan controller | Target \u003C=10-15% TPS cost vs Burst while avoiding the old long-context memory balloon | Large files, long documents, coding contexts, 16K-200K prompts |\n| **Sustained Max** | `sustained` + `--max` | Sustained path plus ThermalForge pinned to 100% | Fan-backed long-context lane | Long-context work where the user explicitly wants maximum cooling |\n| **Burst** | `performance-cold` + `--max` | Old max-fan performance-cold lane | **~2.24× over no-MTP AR** (recorded: 63.056\u002F62.886 vs 28.156 tok\u002Fs on M5 Max) | Short prompts and benchmarks only; **not recommended**, max 8K context |\n| **Stable** | `stable` \u002F `safe` | Exact\u002Fstaged long-reply path, hidden from onboarding | Lower peak speed, steadier shape | Compatibility and conservative long replies |\n\nFan-backed modes require ThermalForge. 
`mtplx max --install` installs it from source into `~\u002F.mtplx\u002Fbin\u002Fthermalforge`, sets up a passwordless sudoers rule scoped to that one binary, and verifies fans actually ramp before declaring success. One sudo prompt, end-to-end. Crash safety covers SIGINT, SIGTERM, SIGHUP, terminal close, and `kill -9` via a detached sidecar process.\n\n---\n\n## Compatibility\n\n```bash\nmtplx inspect \u003Cmodel-path-or-hf-repo> --json\n```\n\n| Tier | Means | Behavior |\n|---|---|---|\n| **Verified** | Has `mtplx_runtime.json` and passed MTPLX gates | Runs |\n| **Arch-compatible, unverified** | Qwen3-Next MTP markers detected, no runtime contract | Refuses unless `--unsafe-force-unverified` |\n| **Incompatible architecture** | MTP exists but not Qwen3-Next | Clear error, roadmap pointer |\n| **No MTP** | No MTP head detected | Clear error, no garbage runs |\n\nv0.1 ships verified Qwen3.6-27B via `Youssofal\u002FQwen3.6-27B-MTPLX-Optimized-Speed`, with public served model id `mtplx-qwen36-27b-optimized-speed`. The compatibility registry already detects DeepSeek V3 \u002F V3.2, GLM-4 MoE \u002F MoE-Lite, MiMo, and MiniMax M2 — unsupported runtime families stay behind explicit compatibility gates rather than silently running.\n\n### Support matrix\n\n| Area | Preview support |\n|---|---|\n| Mac | Apple Silicon only (`arm64`) |\n| macOS | 14.0+; Sequoia is supported |\n| Python | native arm64 Python 3.10+ |\n| MLX | `python3 -m pip install mlx` in the same native environment |\n| Memory | dynamic preflight; warns below 48 GiB, fails when the selected model\u002Fprofile estimate exceeds 80% of unified memory |\n| Storage | first download requires `max(model_size * 2.5, model_size + 20 GiB)` free on the model-cache filesystem |\n| Docker\u002FOpen WebUI | Docker Desktop current plus previous two macOS major releases |\n\nRun `mtplx doctor --summary`, `mtplx doctor --deep --json`, or `mtplx doctor --bundle` before filing a bug. 
Bundles are redacted by default under `~\u002F.mtplx\u002Freports\u002F`.\n\n---\n\n## CLI surface\n\n```bash\nmtplx start                 # interactive setup, then chat\nmtplx help                  # detailed help; `mtplx help \u003Ccommand>` for any\nmtplx doctor                # install + model + integration health\nmtplx inspect \u003Cmodel>       # four-tier compatibility report\nmtplx init                  # write ~\u002F.mtplx\u002Fconfig.toml\nmtplx setup                 # download verified model, prepare cache\nmtplx pull                  # download the default HF model safely\nmtplx models                # cached models, validation, size, delete command\nmtplx tune                  # compare AR\u002FD1\u002FD2\u002FD3; saves only a depth that beats AR\nmtplx-tune                  # same Tune command as a dedicated console script\nmtplx run \"...\"             # one-shot ask\nmtplx chat                  # terminal chat\nmtplx start                 # OpenAI\u002FAnthropic-compatible server\nmtplx connect openwebui     # paste settings for Open WebUI\nmtplx openwebui docker-command\nmtplx bench run --suite cold-long-code-192\nmtplx max --install         # install ThermalForge for Sustained Max \u002F Burst\nmtplx max --status          # fan \u002F thermal state\n```\n\nEvery command has `--json` for machine-readable output and `--help` for context-specific docs.\n\n---\n\n## Architecture\n\nThe architectural achievement is **a single-model native-MTP runtime that's mathematically exact at temperature**, with a real serving surface bolted on. There is no second drafter, no greedy hack, and no \"drop in a fast-decode library\" wrapper. Four layers, drawn the way they actually run.\n\n### 0. MLX runtime layer (the kernel stack we own)\n\nMTPLX is not a thin wrapper over stock MLX. The recorded speed lane was measured with the runtime environment described below plus custom Metal kernels registered as primitives. 
Public wheels do not silently bundle an extra MLX fork; `mtplx doctor --deep --json` reports whether an optional fast MLX fork is active on the user's machine. Treat the fork details here as measured-environment evidence, not an install-time promise that every package manager has replaced MLX under you.\n\nMeasured MLX source-level environment for the public record (optional fork: `mlx-mtplx-0.31.2-qmm`, commit `2377a99f` \"Tune small-M qmv for MTPLX 60TPS path\"):\n\n- **Small-M `qmv` retuning.** The verify forward is dominated by quantized-matrix-vector ops at `M ≈ 3..6` (one position per accepted draft). Stock MLX's `qmv_fast_impl` is tuned for large M and stalls dispatch at small M. Our fork: `BN16` group-size, **4-simdgroup** instead of 2-simdgroup, `unroll_count(4)` on the inner loop. Cuts the verify-MLP region by enough to be the difference between \"MTP loses to AR\" and \"MTP at ~2.24×\".\n- **Source-primitive registration.** Custom kernels (below) are registered through `mlx.core.fast.metal_kernel` and integrated into MLX's graph the same way stock primitives are, so `mx.compile` can fuse around them and `mx.eval` doesn't see them as opaque blocks.\n\nCustom Metal kernels we shipped on top of the fork:\n\n- **`linear-gdn-from-conv-tape`** — the GDN linear-attention path during verify. Records an *innovation tape* of `(token, gate, state-delta)` tuples during the draft phase, then **replays** them deterministically on rollback when a draft is rejected. Replaces stock MLX's `Conv1d` + recurrent-state restore with a single fused kernel that's bit-exact (`max_diff = 0.0` against batched-vs-sequential reference) and shape-stable.\n- **`verify_qmv` (small-M qmv kernel).** Direct successor of dflash-mlx's M=16 idea, retuned for MTPLX's M=3..6 verify shapes. 
Now subsumed by the MLX-source qmv tuning above for the verify hot path; remains as a standalone primitive for diagnostic regressions.\n- **GraphBank.** A cache of `mx.compile`-compiled verify graphs, keyed by `(suffix_length, depth, profile)`. Each verify shape gets one compiled graph reused across cycles — no per-cycle Python dispatch overhead. Capture-commit + GraphBank together hit `capture_commit_time_s ≈ 0.073 ms` per cycle (vs `verify_time_s ≈ 47 ms` per cycle), i.e. the commit step is three orders of magnitude smaller than the verify itself.\n- **Draft-only 4-bit \u002F 3-bit LM head** built in memory by `scripts\u002Fprobe_draft_lm_head_requant.py`. The target's `lm_head` stays at the model's actual precision (BF16 \u002F INT4 affine); the drafter gets a separate, much smaller LM-head requantized for proposal use only. Cuts draft time by ~29% without touching target accuracy.\n\nRuntime knobs that are enabled when `performance-cold` is explicitly selected:\n\n- `MTPLX_LAZY_VERIFY_LOGITS=1` · `MTPLX_BATCH_TARGET_ARRAYS=1` · `MTPLX_LAZY_MTP_HISTORY_APPEND=1` · `MTPLX_DROP_EVENTS=1` · `MTPLX_SKIP_VERIFY_SNAPSHOT=1`.\n\nNumerical hygiene (these are correctness fixes, not speed):\n\n- **`fp32` `p\u002Fq` ratio** during probability-ratio acceptance. The Leviathan–Chen ratio underflows in BF16 at small `q`; fp32 is the only safe path.\n- **`mx.random.split` per draft position** so each acceptance roll uses an independent RNG key. Without this, depth>1 would silently correlate accept decisions.\n\n\u003Cimg src=\"docs\u002Fassets\u002Freadme\u002Fmlx-runtime.svg\" alt=\"MLX runtime layer\" width=\"100%\" \u002F>\n\n### 1. Single-model runtime\n\nThe target model and the drafter are the **same checkpoint**. Qwen3.6-27B ships native MTP heads; MTPLX uses them as the speculative drafter. Zero RAM cost for a second model, zero distillation, zero \"we trained a drafter\" handoff. 
The trunk's KV cache obeys a **committed-history contract** (verified against the vLLM CUDA reference at cosine > 0.9998 through D5) so recursive draft depth holds together — that's what lets D2\u002FD3\u002FD4 acceptance reach the 90s instead of collapsing.\n\n\u003Cimg src=\"docs\u002Fassets\u002Freadme\u002Fsingle-model.svg\" alt=\"Single-model runtime\" width=\"100%\" \u002F>\n\n### 2. Speculative cycle (the hot loop)\n\nPer cycle: the MTP head drafts K tokens, the target verifies all K in parallel via one batched forward, **probability-ratio acceptance** (Leviathan–Chen) decides per-position, **residual correction `(p − q)+`** emits a clean replacement on rejection, and a **bonus token** falls out for free when all K accept. Verify cost is paid by `capture_commit` + the `linear-gdn-from-conv-tape` GDN kernel + a **GraphBank** of compiled verify shapes; the math is exact at any temperature.\n\n\u003Cimg src=\"docs\u002Fassets\u002Freadme\u002Fcycle.svg\" alt=\"Speculative cycle\" width=\"100%\" \u002F>\n\n### 3. Serving stack\n\nThe runtime is wrapped in a real serving surface so you can point Open WebUI \u002F Claude Code \u002F Cline \u002F Continue \u002F `curl` \u002F `openai-python` \u002F `anthropic-python` at it. **Engine sessions** keep per-chat state; the **Session Bank** preserves warm-prefix exact state across turns (verified `logits_max_abs_diff = 0.0` against fresh forwards) so multi-turn TTFT doesn't collapse the way a stateless shim would.\n\n\u003Cimg src=\"docs\u002Fassets\u002Freadme\u002Fserving.svg\" alt=\"Serving stack\" width=\"100%\" \u002F>\n\nThe CLI (`mtplx start` \u002F `pull` \u002F `doctor` \u002F `inspect` \u002F `max`) is the on-ramp to all of the above and not the architectural story — it lazy-imports MLX so `--help`, `doctor`, `inspect`, `init`, `setup` work on a fresh venv with no GPU\u002FApple-Silicon stack installed.\n\n---\n\n## What MTPLX is *not*\n\n- It's not DFlash. 
DFlash uses greedy-argmax prefix matching and breaks the target distribution at T>0. MTPLX implements exact probability-ratio rejection sampling.\n- It's not an external-drafter system. There's no second model. The drafter is the target's own MTP heads.\n- It's not a generic \"speculative decoding library\". It's a runtime + serving stack with an explicit model-compatibility contract.\n- It's not a CUDA project. MTPLX is MLX-native and Apple-Silicon-first. Linux\u002FCUDA is not on the roadmap; for that, use vLLM.\n- It's not finished. v0.2 is an early production release. The **~2.24× multiplier** cold-lane target is met, the long-context fast-prefill path is much stronger, and the sustained-no-fan target is still a future runtime track.\n\n---\n\n## License, citation, and attribution\n\nMTPLX builds on [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx) and the Qwen3-Next model family. The speculative-sampling math follows Leviathan & Chen 2023 (\"Fast Inference from Transformers via Speculative Decoding\") and the MTP heads ship with Qwen. Design and diagnostics are informed by vLLM speculative decoding, vLLM-Metal (issues #188 and #281), DFlash-MLX, DDTree-MLX, and DeepSeek V3.2's `mx.depends` precedent. Optional fan control via [ThermalForge](https:\u002F\u002Fgithub.com\u002FProducerGuy\u002FThermalForge). Model weights and licenses remain governed by their upstream model cards.\n\nMTPLX is released under the [Apache License 2.0](LICENSE): you can use it, modify it, and ship it commercially. 
If you redistribute MTPLX or derivative works, preserve the Apache license and the attribution notices from [NOTICE](NOTICE) as required by Apache-2.0.\n\nIf MTPLX powers a public project, product, benchmark, article, or research result, please include clear credit in your README, docs, paper, or public writeup:\n\n> Powered by MTPLX by Youssof Altoukhi\n>\n> https:\u002F\u002Fgithub.com\u002Fyoussofal\u002Fmtplx\n\nFor academic or technical writing, cite the repository using [CITATION.cff](CITATION.cff).\n\nBuilt by [Youssof Altoukhi](https:\u002F\u002Fgithub.com\u002Fyoussofal). Contributions, bug reports, and benchmark replications welcome via [Issues](https:\u002F\u002Fgithub.com\u002Fyoussofal\u002Fmtplx\u002Fissues).\n","MTPLX is a native MTP speculative decoding engine built for Apple Silicon devices that significantly increases decode throughput for the Qwen 3.6 27B model at temperature 0.6. Its core features include using the model's built-in MTP heads as the speculative drafter, with exact probability-ratio acceptance and residual correction, which raises decode speed while preserving the target model's distribution. Technically, MTPLX is built on MLX, needs no external drafter, and supports OpenAI\u002FAnthropic-compatible serving interfaces. It is well suited to scenarios that need to run large language models efficiently on a Mac, such as local AI inference and development testing.",2,"2026-06-11 02:31:01","CREATED_QUERY"]