[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1107":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":14,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":28,"discoverSource":29},1107,"dflash-mlx","Aryagm\u002Fdflash-mlx","Aryagm","Exact speculative decoding on Apple Silicon, powered by MLX.",null,"Python",376,35,5,3,0,2,7,6,4.67,"MIT License",false,"main",true,[],"2026-06-12 02:00:23","# dflash-mlx\n\n**DFlash implementation for Apple Silicon, using MLX.**\n\n![Benchmarks](assets\u002Fbenchmark-chart.png)\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fe7a78bca-1a62-42eb-ba75-da32b3b3ad40\n\n## Quick start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Faryagm\u002Fdflash-mlx.git && cd dflash-mlx\nuv sync\n\nuv run dflash-mlx --max-new-tokens 128\n```\n\nDefaults to Qwen3-4B BF16. First run downloads the target and draft checkpoints into the Hugging Face cache, which is roughly 12 GB for the default pair. Pass `--target-model` and `--draft-model` to override. `dflash-mlx-chat` for interactive chat, `--json` for machine-readable output. 
Benchmark history is opt-in with `--history` or `--history-file`.\n\n```python\nfrom dflash_mlx import DFlashGenerator\n\n# First run also downloads the default Qwen3-4B target and DFlash draft weights.\nrunner = DFlashGenerator()\nresult = runner.generate(\"Write a quicksort in Python.\", max_new_tokens=128)\nprint(result.text)\n\nfor event in runner.stream(\"Write a quicksort in Python.\", max_new_tokens=128):\n    if not event.finished:\n        print(event.delta, end=\"\", flush=True)\n```\n\nUse `uv run dflash-mlx --stream` or `uv run dflash-mlx-chat --stream` to print\nverified text as it is committed.\n\n## OpenAI-compatible local server\n\nA minimal text-only OpenAI-compatible HTTP server is included for local integrations:\n\n```bash\ndflash-mlx-openai-server \\\n  --host 127.0.0.1 \\\n  --port 8098 \\\n  --model-id qwen35-27b-dflash \\\n  --target-model \u002Fpath\u002Fto\u002Ftarget \\\n  --draft-model \u002Fpath\u002Fto\u002Fdraft\n```\n\nEndpoints:\n- `GET \u002Fhealth`\n- `GET \u002Fv1\u002Fmodels`\n- `POST \u002Fv1\u002Fchat\u002Fcompletions`\n\nCurrent limitations:\n- text-only message content\n- no image input\n\n`POST \u002Fv1\u002Fchat\u002Fcompletions` supports both full responses and streaming SSE chunks\nwith `\"stream\": true`.\n\n## Supported models\n\n| Target | Draft |\n|---|---|\n| `mlx-community\u002FQwen3-4B-bf16` (default) | `z-lab\u002FQwen3-4B-DFlash-b16` |\n| `mlx-community\u002FQwen3.5-4B-MLX-bf16` | `z-lab\u002FQwen3.5-4B-DFlash` |\n\nQwen3.5 support is functional but incomplete. It is not as fast as the Qwen3 path today because Qwen3.5 uses a more complicated hybrid attention stack with recurrent linear-attention state, so exact partial-block acceptance needs custom cache rollback and currently has weaker long-generation acceptance.\n\nUpstream DFlash has checkpoints for Llama 3.1, Qwen3 Coder, Kimi-K2.5, GPT-OSS, and more in the [Hugging Face collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fz-lab\u002Fdflash). 
Adding a new family starts with an adapter in `dflash_mlx\u002Fadapters.py` &mdash; see [ADDING_MODELS.md](ADDING_MODELS.md).\n\n## Benchmarks\n\nFull run details, acceptance stats, and quantized comparisons:\n- [benchmarks\u002Fqwen3-results.md](benchmarks\u002Fqwen3-results.md) &mdash; headline Qwen3 results\n- [benchmarks\u002Fqwen35-results.md](benchmarks\u002Fqwen35-results.md) &mdash; archived Qwen3.5 runs\n\n## How it works\n\n[DFlash](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06036) trains a small block-diffusion model to propose multiple tokens at once. The target verifies them in a single forward pass and accepts the longest correct prefix &mdash; identical output, fewer forward passes, higher throughput.\n\nThe original DFlash targets CUDA. `dflash-mlx` is a native MLX port for Apple Silicon. MLX has no speculative-decoding primitives, so every piece of the draft\u002Fverify loop had to be built from scratch on top of Metal:\n\n- **Hidden-state extraction from the target.** DFlash's drafter conditions on intermediate layer activations, not just logits. We hook into specific Qwen layers and surface those tensors without breaking the standard forward path or the KV cache, so a single target pass gives us both verification logits and the hidden states the next draft block needs.\n\n- **Parallel block proposal.** The draft model runs a block-diffusion denoising loop to propose several tokens at once. This runs entirely on the GPU with its own cache, sharing tokenization and positional context with the target.\n\n- **Single-pass batched verification.** Every proposed block is verified in one target forward pass. 
The target's logits are compared greedily against the draft's samples; we accept the longest matching prefix plus one bonus correction token, which is what makes the output bit-for-bit identical to plain target decoding.\n\n- **Per-layer KV cache rollback on rejection.** When the target rejects the tail of a proposal, the KV cache has to be rewound to the exact accepted length &mdash; per layer, because Qwen3.5 mixes full attention, sliding-window attention, and recurrent linear-attention state, and each has its own cache shape and rollback rule. Plain MLX caches don't expose this; we extend them.\n\n- **Pluggable adapters.** Target-specific concerns (layer ids to tap, cache types, stop tokens, chat template) are isolated in `dflash_mlx\u002Fadapters.py`. The core draft\u002Fverify loop is architecture-agnostic, so adding a new family is one adapter file rather than a rewrite.\n\n- **Warm-path throughput engineering.** MLX kernel compilation, lazy evaluation, and graph caching all affect the numbers. The bench CLI separates warmup from measurement and pins evaluation points so the reported tok\u002Fs reflects steady-state Metal performance, not first-run overhead.\n\n## Citation\n\n```bibtex\n@article{chen2026dflash,\n  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},\n  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},\n  journal = {arXiv preprint arXiv:2602.06036},\n  year    = {2026}\n}\n```\n\n## License\n\nMIT\n","dflash-mlx is an exact speculative decoding project for Apple Silicon, built on MLX. A small block-diffusion model is trained to propose multiple tokens at once, and the target model verifies them, which speeds up text generation. The project is written in Python and supports models such as Qwen3-4B and Qwen3.5-4B; by default it downloads the required weights into the Hugging Face cache. It also provides a local HTTP server compatible with the OpenAI API, which makes it easy to integrate into applications that need efficient text generation, such as chatbot development and automated content creation. dflash-mlx suits workloads that need high generation speed and deal mainly with text-only tasks.","2026-06-11 02:41:40","CREATED_QUERY"]