[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-867":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},867,"dflash-mlx","bstnxbt\u002Fdflash-mlx","bstnxbt","Lossless DFlash speculative decoding for MLX on Apple Silicon",null,"Python",728,54,4,10,0,5,13,63,15,9.22,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:00:19","\u003Cp align=\"center\">\n  \u003Ch1 align=\"center\">dflash-mlx\u003C\u002Fh1>\n  \u003Cp align=\"center\">DFlash speculative decoding for Apple Silicon (MLX)\u003C\u002Fp>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fplatform-Apple%20Silicon-black?logo=apple\" alt=\"Apple Silicon\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue?logo=python\" alt=\"Python 3.10+\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-blue\" alt=\"License\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMLX-stock-red\" alt=\"Stock MLX\">\n\u003C\u002Fp>\n\nPaper: [DFlash: Block Diffusion for Flash Speculative Decoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06036) (Chen et al., 2026)\n\nBlock-diffusion draft generates 16 tokens in one pass. Target verifies in one pass. 
Output is lossless — every emitted token is verified against the target model before it is committed.\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa9be2b48-3264-4970-b836-c876b0b7fdda\n\n## How it works\n\n- A small draft model (~1B params) generates 16 tokens in parallel with block diffusion.\n- The target model verifies those 16 tokens in a single forward pass.\n- Greedy acceptance keeps the correct prefix and rejects the rest.\n- Lossless: every emitted token is the target model's greedy argmax at verification time. Output can still differ from pure AR because of MLX dispatch divergence, but no unverified token is ever emitted.\n- Built on stock MLX with a small number of targeted Metal kernels where rollback and long-context verify need tighter numerical control.\n\n## Technical details\n\n- **Tape-replay rollback** — instead of snapshotting and restoring the full GatedDeltaNet state, dflash-mlx records an innovation tape during verify and replays only the accepted steps through a custom Metal kernel. Keeps rollback cost low and preserves acceptance over long generations.\n- **JIT SDPA 2-pass** — long-context verify (`N >= 1024`) uses a custom Metal attention kernel that stays numerically aligned with stock MLX attention.\n- **Verify-specialized int4 qmm** (`verify_qmm`) — custom Metal simdgroup-MMA kernel for the M=16 quantized matmul that dominates the target verify step. Two shape-adaptive variants (`mma2big`, `mma2big_pipe` with K-split + double-buffered staging). Auto-enabled on MoE targets and dense models with ≥40 layers.\n- **Numerical coherence** — bf16-sensitive paths, including recurrent state replay and small projections, are stabilized across speculative cycles so accepted tokens stay consistent.\n- **Prefix cache (L1+L2)** — RAM snapshots of target KV + GDN recurrent state + captured hidden + last logits, with optional SSD spill, byte\u002Fentry budgets, and automatic eviction. Hits skip prefill on revisited prompts. 
This hot\u002Fcold cache hierarchy is inspired by [oMLX](https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx)'s tiered KV cache work, but dflash-mlx stores DFlash prefix snapshots rather than active paged-KV blocks.\n\n## Benchmarks\n\nApple M5 Max, 64 GB unified memory, MLX 0.31.1. Protocol: stock `mlx_lm.stream_generate` baseline vs DFlash, sequential, 3 repeats, median, 60s cooldown. Generation prompt: `\"The function $f$ satisfies the functional equation \\[ f(x) + f(y) = f(x + y) - xy - 1 \\] for all real numbers $x$ and $y$. If $f(1) = 1$, then find all integers $n$ such that $f(n) = n$. Enter all such integers, separated by commas. Please reason step by step, and put your final answer within \\boxed{}.\"`\n\n| Model | Tokens | Baseline | DFlash | Speedup | Acceptance |\n|-------|--------|----------|--------|---------|------------|\n| Qwen3.5-4B | 1024 | 53.80 tok\u002Fs | 182.87 tok\u002Fs | 3.40x | 86.43% |\n| Qwen3.5-4B | 2048 | 53.90 tok\u002Fs | 188.70 tok\u002Fs | 3.49x | 87.70% |\n| Qwen3.5-4B | 4096 | 53.49 tok\u002Fs | 195.84 tok\u002Fs | 3.66x | 88.35% |\n| Qwen3.5-4B | 8192 | 53.28 tok\u002Fs | 160.51 tok\u002Fs | 3.02x | 87.30% |\n| Qwen3.5-9B | 1024 | 30.95 tok\u002Fs | 135.34 tok\u002Fs | 4.37x | 89.55% |\n| Qwen3.5-9B | 2048 | 30.70 tok\u002Fs | 113.00 tok\u002Fs | 3.65x | 89.16% |\n| Qwen3.5-9B | 4096 | 30.56 tok\u002Fs | 94.59 tok\u002Fs | 3.06x | 88.31% |\n| Qwen3.5-9B | 8192 | 29.43 tok\u002Fs | 66.94 tok\u002Fs | 2.22x | 86.67% |\n| Qwen3.5-27B-4bit | 1024 | 33.55 tok\u002Fs | 79.02 tok\u002Fs | 2.37x | 90.04% |\n| Qwen3.5-27B-4bit | 2048 | 33.10 tok\u002Fs | 70.21 tok\u002Fs | 2.12x | 89.60% |\n| Qwen3.5-27B-4bit | 4096 | 31.47 tok\u002Fs | 55.68 tok\u002Fs | 1.77x | 88.38% |\n| Qwen3.5-27B-4bit | 8192 | 33.88 tok\u002Fs | 45.29 tok\u002Fs | 1.34x | 85.97% |\n| Qwen3.5-35B-A3B-4bit | 1024 | 143.03 tok\u002Fs | 248.85 tok\u002Fs | 1.76x | 89.26% |\n| Qwen3.5-35B-A3B-4bit | 2048 | 141.43 tok\u002Fs | 255.01 tok\u002Fs | 1.81x | 89.75% |\n| 
Qwen3.5-35B-A3B-4bit | 4096 | 141.49 tok\u002Fs | 216.47 tok\u002Fs | 1.53x | 88.50% |\n| Qwen3.5-35B-A3B-4bit | 8192 | 138.59 tok\u002Fs | 170.39 tok\u002Fs | 1.22x | 86.41% |\n| Qwen3.6-35B-A3B-4bit | 1024 | 138.26 tok\u002Fs | 300.33 tok\u002Fs | 2.20x | 91.02% |\n| Qwen3.6-35B-A3B-4bit | 2048 | 139.03 tok\u002Fs | 252.93 tok\u002Fs | 1.82x | 89.60% |\n| Qwen3.6-35B-A3B-4bit | 4096 | 134.50 tok\u002Fs | 208.40 tok\u002Fs | 1.56x | 88.43% |\n| Qwen3.6-35B-A3B-4bit | 8192 | 133.20 tok\u002Fs | 177.45 tok\u002Fs | 1.33x | 87.01% |\n\nPer-run JSON: [`benchmark\u002Fresults\u002F`](benchmark\u002Fresults\u002F). Reproduce on your hardware with `dflash benchmark`.\n\n## Install\n\n```bash\npip install dflash-mlx\n```\n\nOptional benchmark dataset support:\n\n```bash\npip install \"dflash-mlx[bench]\"\n```\n\n## Quick start\n\n```bash\nPROMPT='The function $f$ satisfies the functional equation \\[ f(x) + f(y) = f(x + y) - xy - 1 \\] for all real numbers $x$ and $y$. If $f(1) = 1$, then find all integers $n$ such that $f(n) = n$. Enter all such integers, separated by commas. 
Please reason step by step, and put your final answer within \\boxed{}.'\n\n# One-shot generation, draft auto-resolved\ndflash generate --model Qwen\u002FQwen3.5-9B --prompt \"$PROMPT\"\n\n# Server (OpenAI-compatible)\ndflash serve \\\n  --model mlx-community\u002FQwen3.6-27B-4bit \\\n  --draft z-lab\u002FQwen3.6-27B-DFlash \\\n  --port 8000\n\n# Canonical local benchmark\ndflash benchmark \\\n  --model Qwen\u002FQwen3.5-9B \\\n  --prompt \"$PROMPT\" \\\n  --max-tokens 1024 \\\n  --repeat 3 \\\n  --cooldown 60 \\\n  --no-eos\n```\n\nSend a request:\n\n```bash\ncurl http:\u002F\u002F127.0.0.1:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d \"{\n    \\\"model\\\": \\\"mlx-community\u002FQwen3.6-27B-4bit\\\",\n    \\\"messages\\\": [{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"$PROMPT\\\"}],\n    \\\"max_tokens\\\": 1024,\n    \\\"stream\\\": true\n  }\"\n```\n\nCompatible with OpenCode, aider, Continue, Open WebUI, and any OpenAI-compatible client. Tool calls, streaming, and chat templates all flow through. Short responses may take the target-only fast path; pass `--fastpath-max-tokens 0` to force DFlash on every request.\n\nInspect live server metrics:\n\n```bash\ncurl http:\u002F\u002F127.0.0.1:8000\u002Fmetrics\n```\n\n`prefill_tok_s_physical` counts only tokens actually computed after prefix-cache\nrestore. `prefill_tok_s_apparent` uses the full logical prompt length over the\nsame user-visible prefill wall time. `current_request` shows an in-flight\nprefill\u002Fdecode, `recent_requests` keeps the last 32 completed requests, and\n`rss_gb` reports process resident memory. 
`wired_gb` stays `null` unless a true\nper-process wired-memory source is available.\nThe endpoint is for live debugging and benchmark visibility; it does not create\nbenchmark artifacts.\n\nEnable Qwen reasoning mode when needed:\n\n```bash\ndflash serve --model mlx-community\u002FQwen3.6-27B-4bit --enable-thinking\n```\n\n## Tested models\n\nOptimized for Qwen3.5 \u002F Qwen3.6 hybrid GatedDeltaNet + attention targets. Qwen3\n(pure attention) targets work but skip the tape-replay rollback path. Gemma4\ntargets use the Gemma4 adapter; prefix snapshots stay disabled for Gemma4 until\nsnapshot parity is proven.\n\n| Target | Draft |\n|--------|-------|\n| [Qwen\u002FQwen3.5-4B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3.5-4B) | [z-lab\u002FQwen3.5-4B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-4B-DFlash) |\n| [Qwen\u002FQwen3.5-9B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3.5-9B) | [z-lab\u002FQwen3.5-9B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-9B-DFlash) |\n| [mlx-community\u002FQwen3.5-27B-4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-27B-4bit) | [z-lab\u002FQwen3.5-27B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-27B-DFlash) |\n| [mlx-community\u002FQwen3.5-35B-A3B-4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.5-35B-A3B-4bit) | [z-lab\u002FQwen3.5-35B-A3B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-35B-A3B-DFlash) |\n| [mlx-community\u002FQwen3.6-27B-4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.6-27B-4bit) | [z-lab\u002FQwen3.6-27B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.6-27B-DFlash) |\n| [mlx-community\u002FQwen3.6-35B-A3B-4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002FQwen3.6-35B-A3B-4bit) | [z-lab\u002FQwen3.6-35B-A3B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.6-35B-A3B-DFlash) |\n| 
[Qwen\u002FQwen3-4B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-4B) | [z-lab\u002FQwen3-4B-DFlash-b16](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3-4B-DFlash-b16) |\n| [Qwen\u002FQwen3-8B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-8B) | [z-lab\u002FQwen3-8B-DFlash-b16](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3-8B-DFlash-b16) |\n| [mlx-community\u002Fgemma-4-31b-it-4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002Fgemma-4-31b-it-4bit) | [z-lab\u002Fgemma-4-31B-it-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002Fgemma-4-31B-it-DFlash) |\n| [mlx-community\u002Fgemma-4-26b-a4b-it-4bit](https:\u002F\u002Fhuggingface.co\u002Fmlx-community\u002Fgemma-4-26b-a4b-it-4bit) | [z-lab\u002Fgemma-4-26B-A4B-it-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002Fgemma-4-26B-A4B-it-DFlash) |\n\n```bash\ndflash models\n```\n\nModels without a matching DFlash draft are rejected. Pass `--draft` explicitly to override the registry.\n\n## CLI\n\n```\ndflash serve      # OpenAI-compatible server\ndflash generate   # one-shot local generation\ndflash benchmark  # baseline-vs-DFlash runtime benchmark\ndflash doctor     # environment and config checks\ndflash profiles   # list runtime presets\ndflash models     # list supported target\u002Fdraft pairs\n```\n\n## Profiles\n\nReadable defaults. 
Explicit CLI flags override them.\n\n| Profile | Prefill | Prefix cache | L1 budget | L2 | Intent |\n|---|---:|---|---|---|---|\n| `balanced` | 4096 | on | 4 \u002F 8 GiB | off | default coding sessions |\n| `fast` | 8192 | on | 4 \u002F 16 GiB | off | throughput first |\n| `low-memory` | 1024 | on | 2 \u002F 2 GiB | off | lower memory pressure |\n| `long-session` | 4096 | on | 8 \u002F 8 GiB | on \u002F 50 GiB | prefix revisits |\n\n```bash\ndflash profiles\ndflash serve --profile fast --model Qwen\u002FQwen3.5-9B\ndflash serve --profile long-session --model mlx-community\u002FQwen3.6-27B-4bit \\\n  --prefix-cache-l2-dir .artifacts\u002Fdflash\u002Fl2\n```\n\n## Common server controls\n\n```bash\n# Force DFlash even for short responses\ndflash serve --model Qwen\u002FQwen3.5-9B --fastpath-max-tokens 0\n\n# Tune prefill batching\ndflash serve --model Qwen\u002FQwen3.5-9B --prefill-step-size 8192\n\n# Diagnostics\ndflash serve --model Qwen\u002FQwen3.5-9B --diagnostics basic   # request + cache events\ndflash serve --model Qwen\u002FQwen3.5-9B --diagnostics full    # + memory waterfall + cycle timings\n\n# Bound L1 prefix snapshots\ndflash serve --model Qwen\u002FQwen3.5-9B \\\n  --prefix-cache-max-entries 2 \\\n  --prefix-cache-max-bytes 2GB\n\n# Enable SSD L2 spill\ndflash serve --model Qwen\u002FQwen3.5-9B \\\n  --prefix-cache-l2 \\\n  --prefix-cache-l2-dir .artifacts\u002Fdflash\u002Fl2 \\\n  --prefix-cache-l2-max-bytes 50GB\n```\n\nDiagnostics artifacts land in `.artifacts\u002Fdflash\u002Fdiagnostics\u002F\u003Ctimestamp>-serve-\u003Cmode>\u002F`. `basic` writes request and cache events; `full` adds the memory waterfall and per-cycle timings. 
Use `full` for diagnosis, not for throughput claims.\n\n## Features\n\n- **Auto draft resolution** — no manual `--draft` flag needed for registered targets\n- **Streaming** — token-by-token output (CLI + SSE)\n- **Chat templates** — enabled by default\n- **Recurrent rollback** — `RecurrentRollbackCache` keeps GatedDeltaNet state coherent across speculative verify and rollback\n- **Verify-specialized int4 qmm** — custom M=16 Metal kernel auto-enabled on MoE and dense ≥40-layer targets; falls back to stock `mx.quantized_matmul` everywhere else\n- **Prefix cache L1+L2** — RAM snapshots with optional SSD spill, budget-based eviction, and hybrid-architecture support\n- **Diagnostics** — opt-in structured artifacts under `.artifacts\u002Fdflash\u002Fdiagnostics\u002F`\n\n## Roadmap\n\n- **Adaptive block size** — vary draft block length per cycle based on observed acceptance regime instead of a fixed 16\n- **More architecture backends** — add new target families only with\n  family-specific cache layout, attention masks, logits post-processing, hidden\n  capture, rollback\u002Ftrim behavior, and parity tests.\n- **Kernel work where it matters** — optimize family-specific hot paths only\n  after the backend contract and parity tests are stable.\n- **Tool-call regime auto-fallback** — switch to target-only AR when speculative surplus goes negative on structured outputs\n- **Sustained acceptance at long context** — draft KV cache window scaling and long-context verify optimization\n\n## Citation\n\n```bibtex\n@misc{chen2026dflash,\n  title={DFlash: Block Diffusion for Flash Speculative Decoding},\n  author={Jian Chen and Yesheng Liang and Zhijian Liu},\n  year={2026},\n  eprint={2602.06036},\n  archivePrefix={arXiv},\n  primaryClass={cs.CL},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06036}\n}\n```\n\n## License\n\nApache-2.0\n","dflash-mlx is a lossless DFlash speculative decoding project for Apple Silicon. It uses a small draft model (about 1B parameters) to generate 16 tokens in parallel, which the target model then verifies in a single forward pass, so every emitted token 
is verified by the target model. The project is built on Python 3.10+ and uses custom Metal kernels to optimize key steps such as rollback, long-context verification, and quantized matrix multiplication, enabling efficient decoding. dflash-mlx suits applications that need high-performance text generation, and it can significantly improve efficiency for large language model inference on Apple Silicon devices.",2,"2026-06-11 02:39:54","CREATED_QUERY"]