[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2042":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":16,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":29,"discoverSource":30},2042,"ddtree-mlx","humanrouter\u002Fddtree-mlx","humanrouter","Tree-based speculative decoding for Apple Silicon (MLX). ~10-15% faster than DFlash on code, ~1.5x over autoregressive. First MLX port with custom Metal kernels for hybrid model support.",null,"Python",139,11,2,0,1,3,3.24,false,"main",[21,22,23,24,25],"apple-silicon","inference","llm","mlx","speculative-decoding","2026-06-12 02:00:36","# DDTree-MLX\n\n**Tree-based speculative decoding for Apple Silicon.** ~10-15% faster than DFlash, ~1.5x faster than autoregressive on Qwen 3.5 27B.\n\nDDTree extends [DFlash](https:\u002F\u002Fgithub.com\u002Fbstnxbt\u002Fdflash-mlx) speculative decoding by building a **draft tree** from per-position logits and verifying the entire tree in one forward pass. Instead of betting on a single draft sequence, DDTree explores multiple likely continuations simultaneously, accepting more tokens per verification cycle.\n\nBased on the paper [*Accelerating Speculative Decoding with Block Diffusion Draft Trees*](https:\u002F\u002Fliranringel.github.io\u002Fddtree\u002FDDTree.pdf) by Liran Ringel & Yaniv Romano. 
This is the first MLX port for Apple Silicon, with custom Metal kernels for hybrid model support.\n\n## Performance\n\nMeasured on Mac Studio M3 Ultra 256GB, Qwen 3.5 27B 4-bit, code generation prompt at 8K max tokens:\n\n| Method | tok\u002Fs | vs Autoregressive | Acceptance |\n|--------|------:|------------------:|-----------:|\n| Autoregressive | 27.9 | 1.0x | — |\n| DFlash | 38.6 | **1.38x** | 85% |\n| **DFlash + DDTree** | **42.3** | **1.52x** | 4.2\u002Fcycle |\n\nDDTree adds **~10-15% on top of DFlash** for code and structured content where draft acceptance is high. Output is lossless -- every token is verified against the target model.\n\n### When DDTree Helps (and When It Doesn't)\n\n| Content Type | DFlash Acceptance | DDTree Benefit |\n|-------------|------------------:|:---------------|\n| Code generation | 85%+ | **+10-15%** over DFlash — tree catches rejected tokens with backup branches |\n| Structured\u002Ffactual | 70-80% | **+10-15%** — moderate acceptance leaves room for tree alternatives |\n| Creative prose | 5-10% | **~0%** — low acceptance means most tree branches are wrong too; DDTree roughly equals autoregressive |\n\nDDTree's advantage depends entirely on draft model acceptance. When the draft model predicts well (code, structured output), the tree's backup branches catch occasional misses. When the draft model struggles (creative writing, open-ended prose), tree branches are just as wrong as the primary guess, and the tree overhead eats any gain.\n\n## How It Works\n\n1. **Draft**: The DFlash block diffusion model generates per-position token probabilities in parallel\n2. **Tree Build**: A heap-based algorithm constructs an optimal draft tree from the top-K tokens at each position, maximizing coverage under a node budget\n3. **Tree Verify**: All tree nodes are verified through the target model in one forward pass using tree attention masks (ancestor-only visibility) and per-token RoPE positions\n4. 
**Tree Walk**: Greedy walk through the verified tree to find the longest accepted path\n5. **Commit**: The accepted path's cache state is installed directly via per-node state capture (zero-cost commit)\n\n### Hybrid Model Support\n\nQwen 3.5 27B is a hybrid architecture with 48 GatedDeltaNet (recurrent) layers and 16 full attention layers. DDTree handles this with:\n\n- **Attention layers**: Process all tree nodes in parallel via custom tree attention masks\n- **Recurrent layers**: A custom Metal kernel performs parent-indexed GatedDelta recurrence, forking state at each branch point so every tree path gets exact logits\n- **Tree-aware commit**: Accepted path's recurrent state is installed directly from captured per-node states, eliminating the need for re-forward passes\n\n## Installation\n\nRequires Python 3.11+ and Apple Silicon (M1\u002FM2\u002FM3\u002FM4).\n\n```bash\n# Install dflash-mlx (required dependency)\npip install dflash-mlx\n\n# Clone and install ddtree-mlx\ngit clone https:\u002F\u002Fgithub.com\u002Fhumanrouter\u002Fddtree-mlx.git\ncd ddtree-mlx\npip install -e .\n```\n\nThe target model and DFlash drafter will be downloaded automatically on first run from Hugging Face:\n- Target: `mlx-community\u002FQwen3.5-27B-4bit` (~16GB)\n- Drafter: `z-lab\u002FQwen3.5-27B-DFlash` (~3GB)\n- Total memory: ~19GB\n\n## Usage\n\n### OpenAI-Compatible Server\n\n```bash\npython ddtree_server.py --port 8006\n```\n\nThen use any OpenAI-compatible client:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:8006\u002Fv1\", api_key=\"unused\")\nresponse = client.chat.completions.create(\n    model=\"ddtree\",\n    messages=[{\"role\": \"user\", \"content\": \"Explain TCP vs UDP\"}],\n    max_tokens=2048,\n)\nprint(response.choices[0].message.content)\n```\n\n### Python API\n\n```python\nfrom dflash_mlx.generate import load_runtime_components, get_stop_token_ids\nfrom ddtree_mlx.runtime import generate_ddtree_once\n\n# Load 
models (downloads from HF on first run)\ntarget_model, tokenizer, draft_model, _ = load_runtime_components(\n    model_ref=\"mlx-community\u002FQwen3.5-27B-4bit\"\n)\n\n# Tokenize\nprompt_tokens = list(tokenizer.apply_chat_template(\n    [{\"role\": \"user\", \"content\": \"Write a Python quicksort\"}],\n    tokenize=True, add_generation_prompt=True, enable_thinking=False,\n))\n\n# Generate\nresult = generate_ddtree_once(\n    target_model=target_model,\n    draft_model=draft_model,\n    tokenizer=tokenizer,\n    prompt_tokens=prompt_tokens,\n    max_new_tokens=2048,\n    tree_budget=4,\n    stop_token_ids=get_stop_token_ids(tokenizer),\n)\n\nprint(tokenizer.decode(result[\"generated_token_ids\"]))\nprint(f\"{result['tokens_per_second']:.1f} tok\u002Fs, \"\n      f\"{result['avg_acceptance']:.1f} tokens\u002Fcycle, \"\n      f\"{result['fast_path_ratio']:.0%} fast path\")\n```\n\n### Benchmarking\n\n```bash\npython benchmark.py --max-tokens 2048 --budgets 4 --prompts 3\n```\n\n## Configuration\n\n| Environment Variable | Default | Description |\n|---------------------|---------|-------------|\n| `DDTREE_BUDGET` | `4` | Tree node budget (excluding root). Budget 4 is optimal for hybrid models. |\n| `DDTREE_EXACT_COMMIT` | `0` | Re-forward accepted tokens sequentially (slow but guaranteed lossless). Usually unnecessary — the kernel precision fix makes tree commit match sequential. |\n| `DDTREE_TREE_AWARE_LINEAR` | `1` | Enable parent-state forking for recurrent layers (recommended). |\n| `DDTREE_TREE_KERNEL` | `1` | Use custom Metal kernel for tree-aware GatedDelta recurrence. |\n| `DDTREE_TREE_CONV_KERNEL` | `1` | Use Metal kernel for parent-aware causal conv inside GatedDelta verification. |\n| `DDTREE_EXACT_TREE_ATTENTION` | `0` | Opt-in exact prefix\u002Ftree attention without a prefix-width mask. Set to `auto` for long-context testing. |\n| `DDTREE_EXACT_TREE_ATTENTION_MIN_PREFIX` | `8192` | Prefix length where exact split attention turns on in `auto` mode. 
|\n| `DDTREE_DFLASH_CONTROLLER` | `0` | Opt-in in-place controller that can switch future cycles to DFlash after sustained probe wins. |\n| `DDTREE_PROFILE_VERIFY` | `0` | Profile linear vs attention layer timing within tree verify. Use `detail` for per-operation timings. |\n| `DDTREE_PROFILE_DETAIL` | `0` | Enable detailed synchronized verify timings when `DDTREE_PROFILE_VERIFY` is set. |\n\n## Architecture\n\n```\nddtree_mlx\u002F\n  tree.py       # Heap-based tree construction (Algorithm 1 from the paper)\n  compile.py    # Converts tree structure to MLX tensors (masks, positions, DFS order)\n  verify.py     # Custom forward pass: tree attention + parent-indexed recurrence\n  kernels.py    # Metal kernels for tree-aware conv and GatedDelta state update\n  cache.py      # Cache management: snapshot, rollback, tree-aware path commit\n  runtime.py    # Main generate loop: draft -> build -> verify -> walk -> commit\nddtree_server.py  # OpenAI-compatible FastAPI server\nbenchmark.py      # Benchmark script (DDTree vs DFlash comparison)\n```\n\n## Quantization & Model Compatibility\n\nDDTree works at the architecture level (tree attention masks, per-token RoPE, parent-indexed recurrence), so it applies across **any quantization** of the same model family. For Qwen 3.5 27B on M3 Ultra:\n\n| Quantization | AR tok\u002Fs | Memory | DDTree estimated | vs AR |\n|-------------|----------:|-------:|-----------------:|------:|\n| 4-bit | 55 | ~16GB | ~73-95 tok\u002Fs | ~1.9-2.6x |\n| 6-bit | 56 | ~22GB | ~74-97 tok\u002Fs | ~1.9-2.6x |\n| mxfp8 | 57 | ~30GB | ~75-99 tok\u002Fs | ~1.9-2.6x |\n| bf16 | 57 | ~54GB | ~75-99 tok\u002Fs | ~1.9-2.6x |\n\nAR speeds are nearly identical across quantizations on M3 Ultra (memory bandwidth bottlenecked). Since DDTree's speedup is a multiplier on top of the base speed, the same ~2-2.6x ratio applies to all of them. 
The practical sweet spot is **4-bit** — same speed as bf16 but 3.4x less memory.\n\n### The Draft Model is Key\n\nDDTree's performance depends entirely on having a good **DFlash draft model** for the target model. The draft model (`z-lab\u002FQwen3.5-27B-DFlash`) is a small block diffusion model (~3GB) specifically trained to predict what Qwen 3.5 27B will say next. The better the draft model predicts, the higher the acceptance rate, and the bigger DDTree's speedup.\n\nCurrently, DFlash drafters exist for the **Qwen 3.5 family**. As more DFlash drafters get trained for other model families (Llama, Mistral, etc.), DDTree automatically extends to them — no code changes needed. The tree construction, verification, and commit logic are model-agnostic; only the draft model needs to match the target.\n\nIf no DFlash drafter is available for a model, DDTree cannot be used. This is the main adoption constraint — the acceleration is only as good as the draft model that powers it.\n\n## Findings & Insights\n\nSee [BENCHMARKS.md](BENCHMARKS.md) for detailed results, including:\n\n- **What worked**: Metal kernels for tree-aware conv\u002Frecurrent verification, zero-cost commit via per-node state capture, eval sync point reduction\n- **What didn't work**: Attention-only tree verify (LM head needs all 64 layers), alternative tree shapes (chain, hybrid, root-wide), split prefix\u002Ftree attention, adaptive budget controller\n- **The fundamental constraint**: On hybrid models (75% recurrent layers), tree verification has limited parallelism. DDTree's advantage comes from better acceptance density -- the tree concentrates budget on the most probable tokens. 
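Pure-attention models (Llama, standard Qwen) would benefit more.\n\n## Citation\n\n```bibtex\n@article{ringel2025ddtree,\n  title={Accelerating Speculative Decoding with Block Diffusion Draft Trees},\n  author={Ringel, Liran and Romano, Yaniv},\n  year={2025},\n  url={https:\u002F\u002Fliranringel.github.io\u002Fddtree\u002F}\n}\n```\n\n## License\n\nMIT\n","DDTree-MLX is a tree-based speculative decoding project optimized for Apple Silicon. It is about 10-15% faster than DFlash on code generation and about 1.5x faster than autoregressive decoding. Its core feature is building a draft tree from per-position logits and verifying the entire tree in a single forward pass, which explores multiple possible continuations at once and increases the number of tokens accepted per verification cycle. The project is based on the paper \"Accelerating Speculative Decoding with Block Diffusion Draft Trees\" and is the first MLX port for Apple Silicon, with custom Metal kernels for hybrid model support. It suits workloads that need efficient text generation, especially code and structured content, where DDTree can noticeably increase generation speed without sacrificing output quality.","2026-06-11 02:47:43","CREATED_QUERY"]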