[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11197":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":9,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},11197,"dflash","z-lab\u002Fdflash","z-lab","DFlash: Block Diffusion for Flash Speculative Decoding",null,"https:\u002F\u002Fgithub.com\u002Fz-lab\u002Fdflash","Python",5016,363,38,64,0,68,261,612,204,110.68,false,"main","2026-06-12 04:00:54","# DFlash: Block Diffusion for Flash Speculative Decoding\n[**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06036) | [**Blog**](https:\u002F\u002Fz-lab.ai\u002Fprojects\u002Fdflash\u002F) | [**Models**](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fz-lab\u002Fdflash)\n\n**DFlash** is a lightweight **block diffusion** model designed for speculative decoding. 
It enables efficient and high-quality parallel drafting.\n\n![DFlash Architecture](https:\u002F\u002Fraw.githubusercontent.com\u002Fjianc99\u002Fjianc99.github.io\u002Fmaster\u002Fimages\u002Fdflash_system.png)\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F5b29cabb-eb95-44c9-8ffe-367c0758de8c\n\n## Supported Models\n\n| Model | DFlash Draft |\n|---|---|\n| gemma-4-26B-A4B-it | [z-lab\u002Fgemma-4-26B-A4B-it-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002Fgemma-4-26B-A4B-it-DFlash) |\n| gemma-4-31B-it | [z-lab\u002Fgemma-4-31B-it-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002Fgemma-4-31B-it-DFlash) |\n| Qwen3.6-27B | [z-lab\u002FQwen3.6-27B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.6-27B-DFlash) |\n| Qwen3.6-35B-A3B | [z-lab\u002FQwen3.6-35B-A3B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.6-35B-A3B-DFlash) |\n| MiniMax-M2.5 (Preview) | [z-lab\u002FMiniMax-M2.5-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FMiniMax-M2.5-DFlash) |\n| Kimi-K2.5 | [z-lab\u002FKimi-K2.5-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FKimi-K2.5-DFlash) |\n| Qwen3.5-4B | [z-lab\u002FQwen3.5-4B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-4B-DFlash) |\n| Qwen3.5-9B | [z-lab\u002FQwen3.5-9B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-9B-DFlash) |\n| Qwen3.5-27B | [z-lab\u002FQwen3.5-27B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-27B-DFlash) |\n| Qwen3.5-35B-A3B | [z-lab\u002FQwen3.5-35B-A3B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-35B-A3B-DFlash) |\n| Qwen3.5-122B-A10B | [z-lab\u002FQwen3.5-122B-A10B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.5-122B-A10B-DFlash) |\n| Qwen3-Coder-Next | [z-lab\u002FQwen3-Coder-Next-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3-Coder-Next-DFlash) |\n| Qwen3-Coder-30B-A3B | 
[z-lab\u002FQwen3-Coder-30B-A3B-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3-Coder-30B-A3B-DFlash) |\n| gpt-oss-20b | [z-lab\u002Fgpt-oss-20b-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002Fgpt-oss-20b-DFlash) |\n| gpt-oss-120b | [z-lab\u002Fgpt-oss-120b-DFlash](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002Fgpt-oss-120b-DFlash) |\n| Qwen3-4B (non-thinking) | [z-lab\u002FQwen3-4B-DFlash-b16](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3-4B-DFlash-b16) |\n| Qwen3-8B (non-thinking) | [z-lab\u002FQwen3-8B-DFlash-b16](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3-8B-DFlash-b16) |\n| Llama-3.1-8B-Instruct | [z-lab\u002FLLaMA3.1-8B-Instruct-DFlash-UltraChat](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FLLaMA3.1-8B-Instruct-DFlash-UltraChat) |\n| DeepSeek-V4-Flash | Coming soon |\n| DeepSeek-V4-Pro | Coming soon |\n| MiniMax-M2.7 | Coming soon |\n| GLM-5.1 | Coming soon |\n\n> Feel free to open a GitHub issue to request support for additional models. We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.\n\n## 📦 Installation\n\nUse a separate virtual environment for each backend to avoid conflicts.\n\n| Backend | Install command |\n|---|---|\n| **Transformers** | `uv pip install -e \".[transformers]\"` |\n| **SGLang** | `uv pip install -e \".[sglang]\"` |\n| **vLLM** | See below |\n| **MLX** (Apple Silicon) | `pip install -e \".[mlx]\"` |\n\n**vLLM:** vLLM v0.20.1+ includes core DFlash support. Use the standard install for most models:\n```bash\nuv pip install -e \".[vllm]\"\n```\n\nGemma4 DFlash currently needs our temporary vLLM Gemma4 build. 
Docker is recommended:\n```bash\ndocker pull ghcr.io\u002Fz-lab\u002Fvllm-openai:gemma4-dflash-cu130\n```\n\nSource fallback for Gemma4:\n```bash\nuv pip install -U --torch-backend=auto \\\n  \"vllm @ git+https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm.git@refs\u002Fpull\u002F41703\u002Fhead\"\n```\n\nNewer non-Gemma4 SWA draft models use the SWA support branch:\n```bash\nuv pip install -U --torch-backend=auto \\\n  \"vllm @ git+https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm.git@refs\u002Fpull\u002F40898\u002Fhead\"\n```\n\n## 🚀 Quick Start\n\n### vLLM\n\nGemma4 with Docker:\n```bash\ndocker run --rm -it \\\n  --gpus all \\\n  --ipc=host \\\n  --shm-size=16g \\\n  -p 8000:8000 \\\n  -v ~\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n  ghcr.io\u002Fz-lab\u002Fvllm-openai:gemma4-dflash-cu130 \\\n  google\u002Fgemma-4-26B-A4B-it \\\n  --host 0.0.0.0 \\\n  --port 8000 \\\n  --speculative-config '{\"method\": \"dflash\", \"model\": \"z-lab\u002Fgemma-4-26B-A4B-it-DFlash\", \"num_speculative_tokens\": 15, \"attention_backend\": \"flash_attn\"}' \\\n  --attention-backend triton_attn \\\n  --max-num-batched-tokens 32768 \\\n  --trust-remote-code\n```\n\nNon-Gemma4 models:\n```bash\nvllm serve Qwen\u002FQwen3.5-27B \\\n  --speculative-config '{\"method\": \"dflash\", \"model\": \"z-lab\u002FQwen3.5-27B-DFlash\", \"num_speculative_tokens\": 15}' \\\n  --attention-backend flash_attn \\\n  --max-num-batched-tokens 32768\n```\n\n### SGLang\n\n```bash\nexport SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1\n\n# Optional: enable schedule overlapping (experimental, may not be stable)\n# export SGLANG_ENABLE_SPEC_V2=1\n# export SGLANG_ENABLE_DFLASH_SPEC_V2=1\n# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1\n\npython -m sglang.launch_server \\\n    --model-path Qwen\u002FQwen3.5-35B-A3B \\\n    --speculative-algorithm DFLASH \\\n    --speculative-draft-model-path z-lab\u002FQwen3.5-35B-A3B-DFlash \\\n    --speculative-num-draft-tokens 16 \\\n   
 --tp-size 1 \\\n    --attention-backend trtllm_mha \\\n    --speculative-draft-attention-backend fa4 \\\n    --mem-fraction-static 0.75 \\\n    --mamba-scheduler-strategy extra_buffer \\\n    --trust-remote-code\n```\n\n### Transformers\n\nOnly Qwen3 and LLaMA-3.1 models support the Transformers backend.\n\n```python\nfrom transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer\n\ndraft = AutoModel.from_pretrained(\"z-lab\u002FQwen3-8B-DFlash-b16\", trust_remote_code=True, dtype=\"auto\", device_map=\"cuda:0\").eval()\ntarget = AutoModelForCausalLM.from_pretrained(\"Qwen\u002FQwen3-8B\", dtype=\"auto\", device_map=\"cuda:0\").eval()\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen\u002FQwen3-8B\")\n\nmessages = [{\"role\": \"user\", \"content\": \"How many positive whole-number divisors does 196 have?\"}]\ninput_ids = tokenizer.apply_chat_template(messages, return_tensors=\"pt\", add_generation_prompt=True, enable_thinking=False).to(draft.device)\n\noutput = draft.spec_generate(input_ids=input_ids, max_new_tokens=2048, temperature=0.0, target=target, stop_token_ids=[tokenizer.eos_token_id])\nprint(tokenizer.decode(output[0], skip_special_tokens=False))\n```\n\n### MLX (Apple Silicon)\n\nThere have been many great community DFlash implementations on MLX; we provide a simple and efficient one here, tested on an Apple M5 Pro with Qwen3, Qwen3.5 and Gemma-4 models.\n\n```python\nfrom dflash.model_mlx import load, load_draft, stream_generate\n\nmodel, tokenizer = load(\"Qwen\u002FQwen3.5-4B\")\ndraft = load_draft(\"z-lab\u002FQwen3.5-4B-DFlash\")\n\nmessages = [{\"role\": \"user\", \"content\": \"How many positive whole-number divisors does 196 have?\"}]\nprompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)\ntps = 0.0\nfor r in stream_generate(model, draft, tokenizer, prompt, block_size=16, max_tokens=2048, temperature=0.6):\n    print(r.text, end=\"\", flush=True)\n    tps = 
r.generation_tps\nprint(f\"\\nThroughput: {tps:.2f} tok\u002Fs\")\n```\n\n## 📊 Evaluation\n\nAll benchmarks share the same datasets (gsm8k, math500, humaneval, mbpp, mt-bench). Datasets are automatically downloaded and cached as JSONL in `cache\u002F` on first run.\n\n**vLLM**:\n```bash\npython -m dflash.benchmark --backend vllm \\\n    --base-url http:\u002F\u002F127.0.0.1:8000 --model Qwen\u002FQwen3.5-27B \\\n    --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking\n```\n\n**SGLang**:\n```bash\npython -m dflash.benchmark --backend sglang \\\n    --base-url http:\u002F\u002F127.0.0.1:30000 --model Qwen\u002FQwen3.5-35B-A3B \\\n    --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking\n```\n\n**Transformers** (Qwen3 and LLaMA only):\n```bash\ntorchrun --nproc_per_node=8 -m dflash.benchmark --backend transformers \\\n    --model Qwen\u002FQwen3-8B --draft-model z-lab\u002FQwen3-8B-DFlash-b16 \\\n    --dataset gsm8k --max-samples 128\n```\n\n**MLX**:\n```bash\npython -m dflash.benchmark --backend mlx \\\n    --model mlx-community\u002Fgemma-4-31b-it-4bit --draft-model z-lab\u002Fgemma-4-31B-it-DFlash \\\n    --dataset gsm8k --max-samples 128 --enable-thinking\n```\n\n## Acknowledgement\n\nHuge thanks to [@dcw02](https:\u002F\u002Fgithub.com\u002Fdcw02), [@gongy](https:\u002F\u002Fgithub.com\u002Fgongy), and the team at [@modal-labs](https:\u002F\u002Fgithub.com\u002Fmodal-labs) for their fast, high-quality support in bringing DFlash to SGLang. And huge thanks as well to [@benchislett](https:\u002F\u002Fgithub.com\u002Fbenchislett) at NVIDIA for his work in bringing DFlash to vLLM and helping make it available to the broader serving community.\n\n## Citation\nIf you find DFlash useful, please cite our work. 
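To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https:\u002F\u002Fforms.gle\u002F4YNwfqb4nJdqn6hq9).\n\n```bibtex\n@article{chen2026dflash,\n  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},\n  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},\n  journal = {arXiv preprint arXiv:2602.06036},\n  year    = {2026}\n}\n```\n","DFlash is a lightweight block diffusion model designed for speculative decoding that enables efficient, high-quality parallel drafting. The project uses advanced block diffusion techniques, is implemented in Python, and supports accelerating a variety of large language models (LLMs). Its core features include, among others, support for model families such as Gemma and Qwen, which makes it well suited to applications that need fast text generation or large-scale text processing. DFlash also plans to open-source its training recipe so that users can train their own DFlash models for their specific needs, further broadening its range of applications. The project is particularly suitable for researchers and developers in natural language processing who want an efficient text generation solution.",2,"2026-06-11 03:31:18","trending"]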