[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83798":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":9,"trendingCount":15,"starSnapshotCount":15,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},83798,"Domino","jianuo-huang\u002FDomino","jianuo-huang","Official implementation of “Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding”.",null,"Python",61,3,51,1,0,7,5,43.01,false,"main",[],"2026-06-12 04:01:41","# Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.29707\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv%3A2605.29707-blue\" alt=\"Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FHuang2020\u002Fdomino\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Models-yellow\" alt=\"Hugging Face Models\">\u003C\u002Fa>\n\u003C\u002Fp>\nDomino is a speculative decoding method that keeps draft generation block-parallel while adding a lightweight causal correction head to improve draft-token acceptance.\n\n![Domino pipeline](asset\u002Fpipeline.png)\n\n## News\n\n- [2026-05-30] 🔥🔥 Domino training code released! The training implementation is now available in SpecForge via [sgl-project\u002FSpecForge#571](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002FSpecForge\u002Fpull\u002F571).\n- [2026-05-29] 🔥 Domino paper released! Read the paper on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.29707).\n\n## Demo\n\n![Domino throughput demo](asset\u002FDFlash_demo.gif)\n\n## Supported Models\n\n| Target model | Draft model |\n| --- | --- |\n| `Qwen\u002FQwen3-4B` | [`Huang2020\u002FQwen3-4B-Domino-b16`](https:\u002F\u002Fhuggingface.co\u002FHuang2020\u002FQwen3-4B-Domino-b16) |\n| `Qwen\u002FQwen3-8B` | [`Huang2020\u002FQwen3-8B-Domino-b16`](https:\u002F\u002Fhuggingface.co\u002FHuang2020\u002FQwen3-8B-Domino-b16) |\n\n## Installation\n\nUse Python 3.10 or newer on a CUDA GPU machine. Install a PyTorch build that matches your CUDA driver, then install the remaining Hugging Face benchmark dependencies:\n\n```bash\npython -m pip install --upgrade pip\npython -m pip install -r requirements-hf.txt\n```\n\nFor the SGLang benchmark, install the extra build tools first. On Ubuntu:\n\n```bash\nsudo apt-get update\nsudo apt-get install -y build-essential ninja-build protobuf-compiler\n```\n\nThe SGLang branch also builds a Rust component. Install Rust if `cargo` is not already available:\n\n```bash\ncurl --proto '=https' --tlsv1.2 -sSf https:\u002F\u002Fsh.rustup.rs | sh -s -- -y\nsource \"$HOME\u002F.cargo\u002Fenv\"\n```\n\nThen install the Domino-compatible SGLang branch in the same Python environment:\n\n```bash\ngit clone --branch sglang-feat\u002Fdflash-domino https:\u002F\u002Fgithub.com\u002Fjianuo-huang\u002FDomino.git sglang-domino\ncd sglang-domino\npython -m pip install -e .\u002Fpython\npython -m pip install --force-reinstall --no-deps sglang-kernel \\\n  --index-url https:\u002F\u002Fdocs.sglang.ai\u002Fwhl\u002Fcu130\u002F\ncd -\n```\n\nThis SGLang branch currently resolves to PyTorch 2.11 CUDA 13 wheels. Use the matching SGLang kernel wheel above, and verify that your NVIDIA driver is new enough for CUDA 13 runtime libraries.\n\nFor CUDA 12.8 \u002F PyTorch 2.9, patch the SGLang dependency pins before installing:\n\n```bash\ngit clone --branch sglang-feat\u002Fdflash-domino https:\u002F\u002Fgithub.com\u002Fjianuo-huang\u002FDomino.git sglang-domino\ncd sglang-domino\n\npython -m pip install --upgrade pip\n\nsed -i \\\n  -e 's\u002F\"torch==2.11.0\"\u002F\"torch==2.9.1+cu128\"\u002F' \\\n  -e 's\u002F\"torchaudio==2.11.0\"\u002F\"torchaudio==2.9.1+cu128\"\u002F' \\\n  -e 's\u002F\"torchvision\"\u002F\"torchvision==0.24.1+cu128\"\u002F' \\\n  -e 's\u002F\"kernels\"\u002F\"kernels==0.14.1\"\u002F' \\\n  -e '\u002F\"sglang-kernel==0.4.2\"\u002Fd' \\\n  python\u002Fpyproject.toml\n\npython -m pip install \\\n  --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128 \\\n  -e .\u002Fpython\npython -m pip install --force-reinstall --no-deps \"${SGLANG_KERNEL_CU12_WHEEL}\"\ncd -\n```\n\nSet `SGLANG_KERNEL_CU12_WHEEL` to a CUDA-12-compatible `sglang-kernel` wheel before running the last command. Do not install the `cu130` wheel in a PyTorch 2.9\u002Fcu128 environment.\n\n## Quick Usage\n\nDomino draft checkpoints provide `spec_generate` for direct speculative decoding with a target model. We currently recommend running this path on one GPU.\n\n```python\nfrom transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer\n\ndraft_model = AutoModel.from_pretrained(\n    \"Huang2020\u002FQwen3-8B-Domino-b16\",\n    trust_remote_code=True,\n    dtype=\"auto\",\n    device_map=\"cuda:0\",\n).eval()\n\ntarget_model = AutoModelForCausalLM.from_pretrained(\n    \"Qwen\u002FQwen3-8B\",\n    dtype=\"auto\",\n    device_map=\"cuda:0\",\n).eval()\n\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen\u002FQwen3-8B\")\nprompt = \"How many positive whole-number divisors does 196 have?\"\nmessages = [{\"role\": \"user\", \"content\": prompt}]\n\n# The Domino draft model is trained for Qwen3 with thinking mode disabled.\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n    enable_thinking=False,\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(draft_model.device)\n\noutput_ids = draft_model.spec_generate(\n    input_ids=model_inputs[\"input_ids\"],\n    target=target_model,\n    max_new_tokens=2048,\n    temperature=0.0,\n    stop_token_ids=[tokenizer.eos_token_id],\n)\n\ngenerated_ids = output_ids[:, model_inputs[\"input_ids\"].shape[1]:]\nprint(tokenizer.decode(generated_ids[0], skip_special_tokens=True))\n```\n\n## Hugging Face Benchmark\n\n```bash\nDRAFT_MODEL=Huang2020\u002FQwen3-8B-Domino-b16 \\\nTARGET_MODEL=Qwen\u002FQwen3-8B \\\nPYTHON=python \\\n.\u002Frun_hf_benchmark.sh\n```\n\nDefaults:\n\n- `TASKS=gsm8k:128`\n- `MAX_NEW_TOKENS=2048`\n- `TEMPERATURE=0.0`\n- `BLOCK_SIZE=16`\n- `NUM_GPUS=8`\n\nOverride tasks or runtime settings with environment variables:\n\n```bash\nTASKS=\"gsm8k:128,math500:128\" NUM_GPUS=4 .\u002Frun_hf_benchmark.sh\n```\n\n## SGLang Benchmark\n\n```bash\nDRAFT_MODEL=Huang2020\u002FQwen3-8B-Domino-b16 \\\nTARGET_MODEL=Qwen\u002FQwen3-8B \\\nPYTHON=python \\\n.\u002Frun_sglang_benchmark.sh\n```\n\nDefaults:\n\n- `TASKS=gsm8k:128`\n- `MAX_NEW_TOKENS=2048`\n- `TEMPERATURE=0.0`\n- `CONCURRENCIES=1,2,4,8,16,32`\n\nUse these sample counts to reproduce the paper settings:\n\n```bash\nTASKS=\"gsm8k:128,math500:128,aime24:30,aime25:30,humaneval:164,mbpp:128,livecodebench:128,swe-bench:128,mt-bench:80,alpaca:128\"\n```\n\nOverride tasks or runtime settings with environment variables:\n\n```bash\nTASKS=\"mt-bench:80,alpaca:128\" CONCURRENCIES=1 .\u002Frun_sglang_benchmark.sh\n```\n\n## Acknowledgements\n\nWe thank the authors and maintainers of [DFlash](https:\u002F\u002Fgithub.com\u002Fz-lab\u002Fdflash), [SpecForge](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002FSpecForge), [FlashInfer](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer), and [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang). Their open-source work on block-parallel speculative decoding, speculative-decoding training infrastructure, high-performance attention kernels, and LLM serving helped shape this project and its benchmarking setup.\n\n## Citation\n\nIf you use Domino in your research, please cite:\n\n```bibtex\n@article{huang2026domino,\n  title={Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding},\n  author={Huang, Jianuo and Zhang, Yaojie and Zhang, Qituan and Lin, Hao and Xu, Hanlin and Zhang, Linfeng},\n  journal={arXiv preprint arXiv:2605.29707},\n  year={2026}\n}\n```\n",2,"2026-06-11 04:11:29","CREATED_QUERY"]