[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75875":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},75875,"orthrus","chiennv2000\u002Forthrus","chiennv2000","Fast, lossless LLM inference via dual-view diffusion decoding.","",null,"Python",418,17,14,4,0,10,206,3,3.77,"MIT License",false,"main",true,[26,27,28,29,30,31,32],"diffusion-language-models","efficient-inference","large-language-models","llm","llm-efficiency","model-architecture","natural-language-processing","2026-06-12 02:03:36","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Forthrus_logo.svg\" alt=\"Orthrus logo\" width=\"30%\"\u002F>\n\u003C\u002Fp>\n\n# Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion\n\nOfficial implementation and model checkpoints for **Orthrus**, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Forthrus.png\" width=\"80%\" alt=\"Orthrus Architecture\">\n\u003C\u002Fp>\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2a0b021c-e232-4ac6-bf5c-c582c422505e\n\n## Model Zoo\n \nAll models use a Qwen3 backbone and guarantee **strictly lossless generation**.\n \n| Model | Base Model | HuggingFace | Avg. 
Speedup |\n| :--- | :--- | :--- | :--- |\n| Orthrus-Qwen3-1.7B | Qwen3-1.7B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fchiennv\u002FOrthrus-Qwen3-1.7B) | 4.25× |\n| Orthrus-Qwen3-4B | Qwen3-4.0B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fchiennv\u002FOrthrus-Qwen3-4B) | 5.20× |\n| Orthrus-Qwen3-8B | Qwen3-8.0B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fchiennv\u002FOrthrus-Qwen3-8B) | 5.36× |\n \n---\n## Installation\n \n```bash\nuv pip install -e .\nuv pip install ninja packaging\nuv pip install flash-attn --no-build-isolation # or: pip install \"flash-attn-4[cu13]\" if your device supports it\n```\n \n> We recommend [`uv`](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) for fast dependency resolution.\n\n---\n \n## Quickstart\n\n> **⚡ Try instantly:** Run Orthrus directly in Colab:\n> [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wYUJRdURtab_4mn6-e1AdT-kTydrzZIz?usp=sharing)\n \n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer\n \n\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"chiennv\u002FOrthrus-Qwen3-8B\",\n    dtype=torch.bfloat16, device_map=\"cuda\",\n    attn_implementation=\"flash_attention_2\",  # options: sdpa | eager | flash_attention_4\n    trust_remote_code=True,\n).eval()\ntokenizer = AutoTokenizer.from_pretrained(\"chiennv\u002FOrthrus-Qwen3-8B\")\n \nprompt = \"Write a program to count the frequency of each word in a paragraph.\"\nmessages = [{\"role\": \"system\", \"content\": \"\"}, {\"role\": \"user\", \"content\": prompt}]\ninput_ids = tokenizer.apply_chat_template(messages, return_tensors=\"pt\", add_generation_prompt=True, enable_thinking=False).input_ids\n\noutput_ids = model.generate(\n    input_ids=input_ids.to(model.device), \n    max_new_tokens=2048,\n    use_diffusion_mode=True, \n    streamer=TextStreamer(tokenizer, 
skip_prompt=True) # enable streaming generation\n)\n```\n\n> **Coming soon:** Native integration with [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) and [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang). Stay tuned!\n \n## Key Advantages\n \n- **Significant Inference Acceleration:** Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\\times$ speedup on generation tasks.\n- **Strictly Lossless Generation:** Employs an exact intra-model consensus mechanism to guarantee that the output matches the original base model's exact predictive distribution.\n- **Zero Redundant Memory Overhead:** Both the autoregressive and diffusion views attend natively to the same high-fidelity Key-Value (KV) cache, resulting in only $O(1)$ cache memory overhead.\n- **Parameter Efficient:** Parallel generation capabilities are injected by fine-tuning only 16% of the total model parameters while keeping the base LLM strictly frozen.\n\n---\n\n## Performance Comparison: Orthrus vs. Speculative Decoding\n\nOrthrus outperforms speculative decoding methods such as EAGLE-3 and DFlash. By natively sharing the same KV cache across its dual views, Orthrus avoids the redundant memory overhead of draft models, resulting in significantly higher token acceptance rates and faster inference times, especially as context length scales. Orthrus maintains consistently high end-to-end throughput even at 40K context lengths, whereas DFlash degrades rapidly. \n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Facceptance_length.png\" width=\"48%\" alt=\"Average Acceptance Length Comparison\">\n  \u003Cimg src=\"assets\u002Flong_context_benchmark.png\" width=\"48%\" alt=\"Long Context Throughput Benchmark\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cem>\u003Cb>Left:\u003C\u002Fb> Average verified tokens per forward pass compared to EAGLE-3 and DFlash. 
\u003Cb>Right:\u003C\u002Fb> End-to-end throughput across scaling context lengths. \u003C\u002Fem>\n\u003C\u002Fp>\n\n---\n\n## Comparison with State-of-the-Art Diffusion Models\n\nWhile recent diffusion language models (dLLMs) offer parallel decoding, they often suffer from significant conditional drift and severe accuracy degradation on complex reasoning tasks. Orthrus resolves this by decoupling parallel generation from sequential constraints, establishing a new state-of-the-art for parallel generation fidelity.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fmath500_speed.png\" width=\"60%\" alt=\"Throughput vs. Accuracy on MATH-500\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cem>\u003Cb>Throughput vs. Accuracy on MATH-500.\u003C\u002Fb> Orthrus delivers a ~6x speedup over the Qwen3-8B baseline with strictly lossless performance, whereas adaptations like Fast-dLLM-v2 suffer significant accuracy drops.\u003C\u002Fem>\n\u003C\u002Fp>\n\n---\n\n## Further Support\n \n### MLX (Apple Silicon)\n \nOrthrus supports native inference on Apple Silicon via [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx). 
Tested with `mlx==0.31.2` and `mlx-lm==0.31.3`.\n\n**Usage:**\n \n```python\nfrom src.model_mlx import load_model_and_tokenizer, mlx_generate\n \nrepo_id = \"chiennv\u002FOrthrus-Qwen3-1.7B\"\nmodel, tokenizer = load_model_and_tokenizer(repo_id)\n \nprompt_tokens = tokenizer.encode(\"If a rectangle has length 12 and width 7, what is its area?\")\n \nfor token in mlx_generate(model, prompt_tokens, tokenizer.eos_token_id, max_tokens=128):\n    print(tokenizer.decode([token]), end=\"\", flush=True)\n```\n\n## Citation\n\nIf you find this model or architecture useful in your work, please cite our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12825):\n\n```bibtex\n@misc{vannguyen2026orthrusmemoryefficientparalleltoken,\n      title={Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion}, \n      author={Chien Van Nguyen and Chaitra Hegde and Van Cuong Pham and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},\n      year={2026},\n      eprint={2605.12825},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12825}, \n}\n```\n","Orthrus is a framework for fast, lossless large language model (LLM) inference via dual-view diffusion decoding. It combines the exact generation fidelity of autoregressive models with the high-speed parallel token generation of diffusion models, making it well suited to applications that need efficient and accurate text generation. The project is written in Python and provides several pretrained models, all built on a Qwen3 backbone and guaranteeing strictly lossless generation. Orthrus also supports techniques such as Flash Attention to further optimize performance. It is worth considering for academic research and for industrial applications that demand high LLM inference efficiency.",2,"2026-06-11 03:53:33","CREATED_QUERY"]