[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74173":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":8,"pushedAt":8,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},74173,"daVinci-MagiHuman","GAIR-NLP\u002FdaVinci-MagiHuman","GAIR-NLP",null,"Python",2045,211,15,22,0,11,23,56,33,90.08,false,"main",[],"2026-06-12 04:01:13","![cover](assets\u002Fcover.png)\n\n\n-----\n\n\u003Cdiv align=\"center\">\n\n# daVinci-MagiHuman\n\n### Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fplms.ai\">SII-GAIR\u003C\u002Fa> &nbsp;&amp;&nbsp; \u003Ca 
href=\"https:\u002F\u002Fsand.ai\">Sand.ai\u003C\u002Fa>\n\u003C\u002Fp>\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2603.21986-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.21986)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Demo-HuggingFace-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FSII-GAIR\u002FdaVinci-MagiHuman)\n[![Models](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Models-HuggingFace-yellow)](https:\u002F\u002Fhuggingface.co\u002FGAIR\u002FdaVinci-MagiHuman)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.12%2B-blue.svg)](https:\u002F\u002Fwww.python.org\u002F)\n[![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.10%2B-ee4c2c.svg)](https:\u002F\u002Fpytorch.org\u002F)\n\n\u003C\u002Fdiv>\n\n## ✨ Highlights\n\n- 🧠 **Single-Stream Transformer** — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. 
No cross-attention, no multi-stream complexity.\n- 🎭 **Exceptional Human-Centric Quality** — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.\n- 🌍 **Multilingual** — Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.\n- ⚡ **Blazing Fast Inference** — Generates a 5-second 256p video in **2 seconds** and a 5-second 1080p video in **38 seconds** on a single H100 GPU.\n- 🏆 **State-of-the-Art Results** — Achieves **80.0%** win rate vs Ovi 1.1 and **60.9%** vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.\n- 📦 **Fully Open Source** — We release the complete model stack: base model, distilled model, super-resolution model, and inference code.\n\n## 🎬 Demo\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7050a191-38ef-4e36-8b48-0084ccc694f1\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fc6cc056f-56ca-4285-80f3-bb6052228d23\n\n\u003Ctable>\n\u003Ctr valign=\"top\">\n\u003Ctd width=\"33%\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F584d4e13-9956-4ef0-8867-2c78efeac5aa\" controls muted width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fc5f87f3a-f121-4f34-8d41-8c4b1c24b5e6\" controls muted width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0fb467e8-e3a4-4155-9d6b-10b2e018bd7f\" controls muted width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\u003Ctable>\n\u003Ctr valign=\"top\">\n\u003Ctd width=\"50%\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F956d55ce-72cf-4dd4-a29e-ea2c3f725864\" controls muted width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003Ctd 
width=\"50%\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7db9db31-617e-44a6-b2df-99d47accba22\" controls muted width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## 🏗️ Architecture\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Farchitecture.png\" width=\"90%\">\n\u003C\u002Fdiv>\n\ndaVinci-MagiHuman uses a single-stream Transformer that takes text tokens, an optional reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.\n\nKey design choices:\n\n| Component | Description |\n|---|---|\n| 🥪 **Sandwich Architecture** | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities |\n| 🕐 **Timestep-Free Denoising** | No explicit timestep embeddings — the model infers the denoising state directly from input latents |\n| 🔀 **Per-Head Gating** | Learned scalar gates with sigmoid activation on each attention head for training stability |\n| 🔗 **Unified Conditioning** | Denoising and reference signals handled through a minimal unified interface — no dedicated conditioning branches |\n\n## 📊 Performance\n\n### Quantitative Quality Benchmark\n\n| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |\n|---|:---:|:---:|:---:|:---:|\n| OVI 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |\n| LTX 2.3 | 4.76 | 4.12 | **4.56** | 19.23% |\n| **daVinci-MagiHuman** | **4.80** | **4.18** | 4.52 | **14.60%** |\n\n### Human Evaluation (2,000 Pairwise Comparisons)\n\n| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |\n|---|:---:|:---:|:---:|\n| vs Ovi 1.1 | **80.0%** | 8.2% | 11.8% |\n| vs LTX 2.3 | **60.9%** | 17.2% | 21.9% |\n\n### Inference Speed (5-second video, on a single H100 GPU)\n\n| Resolution | Base (s) | Super-Res (s) | Decode (s) | **Total (s)** |\n|---|:---:|:---:|:---:|:---:|\n| 256p | 1.6 | — | 0.4 | **2.0** |\n| 540p | 1.6 | 5.1 
| 1.3 | **8.0** |\n| 1080p | 1.6 | 31.0 | 5.8 | **38.4** |\n\n## 🚀 Efficient Inference Techniques\n\n- ⚡ **Latent-Space Super-Resolution** — Two-stage pipeline: generate at low resolution, then refine in latent space, avoiding an extra VAE decode-encode round trip.\n- 🔄 **Turbo VAE Decoder** — A lightweight re-trained decoder that substantially reduces decoding overhead.\n- 🔧 **Full-Graph Compilation** — [MagiCompiler](https:\u002F\u002Fgithub.com\u002FSandAI-org\u002FMagiCompiler) fuses operators across Transformer layers for ~1.2x speedup.\n- 💨 **Distillation** — DMD-2 distillation enables generation with only 8 denoising steps (no CFG), without sacrificing quality.\n\n## 📦 Getting Started\n\n### Option 1: Docker (Recommended)\n\n```bash\n# Recommended: use the prebuilt MagiHuman image (supports full pipeline including SR 1080p)\ndocker pull sandai\u002Fmagi-human:latest\n\ndocker run -it --gpus all --network host --ipc host \\\n  -v \u002Fpath\u002Fto\u002Frepos:\u002Fworkspace \\\n  -v \u002Fpath\u002Fto\u002Fcheckpoints:\u002Fmodels \\\n  --name my-magi-human \\\n  sandai\u002Fmagi-human:latest \\\n  bash\n\n# Install MagiCompiler\ngit clone https:\u002F\u002Fgithub.com\u002FSandAI-org\u002FMagiCompiler.git\ncd MagiCompiler\npip install -r requirements.txt\npip install .\ncd ..\n\n# Clone daVinci-MagiHuman\ngit clone https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FdaVinci-MagiHuman\ncd daVinci-MagiHuman\n```\n\nIf you prefer manual setup, follow Option 2 (Conda) below.\n\n### Option 2: Conda\n\n```bash\n# Create environment\nconda create -n davinci-magihuman python=3.12\nconda activate davinci-magihuman\nconda install ffmpeg\n\n# Install PyTorch\npip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0\n\n# Install Flash Attention (Hopper)\ngit clone https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\ncd flash-attention\u002Fhopper && python setup.py install && cd ..\u002F..\n\n# Install MagiCompiler\ngit clone 
https:\u002F\u002Fgithub.com\u002FSandAI-org\u002FMagiCompiler.git\ncd MagiCompiler\npip install -r requirements.txt\npip install .\ncd ..\n\n# Clone and install daVinci-MagiHuman\ngit clone https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FdaVinci-MagiHuman\ncd daVinci-MagiHuman\npip install -r requirements.txt\npip install --no-deps -r requirements-nodeps.txt\n\n# Optional (only for sr-1080p): Install MagiAttention\ngit clone --recursive https:\u002F\u002Fgithub.com\u002FSandAI-org\u002FMagiAttention.git\ncd MagiAttention\ngit checkout v1.0.5\ngit submodule update --init --recursive\npip install -r requirements.txt\npip install --no-build-isolation .\n```\n\n### Download Model Checkpoints\n\nDownload the complete model stack from [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FGAIR\u002FdaVinci-MagiHuman) and update the paths in the config files under `example\u002F`.\n\nYou will also need the following external models:\n\n| Model | Source |\n|---|---|\n| Text Encoder | [t5gemma-9b-9b-ul2](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Ft5gemma-9b-9b-ul2) |\n| Audio Model | [stable-audio-open-1.0](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstable-audio-open-1.0) |\n| VAE | [Wan2.2-TI2V-5B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B) |\n\n## 🎯 Usage\n\nBefore running, update the checkpoint paths in the config files (`example\u002F*\u002Fconfig.json`) to point to your local model directory.\n\n> **Note:** The first run will be slower due to model compilation and cache warmup. 
Subsequent runs will match the reported inference speeds.\n\n### Input Modes\n\n- **T2V** — Provide `--prompt` only and omit `--image_path`.\n- **TI2V** — Provide both `--prompt` and `--image_path`.\n\n### Example Scripts\n\n**Base Model (256p)**\n```bash\nbash example\u002Fbase\u002Frun_T2V.sh   # T2V\nbash example\u002Fbase\u002Frun_TI2V.sh  # TI2V\n```\n\n**Distilled Model (256p, 8 steps, no CFG)**\n```bash\nbash example\u002Fdistill\u002Frun_T2V.sh\nbash example\u002Fdistill\u002Frun_TI2V.sh\n```\n\n**Super-Resolution to 540p**\n```bash\nbash example\u002Fsr_540p\u002Frun_T2V.sh\nbash example\u002Fsr_540p\u002Frun_TI2V.sh\n```\n\n**Super-Resolution to 1080p**\n```bash\nbash example\u002Fsr_1080p\u002Frun_T2V.sh\nbash example\u002Fsr_1080p\u002Frun_TI2V.sh\n```\n\n### CLI Mode Selection\n\n- If `--image_path` is omitted, `inference\u002Fpipeline\u002Fentry.py` runs **T2V**.\n- If `--image_path` is provided, `inference\u002Fpipeline\u002Fentry.py` runs **TI2V**.\n- The T2V and TI2V scripts under the same example directory reuse the same checkpoint\u002Fconfig stack. The only difference is whether `--image_path` is passed.\n\n## ✍️ Prompt Guidance\n \ndaVinci-MagiHuman uses an **Enhanced Prompt** system that rewrites user inputs into detailed performance directions optimized for avatar-style video generation. For the full system prompt specification, see [`prompts\u002Fenhanced_prompt_design.md`](prompts\u002Fenhanced_prompt_design.md).\n\nBelow is a quick reference for writing effective prompts.\n\n### Output Structure\n \nEvery enhanced prompt has **three parts**:\n \n1. **Main Body** (150–200 words) — A clinical, chronological description of the character's appearance, facial dynamics, vocal delivery, and static cinematography. Written in English regardless of dialogue language.\n \n2. **Dialogue** — Repeats all spoken lines in a structured format:\n   ```\n   Dialogue:\n   \u003Ccharacter description, language>: \"Line content\"\n   ```\n \n3. 
**Background Sound** — Specifies the most prominent ambient sound:\n   ```\n   Background Sound:\n   \u003CDescription of the background sound>\n   ```\n   Use `\u003CNo prominent background sound>` if none.\n\n### Quick Example\n \n**User input:** A man in a yellow shirt says \"有的人在一起生活一辈子，还带着假面具呢\"\n \n**Enhanced prompt (abbreviated):**\n \n> A young man with short dark hair, wearing a bright yellow polo shirt, sits stationary. His disposition is earnest and slightly agitated... He speaks with a rapid, emphatic tone, his mouth opening wide as he says, \"有 的 人 在 一 起 生 活 一 辈 子，还 带 着 假 面 具 呢...\" His brow furrows, lip muscles showing distinct dynamics...\n>\n> Dialogue:\n> \\\u003CYoung man in yellow polo, Mandarin\\>: \"有 的 人 在 一 起 生 活 一 辈 子，还 带 着 假 面 具 呢...\"\n>\n> Background Sound:\n> \\\u003CNo prominent background sound\\>\n\n## 🙏 Acknowledgements\n\nWe thank the open-source community, and in particular [Wan2.2](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2) and [Turbo-VAED](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FTurbo-VAED), for their valuable contributions.\n\n## 📄 License\n\nThis project is released under the [Apache License 2.0](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0).\n\n## 📖 Citation\n\n```bibtex\n@article{davinci-magihuman-2026,\n  title={Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},\n  author={SII-GAIR and Sand. 
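ai and Chern, Ethan and Teng, Hansi and Sun, Hanwen and Wang, Hao and Pan, Hong and Jia, Hongyu and Su, Jiadi and Li, Jin and Yu, Junjie and Liu, Lijie and Li, Lingzhi and Ye, Lyumanshan and Hu, Min and Wang, Qiangang and Qi, Quanwei and Chern, Steffi and Bu, Tao and Wang, Taoran and Xu, Teren and Zhang, Tianning and Mi, Tiantian and Xu, Weixian and Zhang, Wenqiang and Zhang, Wentai and Yi, Xianping and Cai, Xiaojie and Kang, Xiaoyang and Ma, Yan and Liu, Yixiu and Zhang, Yunbo and Huang, Yunpeng and Lin, Yutong and Tao, Zewei and Liu, Zhaoliang and Zhang, Zheng and Cen, Zhiyao and Yu, Zhixuan and Wang, Zhongshu and Hu, Zhulin and Zhou, Zijin and Guo, Zinan and Cao, Yue and Liu, Pengfei},\n  journal={arXiv preprint arXiv:2603.21986},\n  year={2026}\n}\n```\n","daVinci-MagiHuman is a fast audio-video generative foundation model built on a single-stream architecture. Its core is a 15B-parameter, 40-layer Transformer that processes text, video, and audio using self-attention only, delivering high-quality facial expressions, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization. It supports multiple languages, including Chinese (Mandarin and Cantonese), English, and Japanese, and achieves very fast inference on a single H100 GPU. The model performs well in human comparison tests, with a win rate of up to 80%. It suits scenarios that need efficient generation of high-quality multimodal content, such as virtual character creation and video synthesis. The code is fully open source, including the base model, distilled model, super-resolution model, and inference code.",2,"2026-06-11 03:49:22","high_star"]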