[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-4601":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},4601,"Hallo-Live","fudan-generative-vision\u002FHallo-Live","fudan-generative-vision","Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.23632",null,"Python",222,35,17,6,0,24,57,160,72,4.67,"MIT License",false,"main",[26,27,28],"audio-video-gen","avatars","diffusion-models","2026-06-12 02:01:02","\u003C!-- \u003Ch1 align=\"center\">Hallo-Live\u003C\u002Fh1> -->\n\u003Ch1 align=\"center\">Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation\u003C\u002Fh1>\n\n\u003Cdiv align='center'>\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fchunyu-li\" target=\"_blank\">Chunyu Li\u003C\u002Fa>\u003Csup>1,2,*\u003C\u002Fsup> &emsp;\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FHallo-Live\" target=\"_blank\">Jiaye Li\u003C\u002Fa>\u003Csup>2,*\u003C\u002Fsup> &emsp;\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FHallo-Live\" target=\"_blank\">Ruiqiao Mei\u003C\u002Fa>\u003Csup>2\u003C\u002Fsup> &emsp;\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FHallo-Live\" target=\"_blank\">Haoyuan Xia\u003C\u002Fa>\u003Csup>1,3\u003C\u002Fsup>\n\u003C\u002Fdiv>\n\u003Cdiv 
align='center'>\n\u003Ca href=\"http:\u002F\u002Fzhuhao.cc\u002Fhome\u002F\" target=\"_blank\">Hao Zhu\u003C\u002Fa>\u003Csup>4\u003C\u002Fsup> &emsp;\n\u003Ca href=\"https:\u002F\u002Fjingdongwang2017.github.io\u002F\" target=\"_blank\">Jingdong Wang\u003C\u002Fa>\u003Csup>5\u003C\u002Fsup> &emsp;\n\u003Ca href=\"https:\u002F\u002Fsites.google.com\u002Fsite\u002Fzhusiyucs\u002Fhome\" target=\"_blank\">Siyu Zhu\u003C\u002Fa>\u003Csup>1,2,&dagger;\u003C\u002Fsup>\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\n\u003Cdiv align='center'>\n\u003Csup>1\u003C\u002Fsup>Shanghai Innovation Institute &emsp;\n\u003Csup>2\u003C\u002Fsup>Fudan University\n\u003C\u002Fdiv>\n\u003Cdiv align='center'>\n\u003Csup>3\u003C\u002Fsup>University of Science and Technology of China &emsp;\n\u003Csup>4\u003C\u002Fsup>Nanjing University &emsp;\n\u003Csup>5\u003C\u002Fsup>Baidu\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\n\u003Cdiv align=\"center\">\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2604.23632-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.23632)\n[![HuggingFace](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-Model-yellow)](https:\u002F\u002Fhuggingface.co\u002Ffudan-generative-ai\u002FHallo-Live)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg)](LICENSE)\n\n\u003C\u002Fdiv>\n\n## 📖 Introduction\n\nWe present *Hallo-Live*, a real-time text-driven joint audio-video avatar generation framework. The method adopts a causal dual-stream DiT model to generate synchronized avatar video and speech in a streaming manner. *Hallo-Live* reaches **20.38 FPS** with **0.94 s latency** on two NVIDIA H200 GPUs, while preserving strong lip-sync accuracy, visual fidelity, and speech quality.\n\n## 🏗️ Framework\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"docs\u002Fframework.png\" width=100%>\n\u003C\u002Fp>\n\nThe framework of *Hallo-Live*. 
**Top left**: Stage I training adapts a pretrained dual-stream DiT to the streaming setting using a cross-modal future-expanding block-causal mask. **Bottom left**: Stage II training performs autoregressive self-rollout with the audio-video KV cache and optimizes the generated trajectory with reward-weighted dual-stream DMD. **Right**: Each causal fusion block in the dual-stream DiT consists of cross-modal attention between the video and audio streams. The block-causal masks are used in Stage I ODE initialization, and a KV cache is maintained for Stage II self-rollout and streaming inference.\n\n## 🎬 Demo\n\n### Main Demo\n\nThe main demo showcases Hallo-Live’s real-time text-driven audio-video generation capabilities across anime-style characters, realistic human subjects, and multi-speaker scenarios.\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F21b91995-610e-4733-b31c-79fb063fcf02\n\nThe following demos show each individual video along with its corresponding prompt. Click the prompt preview to expand the full text.\n\n\u003Ctable class=\"center\" width=\"100%\">\n  \u003Ccolgroup>\n    \u003Ccol width=\"50%\">\n    \u003Ccol width=\"50%\">\n  \u003C\u002Fcolgroup>\n  \u003Ctr style=\"font-weight: bolder;text-align:center;\">\n        \u003Ctd width=\"50%\">\u003Cb>Input Prompt\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd width=\"50%\">\u003Cb>Generated Video\u003C\u002Fb>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\n      \u003Cdetails>\n        \u003Csummary>Office close-up, man asks about the slides...\u003C\u002Fsummary>\n        Close-up on a man in an office. Window light creates soft highlights. He wears a suit, lapel texture visible. Background is blurred desks. He sits in chair, back straight. Face is head-and-shoulders, mouth sharp. He nods slightly while speaking. 
&lt;S&gt;Meeting starts in five.&lt;E&gt; &lt;S&gt;Have you got the slides?&lt;E&gt; &lt;AUDCAP&gt;Office hum, phone ring distant, professional male voice with clear articulation; no music.&lt;ENDAUDCAP&gt;\n      \u003C\u002Fdetails>\n    \u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ff3efd06c-6cb3-42d4-9a9a-b78244d12993 controls preload=\"metadata\" width=\"100%\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\n      \u003Cdetails>\n        \u003Csummary>3D anime recording studio, asking for one more take...\u003C\u002Fsummary>\n        3D anime cartoon style, polished toon-shaded character rendering, soft stylized materials, expressive face and eyes, smooth animation-ready posing, clear mouth shapes for readable lip sync. In a dimly lit recording studio with acoustic foam panels, a woman with curly brown hair sits framed head-and-shoulders. A large condenser microphone stands slightly off-axis to avoid plosives. Soft blue LED strip light outlines the background gear. Her skin shows natural texture under the key light. She holds a lyric sheet steady in her left hand, fingers visible against the white paper. No sudden movements occur. She breathes in slowly, then speaks directly into the mic. Her lips part clearly for each word. The paper remains still in her grip throughout the clip. 
&lt;S&gt;I think we need one more take.&lt;E&gt; &lt;S&gt;The harmony felt rushed.&lt;E&gt; &lt;AUDCAP&gt;Clear female voice with soft reverb; faint hum of ventilation; rustle of paper; no music; close proximity effect on mic; room tone is present.&lt;ENDAUDCAP&gt;\n      \u003C\u002Fdetails>\n    \u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F987043fa-9390-48af-a2b3-7d6480e0be0b controls preload=\"metadata\" width=\"100%\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\n      \u003Cdetails>\n        \u003Csummary>Hand-drawn anime cafe scene, waiting until midnight...\u003C\u002Fsummary>\n        Hand-drawn anime style with clean outlines, stylized proportions, bright illustrated lighting, detailed background art, expressive facial acting, crisp lip movement. Framed in close-up head-and-shoulders, a man with stubble sits in a dimly lit cafe. Neon sign glow reflects in his eyes, casting blue rim light on his profile. A condensation-covered glass sits on the table edge, visible in lower frame. His lips are sharp under the mixed lighting, moving clearly as he talks. His left hand rests on the table edge, fingers visible; no objects pass in front of it. He blinks slowly, then speaks with a slight nod. 
&lt;S&gt;They said the train was delayed.&lt;E&gt; &lt;S&gt;Now we wait until midnight.&lt;E&gt; &lt;AUDCAP&gt;Low cafe murmur, ice clinking in glass, HVAC hum; tired male voice with low resonance.&lt;ENDAUDCAP&gt;\n      \u003C\u002Fdetails>\n    \u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa26173d4-c3fc-48ee-baae-00c231302539 controls preload=\"metadata\" width=\"100%\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\n      \u003Cdetails>\n        \u003Csummary>Clay-court tennis player, retrying a serve...\u003C\u002Fsummary>\n        A tennis player stands on a clay court, framed chest-up in a tight-medium shot, red dust visible on his shirt. Sunlight is bright but diffused by a slight haze, preventing harsh shadows on his face. He holds a racket over his shoulder, the grip tape texture visible and hand position static. The camera remains steady, focusing on his eyes and the sweat on his brow. His mouth is open slightly as he speaks, clearly readable against the blurred net background. He shifts his weight slowly from one foot to the other, avoiding sudden jerks. 
&lt;S&gt;That serve was too wide.&lt;E&gt; &lt;S&gt;Let's try again, same spot.&lt;E&gt; &lt;AUDCAP&gt;Wind blowing across court; distant ball thud; clear male voice with athletic breath; clay court surface noise; natural outdoor sports ambience without crowd noise.&lt;ENDAUDCAP&gt;\n      \u003C\u002Fdetails>\n    \u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9e7996d4-d600-4687-8b76-7e700635fae5 controls preload=\"metadata\" width=\"100%\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\n      \u003Cdetails>\n        \u003Csummary>3D anime room scene, boy playing guitar...\u003C\u002Fsummary>\n        3D anime cartoon style, polished toon-shaded character rendering, soft stylized materials, expressive face and eyes, smooth animation-ready posing, clear mouth shapes for readable lip sync. Tight-medium shot of a boy in a room, chest-up, lit by computer monitor glow. Screen reflection shines in his eyes. He wears a hoodie, drawstrings hanging still. He holds a guitar neck, frets visible under his fingers. Mouth is clear in the blueish light. Right hand strums slowly; left hand holds chord. Posters on wall blurred behind. No head banging, stable posture. &lt;S&gt;I finally learned the chorus.&lt;E&gt; &lt;S&gt;It sounds pretty good.&lt;E&gt; &lt;AUDCAP&gt;Young male voice, proud; guitar strum sound; fan hum; no music.&lt;ENDAUDCAP&gt;\n      \u003C\u002Fdetails>\n    \u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9af70b95-2824-4ca4-a2e9-ce315f829de9 controls preload=\"metadata\" width=\"100%\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## 🔧 Installation\n\n### 1. 
Environment Setup\n\nCreate and activate the conda environment, then install Python dependencies:\n\n```bash\nsource tools\u002Fsetup_env.sh\n```\n\n### 2. Download Models\n\nSet the model root `MODEL_DIR=\u002Fpath\u002Fto\u002Fyour\u002Fmodel_dir` in `tools\u002Fdownload_models.sh` before downloading.\n\nFor inference, you only need to download the required text encoder, VAEs, and DiT models:\n\n```bash\nbash tools\u002Fdownload_models.sh inference\n```\n\nFor vanilla training or RL-based training, download the additional models as needed:\n\n```bash\n# Ovi model as real\u002Ffake score function for train.py\nbash tools\u002Fdownload_models.sh train\n\n# Optional reward models for RL-based training\nbash tools\u002Fdownload_models.sh reward\n```\n\n## 🚀 Inference\n\nBefore inference, make sure the `MODEL_DIR` in `scripts\u002Finference.sh` points to the root directory that contains the downloaded models, then run the script:\n\n```bash\nbash scripts\u002Finference.sh\n```\n\nGenerated videos will be saved in `output_folder`.\n\n### Parallel VAE Decoding\n\nFor faster inference on machines with at least two visible CUDA devices, you can overlap diffusion generation and VAE decoding by enabling parallel VAE decode:\n\n```bash\nbash scripts\u002Finference_parallel_vae.sh\n```\n\nThis mode keeps the diffusion model on the main GPU and moves the video\u002Faudio VAEs to a second GPU. It decodes generated video blocks in a background CUDA stream while the diffusion GPU continues generating the next block. By default, the VAE decode device is the next visible CUDA device after the diffusion device.\n\n## 🏋️‍♂️ Training\n\nTraining uses `torchrun` and FSDP. 
Before launching, check the following fields in the config:\n\n- `model_dir`: directory containing Ovi, Wan, MMAudio, and optional reward checkpoints.\n- `data_path`: prompt CSV or LMDB path.\n- `generator_ckpt`: initialization checkpoint for the student.\n- `real_score_ckpt` and `fake_score_ckpt`: teacher and critic initialization checkpoints.\n- `save_ckpt_dir`: output directory for training checkpoints.\n- `sharding_strategy`, `generator_fsdp_wrap_strategy`, `real_score_fsdp_wrap_strategy`, `fake_score_fsdp_wrap_strategy`: distributed training strategy.\n\n### Training Dataset\n\nFor convenience, we’ve open-sourced the training dataset `synthetic_prompts_32k.csv` in our [HuggingFace repo](https:\u002F\u002Fhuggingface.co\u002Ffudan-generative-ai\u002FHallo-Live). You can either download it manually or use the following command to download it directly:\n\n```bash\nhf download fudan-generative-ai\u002FHallo-Live synthetic_prompts_32k.csv --local-dir \"prompts\u002Fdata\"\n```\n\n### Stage 1: Dual-Stream ODE Initialization\n\nFor convenience, we’ve provided a stage 1 ODE initialization checkpoint. You can manually download `hallolive_ode_init.pt` from our [HuggingFace repo](https:\u002F\u002Fhuggingface.co\u002Ffudan-generative-ai\u002FHallo-Live), or download it directly using the command line:\n\n```bash\nhf download fudan-generative-ai\u002FHallo-Live hallolive_ode_init.pt --local-dir \"$MODEL_DIR\u002FHallo-Live\"\n```\n\nWe also provide utilities for generating ODE initialization data and packing it into LMDB, in case you want to perform stage 1 training on your own prompt dataset:\n\n```bash\nbash scripts\u002Fsample_ode_data.sh\n```\n\nThe script performs three steps:\n\n1. Generate ODE trajectories with `hallolive.utils.sample_ode_data`.\n2. Build video-to-prompt mappings with `tools\u002Fcreate_video_mappings.py`.\n3. 
Convert latent `.pt` files into LMDB with `hallolive.utils.create_lmdb`.\n\nAfter the LMDB dataset is created, run the script for ODE initialization training:\n\n```bash\nbash scripts\u002Ftrain_ode_init.sh\n```\n\n### Stage 2: Self-Rollout + Dual-Stream DMD\n\nFirst, modify the `generator_ckpt` path in the config file to point to the checkpoint obtained after completing your ODE initialization training. Then run the script for DMD training:\n\n```bash\nbash scripts\u002Ftrain_dmd_5B.sh\n```\n\nTo perform multi-node training, such as on 16 or 32 GPUs, run the script:\n\n```bash\nbash scripts\u002Ftrain_dmd_5B_multinode.sh\n```\n\nTo reproduce HP-DMD, enable reward guidance in the DMD config:\n\n```yaml\nenable_rl_reward: true\nreward_types: [videoalign, audiobox, sync]\nreward_beta: 2.0\nreward_model_cpu_offload: true\n```\n\nThe paper uses a continued Stage 2 strategy: first train video and audio jointly until the video stream stabilizes, then freeze the video stream and continue audio-only optimization. 
In this repository, audio-only continued training is controlled by:\n\n```yaml\ntrain_audio_stream_only: true\nvideo_loss_weight: 0\naudio_loss_weight: 0.15\n```\n\nSee `configs\u002Fdual_stream_dmd_5B_audio.yaml` for an example.\n\n## 📊 Evaluation\n\nRun batch inference for checkpoint evaluation:\n\n```bash\nbash scripts\u002Finference_eval.sh\n```\n\nThe script generates videos for selected checkpoint steps and writes `video_prompt_mappings.json` for downstream scoring.\n\n## 🙏 Acknowledgements\n\nThis project builds on and benefits from the following open-source projects and research codebases:\n\n- [Ovi](https:\u002F\u002Fgithub.com\u002Fcharacter-ai\u002FOvi) for high-quality joint audio-video generation.\n- [Self-Forcing](https:\u002F\u002Fgithub.com\u002Fguandeh17\u002FSelf-Forcing) for autoregressive self-rollout training and DMD code.\n- [Wan](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2) for video generation components.\n- [MMAudio](https:\u002F\u002Fgithub.com\u002Fhkchengrex\u002FMMAudio) for audio VAE components.\n- [VideoAlign](https:\u002F\u002Fgithub.com\u002FKwaiVGI\u002FVideoAlign), [AudioBox Aesthetics](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiobox-aesthetics) and [SyncNet](https:\u002F\u002Fgithub.com\u002Fjoonson\u002Fsyncnet_python) for reward modeling.\n\n## 📖 Citation\n\nIf you find this repository useful, please cite:\n\n```bibtex\n@article{li2026hallo,\n  title={Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation},\n  author={Li, Chunyu and Li, Jiaye and Mei, Ruiqiao and Xia, Haoyuan and Zhu, Hao and Wang, Jingdong and Zhu, Siyu},\n  journal={arXiv preprint arXiv:2604.23632},\n  year={2026}\n}\n```\n","Hallo-Live is a real-time streaming joint audio-video avatar generation framework. The project uses a causal dual-stream DiT model to generate synchronized avatar video and speech, reaching 20.38 FPS with 0.94 s latency while maintaining accurate lip sync, visual fidelity, and speech quality. Its core techniques include Stage I training with a cross-modal future-expanding block-causal mask, and Stage II training that optimizes the autoregressive self-rollout trajectory with an audio-video KV cache and reward-weighted dual-stream DMD. It suits applications that need high-quality real-time interactive avatars, such as online education, virtual streamers, or remote meetings.",2,"2026-06-11 02:59:57","CREATED_QUERY"]