[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72077":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},72077,"LatentSync","bytedance\u002FLatentSync","bytedance","Taming Stable Diffusion for Lip Sync!","https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09262",null,"Python",5759,945,77,211,0,19,37,91,57,39.93,"Apache License 2.0",false,"main",[26,27,28,29,30],"diffusion-models","lipsync","research","video-gen","virtual-avatars","2026-06-12 02:02:58","\u003Ch1 align=\"center\">LatentSync\u003C\u002Fh1>\n\n\u003Cdiv align=\"center\">\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09262)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-Model-yellow)](https:\u002F\u002Fhuggingface.co\u002FByteDance\u002FLatentSync-1.6)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-Space-yellow)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ffffiloni\u002FLatentSync)\n\u003Ca href=\"https:\u002F\u002Freplicate.com\u002Flucataco\u002Flatentsync\">\u003Cimg src=\"https:\u002F\u002Freplicate.com\u002Flucataco\u002Flatentsync\u002Fbadge\" alt=\"Replicate\">\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n## 🔥 Updates\n\n- `2025\u002F06\u002F11`: We released **LatentSync 1.6**, which is trained on 512 
$\\times$ 512 resolution videos to mitigate the blurriness problem. Watch the demo [here](docs\u002Fchangelog_v1.6.md).\n\n- `2025\u002F03\u002F14`: We released **LatentSync 1.5**, which **(1)** improves temporal consistency by adding a temporal layer, **(2)** improves performance on Chinese videos, and **(3)** reduces the VRAM requirement of stage2 training to **20 GB** through a series of optimizations. Learn more details [here](docs\u002Fchangelog_v1.5.md).\n\n## 📖 Introduction\n\nWe present *LatentSync*, an end-to-end lip-sync method based on audio-conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip-sync methods based on pixel-space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations.\n\n## 🏗️ Framework\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"docs\u002Fframework.png\" width=100%>\n\u003C\u002Fp>\n\nLatentSync uses [Whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper) to convert mel-spectrograms into audio embeddings, which are then integrated into the U-Net via cross-attention layers. The reference and masked frames are channel-wise concatenated with the noised latents as the input to the U-Net. In the training process, we use a one-step method to get estimated clean latents from the predicted noise, which are then decoded to obtain the estimated clean frames. 
The TREPA, [LPIPS](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.03924) and [SyncNet](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fpublications\u002F2016\u002FChung16a\u002Fchung16a.pdf) losses are added in the pixel space.\n\n## 🎬 Demo\n\n\u003Ctable class=\"center\">\n  \u003Ctr style=\"font-weight: bolder;text-align:center;\">\n        \u003Ctd width=\"50%\">\u003Cb>Original video\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd width=\"50%\">\u003Cb>Lip-synced video\u003C\u002Fb>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb778e3c3-ba25-455d-bdf3-d89db0aa75f4 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fac791682-1541-4e6a-aa11-edd9427b977e controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6d4f4afd-6547-428d-8484-09dc53a19ecf controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb4723d08-c1d4-4237-8251-09c43eb77a6a controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffb4dc4c1-cc98-43dd-a211-1ff8f843fcfa controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7c6ca513-d068-4aa9-8a82-4dfd9063ac4e controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=300px>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0756acef-2f43-4b66-90ba-6dc1d1216904 
controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd width=300px>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F663ff13d-d716-4a35-8faa-9dcfe955e6a5 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0f7f9845-68b2-4165-bd08-c7bbe01a0e52 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd>\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fc34fe89d-0c09-4de3-8601-3d01229a69e3 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n(Photorealistic videos are filmed by contracted models, and anime videos are from [VASA-1](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fproject\u002Fvasa-1\u002F))\n\n## 📑 Open-source Plan\n\n- [x] Inference code and checkpoints\n- [x] Data processing pipeline\n- [x] Training code\n\n## 🔧 Setting up the Environment\n\nInstall the required packages and download the checkpoints via:\n\n```bash\nsource setup_env.sh\n```\n\nIf the download is successful, the checkpoints should appear as follows:\n\n```\n.\u002Fcheckpoints\u002F\n|-- latentsync_unet.pt\n|-- whisper\n|   `-- tiny.pt\n```\n\nOr you can download `latentsync_unet.pt` and `tiny.pt` manually from our [HuggingFace repo](https:\u002F\u002Fhuggingface.co\u002FByteDance\u002FLatentSync-1.6)\n\n## 🚀 Inference\n\nMinimum VRAM for inference:\n\n- **8 GB** with LatentSync 1.5\n- **18 GB** with LatentSync 1.6\n\nThere are two ways to perform inference:\n\n### 1. Gradio App\n\nRun the Gradio app for inference:\n\n```bash\npython gradio_app.py\n```\n\n### 2. 
Command Line Interface\n\nRun the script for inference:\n\n```bash\n.\u002Finference.sh\n```\n\nYou can try adjusting the following inference parameters to achieve better results:\n\n- `inference_steps` [20-50]: A higher value improves visual quality but slows down generation.\n- `guidance_scale` [1.0-3.0]: A higher value improves lip-sync accuracy but may cause video distortion or jitter.\n\n## 🔄 Data Processing Pipeline\n\nThe complete data processing pipeline includes the following steps:\n\n1. Remove the broken video files.\n2. Resample the video FPS to 25, and resample the audio to 16000 Hz.\n3. Detect scenes via [PySceneDetect](https:\u002F\u002Fgithub.com\u002FBreakthrough\u002FPySceneDetect).\n4. Split each video into 5-10 second segments.\n5. Apply an affine transformation to the faces according to the landmarks detected by [InsightFace](https:\u002F\u002Fgithub.com\u002Fdeepinsight\u002Finsightface), then resize to 256 $\\times$ 256.\n6. Remove videos with a [sync confidence score](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fpublications\u002F2016\u002FChung16a\u002Fchung16a.pdf) lower than 3, and adjust the audio-visual offset to 0.\n7. Calculate the [hyperIQA](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPR_2020\u002Fpapers\u002FSu_Blindly_Assess_Image_Quality_in_the_Wild_Guided_by_a_CVPR_2020_paper.pdf) score, and remove videos with scores lower than 40.\n\nRun the script to execute the data processing pipeline:\n\n```bash\n.\u002Fdata_processing_pipeline.sh\n```\n\nYou should change the parameter `input_dir` in the script to specify the data directory to be processed. The processed videos will be saved in the `high_visual_quality` directory. Each step writes its output to a new directory, so the entire pipeline does not need to be rerun if the process is interrupted by an unexpected error.\n\n## 🏋️‍♂️ Training U-Net\n\nBefore training, you should process the data as described above. 
We released a pretrained SyncNet with 94% accuracy on both the VoxCeleb2 and HDTF datasets for the supervision of U-Net training. You can execute the following command to download this SyncNet checkpoint:\n\n```bash\nhuggingface-cli download ByteDance\u002FLatentSync-1.6 stable_syncnet.pt --local-dir checkpoints\n```\n\nIf all the preparations are complete, you can train the U-Net with the following script:\n\n```bash\n.\u002Ftrain_unet.sh\n```\n\nWe prepared several U-Net configuration files in the ``configs\u002Funet`` directory, each corresponding to a specific training setup:\n\n- `stage1.yaml`: Stage1 training, requires **23 GB** VRAM.\n- `stage2.yaml`: Stage2 training with optimal performance, requires **30 GB** VRAM.\n- `stage2_efficient.yaml`: Efficient Stage2 training, requires **20 GB** VRAM. It may slightly degrade visual quality and temporal consistency compared with `stage2.yaml`, but it suits users with consumer-grade GPUs such as the RTX 3090.\n- `stage1_512.yaml`: Stage1 training on 512 $\\times$ 512 resolution videos, requires **30 GB** VRAM.\n- `stage2_512.yaml`: Stage2 training on 512 $\\times$ 512 resolution videos, requires **55 GB** VRAM.\n\nAlso remember to change the parameters in the U-Net config file to specify the data directory, checkpoint save path, and other training hyperparameters. For convenience, we prepared a script that writes a list of data files. Run the following command:\n\n```bash\npython -m tools.write_fileslist\n```\n\n## 🏋️‍♂️ Training SyncNet\n\nIn case you want to train SyncNet on your own datasets, you can run the following script. The data processing pipeline for SyncNet is the same as for U-Net.\n\n```bash\n.\u002Ftrain_syncnet.sh\n```\n\nAfter `validations_steps` training steps, the loss charts will be saved in `train_output_dir`. They contain both the training and validation loss. 
If you want to customize the architecture of SyncNet for different image resolutions and input frame lengths, please follow the [guide](docs\u002Fsyncnet_arch.md).\n\n## 📊 Evaluation\n\nYou can evaluate the [sync confidence score](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fpublications\u002F2016\u002FChung16a\u002Fchung16a.pdf) of a generated video by running the following script:\n\n```bash\n.\u002Feval\u002Feval_sync_conf.sh\n```\n\nYou can evaluate the accuracy of SyncNet on a dataset by running the following script:\n\n```bash\n.\u002Feval\u002Feval_syncnet_acc.sh\n```\n\nNote that our released SyncNet is trained on data processed through our data processing pipeline, which includes special operations such as affine transformation and audio-visual adjustment. Therefore, before evaluation, the test data must first be processed using the provided pipeline.\n\n## 🙏 Acknowledgement\n\n- Our code is built on [AnimateDiff](https:\u002F\u002Fgithub.com\u002Fguoyww\u002FAnimateDiff). 
\n- Some code is borrowed from [MuseTalk](https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk), [StyleSync](https:\u002F\u002Fgithub.com\u002Fguanjz20\u002FStyleSync), [SyncNet](https:\u002F\u002Fgithub.com\u002Fjoonson\u002Fsyncnet_python), and [Wav2Lip](https:\u002F\u002Fgithub.com\u002FRudrabha\u002FWav2Lip).\n\nThanks for their generous contributions to the open-source community!\n\n## 📖 Citation\n\nIf you find our repo useful for your research, please consider citing our paper:\n\n```bibtex\n@article{li2024latentsync,\n  title={LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision},\n  author={Li, Chunyu and Zhang, Chao and Xu, Weikai and Lin, Jingyu and Xie, Jinghui and Feng, Weiguo and Peng, Bingyue and Chen, Cunjian and Xing, Weiwei},\n  journal={arXiv preprint arXiv:2412.09262},\n  year={2024}\n}\n```\n","LatentSync is an end-to-end lip-sync method based on audio-conditioned latent diffusion models, designed to produce high-quality audio-visual synchronization. The project uses Stable Diffusion to directly model complex audio-visual correlations, and it combines audio embeddings with the cross-attention layers of the U-Net to process the original video and masked frames effectively. Its core features include improved temporal consistency, better performance on Chinese videos, and a lower VRAM requirement for training. It suits scenarios that need high-quality lip sync, such as virtual streamers and film post-production.",2,"2026-06-11 03:40:16","high_star"]