[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72075":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},72075,"MuseTalk","TMElyralab\u002FMuseTalk","TMElyralab","MuseTalk: Real-Time High Quality Lip Synchorization with Latent Space Inpainting","",null,"Python",5958,859,63,156,0,30,75,236,90,39.8,"Other",false,"main",[26,27],"lip-sync","virtualhumans","2026-06-12 02:02:58","# MuseTalk\n\n\u003Cstrong>MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling\u003C\u002Fstrong>\n\nYue Zhang\u003Csup>\\*\u003C\u002Fsup>,\nZhizhou Zhong\u003Csup>\\*\u003C\u002Fsup>,\nMinhao Liu\u003Csup>\\*\u003C\u002Fsup>,\nZhaokang Chen,\nBin Wu\u003Csup>†\u003C\u002Fsup>,\nYubin Zeng, \nChao Zhan,\nJunxin Huang,\nYingjie He,\nWenjiang Zhou\n(\u003Csup>*\u003C\u002Fsup>Equal Contribution, \u003Csup>†\u003C\u002Fsup>Corresponding Author, benbinwu@tencent.com)\n\nLyra Lab, Tencent Music Entertainment\n\n**[github](https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk)**    **[huggingface](https:\u002F\u002Fhuggingface.co\u002FTMElyralab\u002FMuseTalk)**    **[space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTMElyralab\u002FMuseTalk)**    **[Technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10122)**\n\nWe introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). 
MuseTalk can be applied to input videos, e.g., those generated by [MuseV](https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseV), as a complete virtual human solution.\n\n## 🔥 Updates\nWe're excited to unveil MuseTalk 1.5. \nThis version **(1)** is trained with perceptual loss, GAN loss, and sync loss, which significantly boosts its overall performance, and **(2)** adopts a two-stage training strategy and a spatio-temporal data sampling approach to balance visual quality and lip-sync accuracy. \nLearn more details [here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10122).\n**The inference code, training code, and model weights of MuseTalk 1.5 are all available now!** 🚀\n\n# Overview\n`MuseTalk` is a real-time high quality audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which\n\n1. modifies an unseen face according to the input audio, using a face region of `256 x 256`.\n1. supports audio in various languages, such as Chinese, English, and Japanese.\n1. supports real-time inference at 30fps+ on an NVIDIA Tesla V100.\n1. supports modification of the center point of the proposed face region, which **SIGNIFICANTLY** affects generation results. \n1. provides a checkpoint trained on HDTF and a private dataset.\n\n# News\n- [04\u002F05\u002F2025] :mega: We are excited to announce that the training code is now open-sourced! You can now train your own MuseTalk model using our provided training scripts and configurations.\n- [03\u002F28\u002F2025] We are thrilled to announce the release of our 1.5 version. This version is a significant improvement over the 1.0 version, with enhanced clarity, identity consistency, and precise lip-speech synchronization. We have updated the [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10122) with more details.\n- [10\u002F18\u002F2024] We released the [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10122v2). 
The report describes a model superior to the open-source L1 loss version. It includes GAN and perceptual losses for improved clarity, and sync loss for enhanced performance.\n- [04\u002F17\u002F2024] We released a pipeline that uses MuseTalk for real-time inference.\n- [04\u002F16\u002F2024] Released a Gradio [demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTMElyralab\u002FMuseTalk) on HuggingFace Spaces (thanks to the HF team for their community grant).\n- [04\u002F02\u002F2024] Released the MuseTalk project and pretrained models.\n\n\n## Model\n![Model Structure](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F02f4a214-1bdd-4326-983c-e70b478accba)\nMuseTalk was trained in latent space, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings were fused with the image embeddings by cross-attention. \n\nNote that although we use an architecture very similar to that of Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. 
Instead, MuseTalk operates by inpainting in the latent space with a single step.\n\n## Cases\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"33%\">\n\n### Input Video\n---\nhttps:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F37a3a666-7b90-4244-8d3a-058cb0e44107\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F1ce3e850-90ac-4a31-a45f-8dfa4f2960ac\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffa3b13a1-ae26-4d1d-899e-87435f8d22b3\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F15800692-39d1-4f4c-99f2-aef044dc3251\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa843f9c9-136d-4ed4-9303-4a7269787a60\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb\n\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\n\n### MuseTalk 1.0\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fc04f3cd5-9f77-40e9-aafd-61978380d0ef\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2051a388-1cef-4c1d-b2a2-3c1ceee5dc99\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb5f56f71-5cdc-4e2e-a519-454242000d32\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa5843835-04ab-4c31-989f-0995cfc22f34\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3dc7f1d7-8747-4733-bbdd-97874af0c028\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3c78064e-faad-4637-83ae-28452a22b09a\n\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\n\n### MuseTalk 
1.5\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F999a6f5b-61dd-48e1-b902-bb3f9cbc7247\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd26a5c9a-003c-489d-a043-c9a331456e75\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F471290d7-b157-4cf6-8a6d-7e899afa302c\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F1ee77c4c-8c70-4add-b6db-583a12faa7dc\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F370510ea-624c-43b7-bbb0-ab5333e0fcc4\n\n---\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb011ece9-a332-4bc1-b8b7-ef6e383d7bde\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\n# TODO:\n- [x] trained models and inference codes.\n- [x] Huggingface Gradio [demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTMElyralab\u002FMuseTalk).\n- [x] codes for real-time inference.\n- [x] [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10122v2).\n- [x] a better model with updated [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10122).\n- [x] realtime inference code for 1.5 version.\n- [x] training and data preprocessing codes. \n- [ ] **always** welcome to submit issues and PRs to improve this repository! 😊\n\n\n# Getting Started\nWe provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:\n\n## Third party integration\nThanks for the third-party integration, which makes installation and use more convenient for everyone.\nWe also hope you note that we have not verified, maintained, or updated third-party. 
Please refer to this project for specific results.\n\n### [ComfyUI](https:\u002F\u002Fgithub.com\u002Fchaojie\u002FComfyUI-MuseTalk)\n\n## Installation\nTo prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:\n\n### Build environment\nWe recommend Python 3.10 and CUDA 11.7. Set up your environment as follows:\n\n```shell\nconda create -n MuseTalk python==3.10\nconda activate MuseTalk\n```\n\n### Install PyTorch 2.0.1\nChoose one of the following installation methods:\n\n```shell\n# Option 1: Using pip\npip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# Option 2: Using conda\nconda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia\n```\n\n### Install Dependencies\nInstall the remaining required packages:\n\n```shell\npip install -r requirements.txt\n```\n\n### Install MMLab Packages\nInstall the MMLab ecosystem packages:\n\n```bash\npip install --no-cache-dir -U openmim\nmim install mmengine\nmim install \"mmcv==2.0.1\"\nmim install \"mmdet==3.1.0\"\nmim install \"mmpose==1.1.0\"\n```\n\n### Setup FFmpeg\n1. [Download](https:\u002F\u002Fgithub.com\u002FBtbN\u002FFFmpeg-Builds\u002Freleases) the ffmpeg-static package\n\n2. Configure FFmpeg based on your operating system:\n\nFor Linux:\n```bash\nexport FFMPEG_PATH=\u002Fpath\u002Fto\u002Fffmpeg\n# Example:\nexport FFMPEG_PATH=\u002Fmusetalk\u002Fffmpeg-4.4-amd64-static\n```\n\nFor Windows:\nAdd the `ffmpeg-xxx\\bin` directory to your system's PATH environment variable. 
Verify the installation by running `ffmpeg -version` in the command prompt - it should display the ffmpeg version information.\n\n### Download weights\nYou can download weights in two ways:\n\n#### Option 1: Using Download Scripts\nWe provide two scripts for automatic downloading:\n\nFor Linux:\n```bash\nsh .\u002Fdownload_weights.sh\n```\n\nFor Windows:\n```batch\n# Run the script\ndownload_weights.bat\n```\n\n#### Option 2: Manual Download\nYou can also download the weights manually from the following links:\n\n1. Download our trained [weights](https:\u002F\u002Fhuggingface.co\u002FTMElyralab\u002FMuseTalk\u002Ftree\u002Fmain)\n2. Download the weights of other components:\n   - [sd-vae-ft-mse](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fsd-vae-ft-mse\u002Ftree\u002Fmain)\n   - [whisper](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-tiny\u002Ftree\u002Fmain)\n   - [dwpose](https:\u002F\u002Fhuggingface.co\u002Fyzd-v\u002FDWPose\u002Ftree\u002Fmain)\n   - [syncnet](https:\u002F\u002Fhuggingface.co\u002FByteDance\u002FLatentSync\u002Ftree\u002Fmain)\n   - [face-parse-bisent](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F154JgKpzCPW82qINcVieuPH3fZ2e0P812\u002Fview?pli=1)\n   - [resnet18](https:\u002F\u002Fdownload.pytorch.org\u002Fmodels\u002Fresnet18-5c106cde.pth)\n\nFinally, these weights should be organized in `models` as follows:\n```\n.\u002Fmodels\u002F\n├── musetalk\n│   └── musetalk.json\n│   └── pytorch_model.bin\n├── musetalkV15\n│   └── musetalk.json\n│   └── unet.pth\n├── syncnet\n│   └── latentsync_syncnet.pt\n├── dwpose\n│   └── dw-ll_ucoco_384.pth\n├── face-parse-bisent\n│   ├── 79999_iter.pth\n│   └── resnet18-5c106cde.pth\n├── sd-vae\n│   ├── config.json\n│   └── diffusion_pytorch_model.bin\n└── whisper\n    ├── config.json\n    ├── pytorch_model.bin\n    └── preprocessor_config.json\n    \n```\n## Quickstart\n\n### Inference\nWe provide inference scripts for both versions of MuseTalk:\n\n#### Prerequisites\nBefore 
running inference, please ensure ffmpeg is installed and accessible:\n```bash\n# Check ffmpeg installation\nffmpeg -version\n```\nIf ffmpeg is not found, please install it first:\n- Windows: Download from [ffmpeg-static](https:\u002F\u002Fgithub.com\u002FBtbN\u002FFFmpeg-Builds\u002Freleases) and add to PATH\n- Linux: `sudo apt-get install ffmpeg`\n\n#### Normal Inference\n##### Linux Environment\n```bash\n# MuseTalk 1.5 (Recommended)\nsh inference.sh v1.5 normal\n\n# MuseTalk 1.0\nsh inference.sh v1.0 normal\n```\n\n##### Windows Environment\n\nPlease ensure that you set the `ffmpeg_path` to match the actual location of your FFmpeg installation.\n\n```bash\n# MuseTalk 1.5 (Recommended)\npython -m scripts.inference --inference_config configs\\inference\\test.yaml --result_dir results\\test --unet_model_path models\\musetalkV15\\unet.pth --unet_config models\\musetalkV15\\musetalk.json --version v15 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\\bin\n\n# For MuseTalk 1.0, change:\n# - models\\musetalkV15 -> models\\musetalk\n# - unet.pth -> pytorch_model.bin\n# - --version v15 -> --version v1\n```\n\n#### Real-time Inference\n##### Linux Environment\n```bash\n# MuseTalk 1.5 (Recommended)\nsh inference.sh v1.5 realtime\n\n# MuseTalk 1.0\nsh inference.sh v1.0 realtime\n```\n\n##### Windows Environment\n```bash\n# MuseTalk 1.5 (Recommended)\npython -m scripts.realtime_inference --inference_config configs\\inference\\realtime.yaml --result_dir results\\realtime --unet_model_path models\\musetalkV15\\unet.pth --unet_config models\\musetalkV15\\musetalk.json --version v15 --fps 25 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\\bin\n\n# For MuseTalk 1.0, change:\n# - models\\musetalkV15 -> models\\musetalk\n# - unet.pth -> pytorch_model.bin\n# - --version v15 -> --version v1\n```\n\nThe configuration file `configs\u002Finference\u002Ftest.yaml` contains the inference settings, including:\n- `video_path`: Path to the input video, image file, or directory of 
images\n- `audio_path`: Path to the input audio file\n\nNote: For optimal results, we recommend using input videos with 25fps, which is the same fps used during model training. If your video has a lower frame rate, you can use frame interpolation or convert it to 25fps using ffmpeg.\n\nImportant notes for real-time inference:\n1. Set `preparation` to `True` when processing a new avatar\n2. After preparation, the avatar will generate videos using audio clips from `audio_clips`\n3. The generation process can achieve 30fps+ on an NVIDIA Tesla V100\n4. Set `preparation` to `False` for generating more videos with the same avatar\n\nFor faster generation without saving images, you can use:\n```bash\npython -m scripts.realtime_inference --inference_config configs\u002Finference\u002Frealtime.yaml --skip_save_images\n```\n\n## Gradio Demo\nWe provide an intuitive web interface through Gradio for users to easily adjust input parameters. To optimize inference time, users can generate only the **first frame** to fine-tune the best lip-sync parameters, which helps reduce facial artifacts in the final output.\n![para](assets\u002Ffigs\u002Fgradio_2.png)\nFor minimum hardware requirements, we tested the system on a Windows environment using an NVIDIA GeForce RTX 3050 Ti Laptop GPU with 4GB VRAM. In fp16 mode, generating an 8-second video takes approximately 5 minutes. ![speed](assets\u002Ffigs\u002Fgradio.png)\n\nBoth Linux and Windows users can launch the demo using the following command. Please ensure that the `ffmpeg_path` parameter matches your actual FFmpeg installation path:\n\n```bash\n# You can remove --use_float16 for better quality, but it will increase VRAM usage and inference time\npython app.py --use_float16 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\\bin\n```\n\n## Training\n\n### Data Preparation\nTo train MuseTalk, you need to prepare your dataset following these steps:\n\n1. 
**Place your source videos** \n\n   For example, if you're using the HDTF dataset, place all your video files in `.\u002Fdataset\u002FHDTF\u002Fsource`.\n\n2. **Run the preprocessing script**\n   ```bash\n   python -m scripts.preprocess --config .\u002Fconfigs\u002Ftraining\u002Fpreprocess.yaml\n   ```\n   This script will:\n   - Extract frames from videos\n   - Detect and align faces\n   - Generate audio features\n   - Create the necessary data structure for training\n\n### Training Process\nAfter data preprocessing, you can start the training process:\n\n1. **First Stage**\n   ```bash\n   sh train.sh stage1\n   ```\n\n2. **Second Stage**\n   ```bash\n   sh train.sh stage2\n   ```\n\n### Configuration Adjustment\nBefore starting the training, you should adjust the configuration files according to your hardware and requirements:\n\n1. **GPU Configuration** (`configs\u002Ftraining\u002Fgpu.yaml`):\n   - `gpu_ids`: Specify the GPU IDs you want to use (e.g., \"0,1,2,3\")\n   - `num_processes`: Set this to match the number of GPUs you're using\n\n2. **Stage 1 Configuration** (`configs\u002Ftraining\u002Fstage1.yaml`):\n   - `data.train_bs`: Adjust batch size based on your GPU memory (default: 32)\n   - `data.n_sample_frames`: Number of sampled frames per video (default: 1)\n\n3. 
**Stage 2 Configuration** (`configs\u002Ftraining\u002Fstage2.yaml`):\n   - `random_init_unet`: Must be set to `False` to use the model from stage 1\n   - `data.train_bs`: Smaller batch size due to high GPU memory cost (default: 2)\n   - `data.n_sample_frames`: Higher value for temporal consistency (default: 16)\n   - `solver.gradient_accumulation_steps`: Increase to simulate larger batch sizes (default: 8)\n  \n\n### GPU Memory Requirements\nBased on our testing on a machine with 8 NVIDIA H20 GPUs:\n\n#### Stage 1 Memory Usage\n| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation |\n|:----------:|:----------------------:|:--------------:|:--------------:|\n| 8          | 1                      | ~32GB          |                |\n| 16         | 1                      | ~45GB          |                |\n| 32         | 1                      | ~74GB          | ✓              |\n\n#### Stage 2 Memory Usage\n| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation |\n|:----------:|:----------------------:|:--------------:|:--------------:|\n| 1          | 8                      | ~54GB          |                |\n| 2          | 2                      | ~80GB          |                |\n| 2          | 8                      | ~85GB          | ✓              |\n\n\u003Cdetails close>\n## TestCases For 1.0\n\u003Ctable class=\"center\">\n  \u003Ctr style=\"font-weight: bolder;text-align:center;\">\n        \u003Ctd width=\"33%\">Image\u003C\u002Ftd>\n        \u003Ctd width=\"33%\">MuseV\u003C\u002Ftd>\n        \u003Ctd width=\"33%\">+MuseTalk\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cimg src=assets\u002Fdemo\u002Fmusk\u002Fmusk.png width=\"95%\">\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F4a4bb2d1-9d14-4ca9-85c8-7f19c39f712e controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd >\n 
     \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002Fb2a879c2-e23a-4d39-911d-51f0343218e4 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cimg src=assets\u002Fdemo\u002Fyongen\u002Fyongen.jpeg width=\"95%\">\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F57ef9dee-a9fd-4dc8-839b-3fbbbf0ff3f4 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F94d8dcba-1bcd-4b54-9d1d-8b6fc53228f0 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cimg src=assets\u002Fdemo\u002Fsit\u002Fsit.jpeg width=\"95%\">\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F5fbab81b-d3f2-4c75-abb5-14c76e51769e controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002Ff8100f4a-3df8-4151-8de2-291b09269f66 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n   \u003Ctr>\n    \u003Ctd>\n      \u003Cimg src=assets\u002Fdemo\u002Fman\u002Fman.png width=\"95%\">\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002Fa6e7d431-5643-4745-9868-8b423a454153 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F6ccf7bc7-cb48-42de-85bd-076d5ee8a623 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  
\u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cimg src=assets\u002Fdemo\u002Fmonalisa\u002Fmonalisa.png width=\"95%\">\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F1568f604-a34f-4526-a13a-7d282aa2e773 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002Fa40784fc-a885-4c1f-9b7e-8f87b7caf4e0 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cimg src=assets\u002Fdemo\u002Fsun1\u002Fsun.png width=\"95%\">\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F37a3a666-7b90-4244-8d3a-058cb0e44107 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F172f4ff1-d432-45bd-a5a7-a07dec33a26b controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Cimg src=assets\u002Fdemo\u002Fsun2\u002Fsun.png width=\"95%\">\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F37a3a666-7b90-4244-8d3a-058cb0e44107 controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n    \u003Ctd >\n      \u003Cvideo src=https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk\u002Fassets\u002F163980830\u002F85a6873d-a028-4cce-af2b-6c59a1f2971d controls preload>\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable >\n\n#### Use of bbox_shift to have adjustable results(For 1.0)\n:mag_right: We have found that upper-bound of the mask has an important impact on mouth openness. 
Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease mouth openness.\n\nYou can start by running with the default configuration to obtain the adjustable value range, and then re-run the script with a value within this range. \n\nFor example, in the case of `Xinying Sun`, running the default configuration shows that the adjustable value range is [-9, 9]. Then, to decrease the mouth openness, we set the value to `-7`. \n```\npython -m scripts.inference --inference_config configs\u002Finference\u002Ftest.yaml --bbox_shift -7 \n```\n:pushpin: More technical details can be found in [bbox_shift](assets\u002FBBOX_SHIFT.md).\n\n\n#### Combining MuseV and MuseTalk\n\nAs a complete solution for virtual human generation, we suggest first applying [MuseV](https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseV) to generate a video (text-to-video, image-to-video or pose-to-video) by following [these instructions](https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase the frame rate. Then, you can use `MuseTalk` to generate a lip-sync video by following [these instructions](https:\u002F\u002Fgithub.com\u002FTMElyralab\u002FMuseTalk?tab=readme-ov-file#inference).\n\n# Acknowledgement\n1. We thank the authors of open-source components such as [whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper), [dwpose](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FDWPose), [face-alignment](https:\u002F\u002Fgithub.com\u002F1adrianb\u002Fface-alignment), [face-parsing](https:\u002F\u002Fgithub.com\u002Fzllrunning\u002Fface-parsing.PyTorch), [S3FD](https:\u002F\u002Fgithub.com\u002Fyxlijun\u002FS3FD.pytorch) and [LatentSync](https:\u002F\u002Fhuggingface.co\u002FByteDance\u002FLatentSync\u002Ftree\u002Fmain). \n1. 
MuseTalk draws heavily on [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) and [isaacOnline\u002Fwhisper](https:\u002F\u002Fgithub.com\u002FisaacOnline\u002Fwhisper\u002Ftree\u002Fextract-embeddings).\n1. MuseTalk was built on the [HDTF](https:\u002F\u002Fgithub.com\u002FMRzzm\u002FHDTF) dataset.\n\nThanks for open-sourcing!\n\n# Limitations\n- Resolution: Though MuseTalk uses a face region size of 256 x 256, which makes it better than other open-source methods, it has not yet reached the theoretical resolution bound. We will continue to work on this problem.  \nIf you need higher resolution, you could apply super resolution models such as [GFPGAN](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FGFPGAN) in combination with MuseTalk.\n\n- Identity preservation: Some details of the original face are not well preserved, such as mustache, lip shape and color.\n\n- Jitter: Some jitter remains because the current pipeline adopts single-frame generation.\n\n# Citation\n```bib\n@article{musetalk,\n  title={MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling},\n  author={Zhang, Yue and Zhong, Zhizhou and Liu, Minhao and Chen, Zhaokang and Wu, Bin and Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},\n  journal={arxiv},\n  year={2025}\n}\n```\n# Disclaimer\u002FLicense\n1. `code`: The code of MuseTalk is released under the MIT License. There are no restrictions on academic or commercial use.\n1. `model`: The trained models are available for any purpose, even commercially.\n1. `other opensource model`: Other open-source models used, such as `whisper`, `ft-mse-vae`, `dwpose`, `S3FD`, etc., must be used in compliance with their licenses.\n1. The test data were collected from the internet and are available for non-commercial research purposes only.\n1. `AIGC`: This project strives to impact the domain of AI-driven video generation positively. 
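Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.\n","MuseTalk is a real-time, high-quality lip-sync model that modifies facial video, driven by input audio, to achieve accurate lip movement synchronization. The project is written in Python. By training the model in the latent space of ft-mse-vae, it achieves precise control over a 256x256 face region and supports audio input in multiple languages, such as Chinese, English, and Japanese. MuseTalk is trained with perceptual loss, GAN loss, and sync loss, combined with a spatio-temporal data sampling strategy, which improves lip-sync accuracy while maintaining high visual quality, and it reaches real-time inference above 30fps on an NVIDIA Tesla V100. The project suits scenarios that need a high-quality virtual human solution, such as virtual streamers, online education, or entertainment content production.",2,"2026-06-11 03:40:16","high_star"]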