[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72494":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":16,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},72494,"V-Express","tencent-ailab\u002FV-Express","tencent-ailab","V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images.",null,"Python",2361,296,34,39,0,1,3,59.02,false,"main",[],"2026-06-12 04:01:06","# **_V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation_**\n\n\u003Ca href='https:\u002F\u002Ftenvence.github.io\u002Fp\u002Fv-express\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-green'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02511'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTechnique-Report-red'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002Ftk93\u002FV-Express'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-blue'>\u003C\u002Fa>\n[![GitHub](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftencent-ailab\u002FV-Express?style=social)](https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002F)\n\n---\n\n## Introduction\n\nIn the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent.\nA 
common approach involves leveraging generative models to enhance adapters for controlled generation.\nHowever, control signals can vary in strength, including text, audio, image reference, pose, depth map, etc.\nAmong these, weaker conditions often struggle to be effective due to interference from stronger conditions, posing a challenge in balancing these conditions.\nIn our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as pose and original image.\nHowever, direct training with weak signals often leads to difficulties in convergence.\nTo address this, we propose V-Express, a simple method that balances different control signals through a series of progressive drop operations.\nOur method gradually enables effective control by weak conditions, thereby achieving generation capabilities that simultaneously take into account pose, input image, and audio.\n\n\u003Cimg width=\"1000\" alt=\"global_framework\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002F0236e48a-a95e-4d9b-9c28-6d37e1cc3c5a\">\n\n## Release\n\n- [2024\u002F10\u002F11] 🔥 We release the training code.\n- [2024\u002F06\u002F15] 🔥 We have optimized memory usage, now supporting the generation of longer videos.\n- [2024\u002F06\u002F05] 🔥 We have released the technique report on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02511).\n- [2024\u002F06\u002F03] 🔥 If you are using ComfyUI, you can try [ComfyUI-V-Express](https:\u002F\u002Fgithub.com\u002Ftiankuan93\u002FComfyUI-V-Express).\n- [2024\u002F05\u002F29] 🔥 We have added video post-processing that can effectively mitigate the flicker problem.\n- [2024\u002F05\u002F23] 🔥 We release the code and models.\n\n## Installation\n\n```\n# download the codes\ngit clone https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\n\n# install requirements\ncd V-Express\npip install -r requirements.txt\n\n# download the 
models\ngit lfs install\ngit clone https:\u002F\u002Fhuggingface.co\u002Ftk93\u002FV-Express\nmv V-Express\u002Fmodel_ckpts model_ckpts\nmv V-Express\u002F*.bin model_ckpts\u002Fv-express\n\n# then you can use the scripts\n```\n\n## Download Models\n\nYou can download the models from [here](https:\u002F\u002Fhuggingface.co\u002Ftk93\u002FV-Express). We have included all the required models in the model card. You can also download the models separately from their original repositories.\n\n- [stabilityai\u002Fsd-vae-ft-mse](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fsd-vae-ft-mse).\n- [runwayml\u002Fstable-diffusion-v1-5](https:\u002F\u002Fhuggingface.co\u002Frunwayml\u002Fstable-diffusion-v1-5). Only the model configuration file for unet is needed here.\n- [facebook\u002Fwav2vec2-base-960h](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fwav2vec2-base-960h).\n- [insightface\u002Fbuffalo_l](https:\u002F\u002Fgithub.com\u002Fdeepinsight\u002Finsightface\u002Freleases\u002Fdownload\u002Fv0.7\u002Fbuffalo_l.zip).\n\n## How to Train\n\n### Data preparation\nFor details, please refer to the README of [prepare_dataset](https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Ftree\u002Fmain\u002Fscripts\u002Fprepare_dataset#training-data).\n\n### Training\n- stage_1: run train_stage.sh with `config=.\u002Ftraining_configs\u002Fstage_1.yaml`\n- stage_2: run train_stage.sh with `config=.\u002Ftraining_configs\u002Fstage_2.yaml`\n- stage_3: run train_stage.sh with `config=.\u002Ftraining_configs\u002Fstage_3.yaml`\n\n\n## How to Use\n\n### Important Reminder\n\n${\\color{red}Important! Important!! Important!!!}$\n\nIn the talking-face generation task, when the person in the target video is not the same as the reference character, retargeting the face is a \u003Cspan style=\"color:red\">very important\u003C\u002Fspan> step. 
And choosing a target video that is more similar to the pose of the reference face will be able to get better results. In addition, our model now performs better on English, and other languages have not yet been tested in detail.\n\n### Run the demo (step1, _optional_)\n\nIf you have a target talking video, you can follow the script below to extract the audio and face V-kps sequences from the video. You can also skip this step and run the script in Step 2 directly to try the example we provided.\n\n```shell\npython scripts\u002Fextract_kps_sequence_and_audio.py \\\n    --video_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Fgt.mp4\" \\\n    --kps_sequence_save_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Fkps.pth\" \\\n    --audio_save_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Faud.mp3\"\n```\n\nWe recommend cropping a clear square face image as in the example below and making sure the resolution is no lower than 512x512. The green to red boxes in the image below are the recommended cropping ranges.\n\n\u003Cimg width=\"500\" alt=\"crop_example\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002F7c1d8df4-7267-46c7-a848-5130476467ef\">\n\n### Run the demo (step2, _core_)\n\n**Scenario 1 (A's picture and A's talking video.) (Best Practice)**\n\nIf you have a picture of A and a talking video of A in another scene. Then you should run the following script. Our model is able to generate speaking videos that are consistent with the given video. 
_You can see more examples on our [project page](https:\u002F\u002Ftenvence.github.io\u002Fp\u002Fv-express\u002F)._\n\n```shell\npython inference.py \\\n    --reference_image_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Fref.jpg\" \\\n    --audio_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Faud.mp3\" \\\n    --kps_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Fkps.pth\" \\\n    --output_path \".\u002Foutput\u002Fshort_case\u002Ftalk_AOC_no_retarget.mp4\" \\\n    --retarget_strategy \"no_retarget\" \\\n    --num_inference_steps 25\n```\n\n\u003Ctr>\n    \u003Ctd colspan=\"4\" style=\"text-align:center;\">\n      \u003Cvideo muted=\"\" autoplay=\"autoplay\" loop=\"loop\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002F17dd4103-eaf7-4045-8bc0-e90093deaee8\" style=\"width: 80%; height: auto;\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n\u003C\u002Ftr>\n\n${\\color{red}New!!!}$ We have optimized memory usage, now supporting the generation of longer videos. For a 31-second audio, it requires a peak memory of 7956MiB in a V100 test environment, with a total processing time of 2617.4 seconds. You can try it with the following script.\n\n> [!NOTE]\n> The `.\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Fv_exprss_intro_chattts.mp3` is a long audio clip of about 30 seconds generated using [ChatTTS](https:\u002F\u002Fgithub.com\u002F2noise\u002FChatTTS), where we just need to enter a piece of text. We then use V-Express to generate a portrait video. 
This is probably an interesting pipeline.\n\n```shell\npython inference.py \\\n    --reference_image_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Fref.jpg\" \\\n    --audio_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002Fv_exprss_intro_chattts.mp3\" \\\n    --kps_path \".\u002Ftest_samples\u002Fshort_case\u002FAOC\u002FAOC_raw_kps.pth\" \\\n    --output_path \".\u002Foutput\u002Fshort_case\u002Ftalk_AOC_raw_kps_chattts_no_retarget.mp4\" \\\n    --retarget_strategy \"no_retarget\" \\\n    --num_inference_steps 25 \\\n    --reference_attention_weight 1.0 \\\n    --audio_attention_weight 1.0 \\\n    --save_gpu_memory\n```\n\n\u003Ctr>\n    \u003Ctd colspan=\"4\" style=\"text-align:center;\">\n      \u003Cvideo muted=\"\" autoplay=\"autoplay\" loop=\"loop\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002F3bb4c10b-eb25-4a92-81af-1b1e3269f334\" style=\"width: 40%; height: auto;\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n\u003C\u002Ftr>\n\n**Scenario 2 (A's picture and any talking audio.)**\n\nIf you only have a picture and any talking audio. 
With the following script, our model can generate vivid mouth movements for fixed faces.\n\n```shell\npython inference.py \\\n    --reference_image_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Fref.jpg\" \\\n    --audio_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Faud.mp3\" \\\n    --output_path \".\u002Foutput\u002Fshort_case\u002Ftalk_tys_fix_face.mp4\" \\\n    --retarget_strategy \"fix_face\" \\\n    --num_inference_steps 25\n```\n\n\u003Ctr>\n    \u003Ctd colspan=\"4\" style=\"text-align:center;\">\n      \u003Cvideo muted=\"\" autoplay=\"autoplay\" loop=\"loop\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002Ffe782c16-f341-424d-83ce-89531af2a292\" style=\"width: 40%; height: auto;\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n\u003C\u002Ftr>\n\n**Scenario 3 (A's picture and B's talking video.)**\n\n- With the script below, our model generates vivid mouth movements accompanied by slight facial motion.\n\n```shell\npython inference.py \\\n    --reference_image_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Fref.jpg\" \\\n    --audio_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Faud.mp3\" \\\n    --kps_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Fkps.pth\" \\\n    --output_path \".\u002Foutput\u002Fshort_case\u002Ftalk_tys_offset_retarget.mp4\" \\\n    --retarget_strategy \"offset_retarget\" \\\n    --num_inference_steps 25\n```\n\n\u003Ctr>\n    \u003Ctd colspan=\"4\" style=\"text-align:center;\">\n      \u003Cvideo muted=\"\" autoplay=\"autoplay\" loop=\"loop\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002F4951d06c-579d-499e-994d-14fa7e524713\" style=\"width: 40%; height: auto;\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n\u003C\u002Ftr>\n\n- With the following script, our model generates a video with the same movements as the target video, and the character's lip-synching matches the target audio.\n\n> 
[!NOTE]\n> We have only implemented the very naive retarget strategy so far, which allows us to achieve driving the reference face with different character videos under limited conditions. To get better results, we strongly recommend you to choose a target video that is closer to the reference face. We are also trying to implement a more robust face retargeting strategy, which hopefully can further solve the problem of inconsistency between the reference face and the target face. We also welcome experienced people who can help.\n\n```shell\npython inference.py \\\n    --reference_image_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Fref.jpg\" \\\n    --audio_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Faud.mp3\" \\\n    --kps_path \".\u002Ftest_samples\u002Fshort_case\u002Ftys\u002Fkps.pth\" \\\n    --output_path \".\u002Foutput\u002Fshort_case\u002Ftalk_tys_naive_retarget.mp4\" \\\n    --retarget_strategy \"naive_retarget\" \\\n    --num_inference_steps 25 \\\n    --reference_attention_weight 1.0 \\\n    --audio_attention_weight 1.0\n```\n\n\u003Ctr>\n    \u003Ctd colspan=\"4\" style=\"text-align:center;\">\n      \u003Cvideo muted=\"\" autoplay=\"autoplay\" loop=\"loop\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002Fd555ed02-56eb-44e5-94e5-772edcd3338b\" style=\"width: 40%; height: auto;\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n\u003C\u002Ftr>\n\n### More parameters\n\nFor different types of input condition, such as reference image and target audio, we provide parameters for adjusting the role played by that condition information in the model prediction. We refer to these two parameters as `reference_attention_weight` and `audio_attention_weight`. Different parameters can be applied to achieve different effects using the following script. 
Based on our experiments, we suggest setting `reference_attention_weight` to 0.9-1.0 and `audio_attention_weight` to 1.0-3.0.\n\n```shell\n# The fix_face strategy does not need kps info, so --kps_path is omitted.\npython inference.py \\\n    --reference_image_path \".\u002Ftest_samples\u002Fshort_case\u002F10\u002Fref.jpg\" \\\n    --audio_path \".\u002Ftest_samples\u002Fshort_case\u002F10\u002Faud.mp3\" \\\n    --output_path \".\u002Foutput\u002Fshort_case\u002Ftalk_10_fix_face_with_weight.mp4\" \\\n    --retarget_strategy \"fix_face\" \\\n    --reference_attention_weight 0.95 \\\n    --audio_attention_weight 3.0\n```\n\nThe following video shows the effects produced by different parameter values. You can adjust the parameters according to your needs.\n\n\u003Ctr>\n    \u003Ctd colspan=\"4\" style=\"text-align:center;\">\n      \u003Cvideo muted=\"\" autoplay=\"autoplay\" loop=\"loop\" src=\"https:\u002F\u002Fgithub.com\u002Ftencent-ailab\u002FV-Express\u002Fassets\u002F19601425\u002F2e977b8c-c69b-4815-8565-d4d7c3c349a9\" style=\"width: 100%; height: auto;\">\u003C\u002Fvideo>\n    \u003C\u002Ftd>\n\u003C\u002Ftr>\n\n## Acknowledgements\n\nWe would like to thank the contributors to the [magic-animate](https:\u002F\u002Fgithub.com\u002Fmagic-research\u002Fmagic-animate), [AnimateDiff](https:\u002F\u002Fgithub.com\u002Fguoyww\u002FAnimateDiff), [sd-webui-controlnet](https:\u002F\u002Fgithub.com\u002FMikubill\u002Fsd-webui-controlnet\u002Fdiscussions\u002F1236), and [Moore-AnimateAnyone](https:\u002F\u002Fgithub.com\u002FMooreThreads\u002FMoore-AnimateAnyone) repositories for their open research and exploration.\n\nThe code of V-Express is released for both academic and commercial usage. However, models downloaded from V-Express, whether manually or automatically, are for non-commercial research purposes only. Our released checkpoints are also for research purposes only. 
Users are free to create videos using this tool, but they are obligated to comply with local laws and use it responsibly. The developers will not assume any responsibility for potential misuse by users.\n\n## Citation\n\nIf you find V-Express useful for your research and applications, please cite using this BibTeX:\n\n```bibtex\n@article{wang2024V-Express,\n  title={V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation},\n  author={Wang, Cong and Tian, Kuan and Zhang, Jun and Guan, Yonghang and Luo, Feng and Shen, Fei and Jiang, Zhiwei and Gu, Qing and Han, Xiao and Yang, Wei},\n  journal={arXiv preprint arXiv:2406.02511},\n  year={2024}\n}\n```\n","V-Express is a project for generating talking-head videos from a reference image, an audio clip, and a sequence of V-Kps images. Its core idea is to balance the influence of different control signals through a series of progressive dropout operations, so that weaker control signals (such as audio) can also take effect, enabling video generation that accounts for pose, input image, and audio at the same time. The project is developed in Python and uses a method called conditional dropout to address the difficulty of converging when training with weak signals. V-Express suits applications that need high-quality personalized video content, such as virtual anchors and synthesized instructor avatars in online education.",2,"2026-06-11 03:42:19","high_star"]