[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72558":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},72558,"MMAudio","hkchengrex\u002FMMAudio","hkchengrex","[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis","https:\u002F\u002Fhkchengrex.com\u002FMMAudio\u002F",null,"Python",2201,259,21,9,0,4,5,29,12,72.64,"MIT License",false,"main",true,[27,28,29,30,31,32],"audio","audio-synthesis","computer-vision","deep-learning","text-to-audio","video-to-audio","2026-06-12 04:01:06","\u003Cdiv align=\"center\">\n\u003Cp align=\"center\">\n  \u003Ch2>MMAudio\u003C\u002Fh2>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15322\">Paper\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fhkchengrex.github.io\u002FMMAudio\">Webpage\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fhkchengrex\u002FMMAudio\u002Ftree\u002Fmain\">Models\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhkchengrex\u002FMMAudio\"> Huggingface Demo\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8?usp=sharing\">Colab Demo\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Freplicate.com\u002Fzsxkib\u002Fmmaudio\">Replicate Demo\u003C\u002Fa>\n\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## [Taming Multimodal Joint 
Training for High-Quality Video-to-Audio Synthesis](https:\u002F\u002Fhkchengrex.github.io\u002FMMAudio)\n\n[Ho Kei Cheng](https:\u002F\u002Fhkchengrex.github.io\u002F), [Masato Ishii](https:\u002F\u002Fscholar.google.co.jp\u002Fcitations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=XCRO260AAAAJ), [Alexander Schwing](https:\u002F\u002Fwww.alexander-schwing.de\u002F), [Yuki Mitsufuji](https:\u002F\u002Fwww.yukimitsufuji.com\u002F)\n\nUniversity of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation\n\nCVPR 2025\n\n## Highlight\n\nMMAudio generates synchronized audio given video and\u002For text inputs.\nOur key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.\nMoreover, a synchronization module aligns the generated audio with the video frames.\n\nCheck out this fun video:\n\n[![Does Your Voice Match Your Face?](https:\u002F\u002Fimg.youtube.com\u002Fvi\u002FSLz3NWLyHxg\u002F0.jpg)](https:\u002F\u002Fyoutu.be\u002FSLz3NWLyHxg)\n\n[[Does Your Voice Match Your Face? 
https:\u002F\u002Fyoutu.be\u002FSLz3NWLyHxg]](https:\u002F\u002Fyoutu.be\u002FSLz3NWLyHxg)\n\n## Results\n\n(All audio from our algorithm MMAudio)\n\nVideos from Sora:\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F82afd192-0cee-48a1-86ca-bd39b8c8f330\n\nVideos from Veo 2:\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F8a11419e-fee2-46e0-9e67-dfb03c48d00e\n\nVideos from MovieGen\u002FHunyuan Video\u002FVGGSound:\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F29230d4e-21c1-4cf8-a221-c28f2af6d0ca\n\nFor more results, visit https:\u002F\u002Fhkchengrex.com\u002FMMAudio\u002Fvideo_main.html.\n\n\n## Installation\n\nWe have only tested this on Ubuntu.\n\n### Prerequisites\n\nWe recommend using a [miniforge](https:\u002F\u002Fgithub.com\u002Fconda-forge\u002Fminiforge) environment.\n\n- Python 3.9+\n- PyTorch **2.5.1+** and corresponding torchvision\u002Ftorchaudio (pick your CUDA version https:\u002F\u002Fpytorch.org\u002F, pip install recommended)\n\u003C!-- - ffmpeg\u003C7 ([this is required by torchaudio](https:\u002F\u002Fpytorch.org\u002Faudio\u002Fmaster\u002Finstallation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg\u003C7'`) -->\n\n**1. Install prerequisite if not yet met:**\n\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118 --upgrade\n```\n\n(Or any other CUDA versions that your GPUs\u002Fdriver support)\n\n\u003C!-- ```\nconda install -c conda-forge 'ffmpeg\u003C7\n```\n(Optional, if you use miniforge and don't already have the appropriate ffmpeg) -->\n\n**2. Clone our repository:**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhkchengrex\u002FMMAudio.git\n```\n\n**3. 
Install with pip (install pytorch first before attempting this!):**\n\n```bash\ncd MMAudio\npip install -e .\n```\n\n(If you encounter the File \"setup.py\" not found error, upgrade your pip with pip install --upgrade pip)\n\n\n**Pretrained models:**\n\nThe models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio\u002Futils\u002Fdownload_utils.py`.\nThe models are also available at https:\u002F\u002Fhuggingface.co\u002Fhkchengrex\u002FMMAudio\u002Ftree\u002Fmain\nSee [MODELS.md](docs\u002FMODELS.md) for more details.\n\n## Demo\n\nBy default, these scripts use the `large_44k_v2` model. \nIn our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode) which should fit in most modern GPUs.\n\n### Command-line interface\n\nWith `demo.py`\n\n```bash\npython demo.py --duration=8 --video=\u003Cpath to video> --prompt \"your prompt\" \n```\n\nThe output (audio in `.flac` format, and video in `.mp4` format) will be saved in `.\u002Foutput`.\nSee the file for more options.\nSimply omit the `--video` option for text-to-audio synthesis.\nThe default output (and training) duration is 8 seconds. Longer\u002Fshorter durations could also work, but a large deviation from the training duration may result in a lower quality.\n\n### Gradio interface\n\nSupports video-to-audio and text-to-audio synthesis.\nYou can also try experimental image-to-audio synthesis which duplicates the input image to a video for processing. This might be interesting to some but it is not something MMAudio has been trained for.\nUse [port forwarding](https:\u002F\u002Funix.stackexchange.com\u002Fquestions\u002F115897\u002Fwhats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) (e.g., `ssh -L 7860:localhost:7860 server`) if necessary. The default port is `7860` which you can specify with `--port`.\n\n```bash\npython gradio_demo.py\n```\n\n### FAQ\n\n1. 
Video processing\n    - Processing higher-resolution videos takes longer due to encoding and decoding (which can take >95% of the processing time!), but it does not improve the quality of results.\n    - The CLIP encoder resizes input frames to 384×384 pixels. \n    - Synchformer resizes the shorter edge to 224 pixels and applies a center crop, focusing only on the central square of each frame.\n2. Frame rates\n    - The CLIP model operates at 8 FPS, while Synchformer works at 25 FPS.\n    - Frame rate conversion happens on-the-fly via the video reader.\n    - For input videos with a frame rate below 25 FPS, frames will be duplicated to match the required rate.\n3. Failure cases\nAs with most models of this type, failures can occur, and the reasons are not always clear. Below are some known failure modes. If you notice a failure mode or believe there’s a bug, feel free to open an issue in the repository.\n4. Performance variations\nWe notice that there can be subtle performance variations in different hardware and software environments. Some of the reasons include using\u002Fnot using `torch.compile`, video reader library\u002Fbackend, inference precision, batch sizes, random seeds, etc. We (will) provide pre-computed results on standard benchmark for reference. Results obtained from this codebase should be similar but might not be exactly the same.\n\n### Known limitations\n\n1. The model sometimes generates unintelligible human speech-like sounds\n2. The model sometimes generates background music (without explicit training, it would not be high quality)\n3. 
The model struggles with unfamiliar concepts, e.g., it can generate \"gunfires\" but not \"RPG firing\".\n\nWe believe all of these three limitations can be addressed with more high-quality training data.\n\n## Training\n\nSee [TRAINING.md](docs\u002FTRAINING.md).\n\n## Evaluation\n\nSee [EVAL.md](docs\u002FEVAL.md).\n\n## Training Datasets\n\nMMAudio was trained on several datasets, including [AudioSet](https:\u002F\u002Fresearch.google.com\u002Faudioset\u002F), [Freesound](https:\u002F\u002Fgithub.com\u002FLAION-AI\u002Faudio-dataset\u002Fblob\u002Fmain\u002Flaion-audio-630k\u002FREADME.md), [VGGSound](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fdata\u002Fvggsound\u002F), [AudioCaps](https:\u002F\u002Faudiocaps.github.io\u002F), and [WavCaps](https:\u002F\u002Fgithub.com\u002FXinhaoMei\u002FWavCaps). These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.\n\n## Update Logs\n\n- 2025-03-09: Uploaded the corrected tsv files. See [TRAINING.md](docs\u002FTRAINING.md).\n- 2025-02-27: Disabled the GradScaler by default to improve training stability. See #49.\n- 2024-12-23: Added training and batch evaluation scripts.\n- 2024-12-14: Removed the `ffmpeg\u003C7` requirement for the demos by replacing `torio.io.StreamingMediaDecoder` with `pyav` for reading frames. The read frames are also cached, so we are not reading the same frames again during reconstruction. This should speed things up and make installation less of a hassle.\n- 2024-12-13: Improved for-loop processing in CLIP\u002FSync feature extraction by introducing a batch size multiplier. We can approximately use 40x batch size for CLIP\u002FSync without using more memory, thereby speeding up processing. 
Removed VAE encoder during inference -- we don't need it.\n- 2024-12-11: Replaced `torio.io.StreamingMediaDecoder` with `pyav` for reading framerate when reconstructing the input video. `torio.io.StreamingMediaDecoder` does not work reliably in huggingface ZeroGPU's environment, and I suspect that it might not work in some other environments as well.\n\n## Citation\n\n```bibtex\n@inproceedings{cheng2025taming,\n  title={{MMAudio}: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},\n  author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},\n  booktitle={CVPR},\n  year={2025}\n}\n```\n\n## Relevant Repositories\n\n- [av-benchmark](https:\u002F\u002Fgithub.com\u002Fhkchengrex\u002Fav-benchmark) for benchmarking results.\n\n## License\n- The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE)\n- The checkpoints are released on Hugging Face under the CC-BY-NC 4.0 license as found at [https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F).\n\n## Disclaimer\n\nWe have no affiliation with and have no knowledge of the party behind the domain \"mmaudio.net\".\n\n## Acknowledgement\n\nMany thanks to:\n- [Make-An-Audio 2](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FMake-An-Audio-2) for the 16kHz BigVGAN pretrained model and the VAE architecture\n- [BigVGAN](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN)\n- [Synchformer](https:\u002F\u002Fgithub.com\u002Fv-iashin\u002FSynchformer) \n- [EDM2](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fedm2) for the magnitude-preserving VAE network architecture\n","MMAudio is a project for high-quality video-to-audio synthesis that generates synchronized audio from video and\u002For text inputs. Its core innovation is multimodal joint training, which supports training on a wide range of audio-visual and audio-text datasets, and a synchronization module ensures that the generated audio aligns with the video frames. It is written in Python and built on the PyTorch deep learning framework. It suits applications that need to turn visual content into matching sound, such as film production and enhanced virtual-reality experiences in multimedia content creation.",2,"2026-06-11 
03:42:34","high_star"]