[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2273":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":16,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":26,"discoverSource":27},2273,"Omni2Sound","omni2sound\u002FOmni2Sound","omni2sound","Omni2Sound — Your Multimodal Audio Generation Codebase (CVPR 2026 Highlight)",null,"Python",138,3,1,0,2,6,13,50.11,"Other",false,"main",[],"2026-06-12 04:00:14","\u003Ch1 align=\"center\">Omni2Sound — Your Multimodal Audio Generation Codebase 🎵\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\u003Cb>Official repository for \u003Ci>\"Omni2Sound: Towards Unified Video-Text-to-Audio Generation\"\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Fp>\n\n\u003Ch3 align=\"center\">🏆 Accepted at CVPR 2026 (Highlight)\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2601.02731\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2601.02731-red\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fswapforward.github.io\u002FOmni2Sound\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐-Project%20Page-blue\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fswapforward.github.io\u002FOmni2Sound\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🎬-Demo-green\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FDalision\u002FOmni2Sound\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Model-yellow\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDalision\u002FOmni2Sound_Benchmark\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Benchmark-yellow\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fswapforward.github.io\u002FOmni2Sound\u002Fsrc\u002Fomnisound.png\" width=\"90%\">\n\u003C\u002Fp>\n\nOmni2Sound is a **unified framework** for generating temporally aligned and\nsemantically faithful audio from **video**, **text**, or **both**, handling\nthree tasks — **VT2A** (video + text → audio), **V2A** (video → audio), and\n**T2A** (text → audio) — within a single model. Our design principle is to\n_keep the model simple and push performance through data and training_:\ninstead of chasing bespoke architectures, Omni2Sound is built on a plain,\noff-the-shelf DiT backbone, and all of our gains come from a high-quality\ndataset (SoundAtlas) and a three-stage progressive multitask training\nschedule. 
With nothing fancy in the model, this deliberately minimal design\nstill delivers state-of-the-art performance across all three tasks and\nremains robust under challenging scenarios such as off-screen audio synthesis\nand incomplete text inputs.\n\n---\n\n## ✨ Highlights\n\n- 🏆 [**Omni2Sound**](https:\u002F\u002Fhuggingface.co\u002FDalision\u002FOmni2Sound) — unified state-of-the-art across VT2A, V2A, and T2A on the VGGSound-Omni benchmark\n- 📀 [**SoundAtlas**](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDalision\u002FOmni2Sound_Benchmark) — large-scale, high-quality A-V-T aligned audio captions, **surpassing even human-expert annotation quality**\n- 🧪 [**VGGSound-Omni**](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDalision\u002FOmni2Sound_Benchmark) — unified VT2A \u002F V2A \u002F T2A evaluation benchmark with off-screen and incomplete-text robustness tracks\n- 🔓 **Open-source friendly** — a multimodal extension of [stable-audio-tools](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fstable-audio-tools), with both **inference** and **finetune** code released\n\n### 🎬 Out-of-Distribution demos\n\nOmni2Sound generalizes to **stylised \u002F fictional videos** it never saw during\ntraining while preserving tight audio-visual synchronization.\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"33%\">\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F461c1c00-d545-43d6-b44c-cc0ace2b837c\" controls muted>\u003C\u002Fvideo>\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F32784dd5-728b-4620-b58b-16ea9a2a2c65\" controls muted>\u003C\u002Fvideo>\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F24b9a3b4-0aaf-4fa8-a5a9-98cc33525983\" controls muted>\u003C\u002Fvideo>\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Csub>\u003Ci>Gongs and drums + lip-synced 
chorus\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Csub>\u003Ci>Steady guitar + chorus + crackling flames\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Csub>\u003Ci>Melodic vocal with wind chimes in forest\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd width=\"33%\">\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4047e673-c9e4-4b2e-b8c7-42da402177a7\" controls muted>\u003C\u002Fvideo>\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd5f047e7-2b87-46cb-9fbb-c193c882596a\" controls muted>\u003C\u002Fvideo>\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\">\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F29d02eea-5e83-4321-8572-6e8029cd51a1\" controls muted>\u003C\u002Fvideo>\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Csub>\u003Ci>Acoustic guitar synced with teenager's motion\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Csub>\u003Ci>Ice clinking synced with bubble hissing\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Csub>\u003Ci>Creaking intensifying with motion rhythm\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n👉 For the full set of demos (VT2A, V2A, caption-quality comparisons),\nsee the [🎬 project page](https:\u002F\u002Fswapforward.github.io\u002FOmni2Sound\u002F).\n\n---\n\n## 🛠️ Quick Start\n\n### 1. Environment\n\nTested on Python 3.10 + CUDA 12.1. 
Clone the repo, install the torch wheel\nmatching your CUDA **first**, then install the remaining requirements:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fomni2sound\u002FOmni2Sound.git\ncd Omni2Sound\n\n# 1) Install torch first (pick the CUDA wheel matching your driver)\npip install torch==2.1.0 torchaudio==2.1.0 torchvision==0.16.0 \\\n  --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\n\n# 2) Install the rest\npip install -r requirements.txt\n```\n\n### 2. Download model weights\n\nDownload three model folders into `weights\u002F`:\n\n| Model | Source | Target directory |\n|---|---|---|\n| Omni2Sound (ours) | [Dalision\u002FOmni2Sound](https:\u002F\u002Fhuggingface.co\u002FDalision\u002FOmni2Sound) | `weights\u002Fomni2sound\u002F` |\n| DFN5B-CLIP-ViT-H-14-384 | [apple\u002FDFN5B-CLIP-ViT-H-14-384](https:\u002F\u002Fhuggingface.co\u002Fapple\u002FDFN5B-CLIP-ViT-H-14-384) | `weights\u002FDFN5B-CLIP-ViT-H-14-384\u002F` |\n| flan-t5-base | [google\u002Fflan-t5-base](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fflan-t5-base) | `weights\u002Fflan-t5-base\u002F` |\n\nOne-liner with `huggingface-cli`:\n\n```bash\nhuggingface-cli download Dalision\u002FOmni2Sound        --local-dir weights\u002Fomni2sound\nhuggingface-cli download apple\u002FDFN5B-CLIP-ViT-H-14-384 --local-dir weights\u002FDFN5B-CLIP-ViT-H-14-384\nhuggingface-cli download google\u002Fflan-t5-base        --local-dir weights\u002Fflan-t5-base\n```\n\nExpected layout:\n\n```\nweights\u002F\n├── omni2sound\u002F\n│   ├── oob_vae_16k_224410.ckpt\n│   ├── synchformer_state_dict.pth\n│   └── vt2a-24-v55vt35-oa15-mq-td15\u002F\n│       ├── args.yaml\n│       ├── data_config.yaml\n│       ├── model_config.json\n│       └── checkpoints\u002Fmodel.ckpt\n├── DFN5B-CLIP-ViT-H-14-384\u002F\n└── flan-t5-base\u002F\n```\n\n---\n\n## 🎧 Inference\n\nOmni2Sound accepts any of { video, text, video + text } and produces\ntemporally and semantically aligned audio.\n\n### Online — from 
raw mp4\n\nCLIP \u002F Synchformer features are extracted on the fly. The input jsonl does\n**not** need any `feature` field.\n\n```bash\nbash scripts\u002Finfer_online.sh\n```\n\n### Offline — with pre-extracted features\n\nIf you have pre-computed CLIP and Synchformer features per clip, use the\nfaster offline path:\n\n```bash\nbash scripts\u002Finfer.sh\n```\n\nBoth scripts default to the mock dataset under `data\u002Fmock_dataset\u002F`. To run\non your own data, edit `dataset_config=` at the top of the script and point\nit to your jsonl.\n\n---\n\n## 🧬 Finetune on your own data\n\nFinetuning is organised in `scripts\u002Ftrain.sh` as two sequential stages:\n\n```bash\nbash scripts\u002Ftrain.sh 0   # Stage 0: joint VT2A + V2A + T2A finetuning\nbash scripts\u002Ftrain.sh 1   # Stage 1: resume stage 0 with data augmentation\n                          #          (off-screen synthesis + text dropout)\n```\n\n> [!NOTE]\n> Multi-GPU training uses a custom `DistributedVT2ASampler` as the batch sampler.\n> PyTorch Lightning will try to re-wrap it and pass unsupported arguments (`drop_last`, etc.), causing a crash.\n> You need to patch **one function** in your Lightning installation to fix this.\n>\n> Open the file (replace `\u003Cyour_env>` with your conda env name):\n> ```\n> \u003Cconda_root>\u002Fenvs\u002F\u003Cyour_env>\u002Flib\u002Fpython3.10\u002Fsite-packages\u002Fpytorch_lightning\u002Futilities\u002Fdata.py\n> ```\n> Find the function `_dataloader_init_kwargs_resolve_sampler` (around line 230), and insert an **early return** before the `if batch_sampler is not None` line:\n> ```python\n> batch_sampler = getattr(dataloader, \"batch_sampler\")\n> batch_sampler_cls = type(batch_sampler)\n>\n> # --- patch: skip Lightning's batch-sampler re-wrapping ---\n> return {\n>     \"sampler\": None,\n>     \"shuffle\": False,\n>     \"batch_sampler\": batch_sampler,\n>     \"batch_size\": 1,\n>     \"drop_last\": False,\n> }\n>\n> if batch_sampler is not None and 
(batch_sampler_cls is not BatchSampler or is_predicting):\n> ```\n\n- **Stage 0 — Multi-task Interleaved Finetuning.**\n  Jointly optimises VT2A, V2A, and T2A on paired (V, T, A) triplets plus\n  T2A \u002F V2A data, using a shared DiT backbone.\n- **Stage 1 — Decoupled Robustness Finetuning.**\n  Continues from the Stage-0 checkpoint with two **push–pull synergistic\n  augmentations**: *off-screen synthesis* counteracts video bias, while\n  *text dropout* counteracts text bias, keeping cross-modal reliance\n  balanced under asymmetric input scenarios.\n\nBatch feature extraction (CLIP and Synchformer) follows the pipeline of\n[MMAudio](https:\u002F\u002Fgithub.com\u002Fhkchengrex\u002FMMAudio) — please refer to that repo\nfor the extractor setup.\n\n---\n\n## 🧪 Benchmark\n\nTwo benchmarks are released at\n[🤗 Dalision\u002FOmni2Sound_Benchmark](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDalision\u002FOmni2Sound_Benchmark):\n\n- **SoundAtlas** — a large-scale A-V-T triple dataset augmenting AudioSet and\n  VGGSound, produced by an agentic annotation pipeline that combines\n  Vision-to-Language Compression (to mitigate visual hallucinations) with a\n  Junior–Senior Agent Handoff (for **5× cost reduction**), delivering\n  captions that **surpass even human-expert quality**.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fswapforward.github.io\u002FOmni2Sound\u002Fsrc\u002Fpipeline.png\" width=\"90%\">\u003Cbr>\n  \u003Cem>The SoundAtlas agentic annotation pipeline.\u003C\u002Fem>\n\u003C\u002Fp>\n\n- **VGGSound-Omni** — unified VT2A \u002F V2A \u002F T2A evaluation, including\n  off-screen audio and incomplete-text robustness tracks.\n\n---\n\n## 📝 Citation\n\nIf you find this work useful, please cite:\n\n```bibtex\n@article{dai2026omni2sound,\n  title   = {Omni2Sound: Towards Unified Video-Text-to-Audio Generation},\n  author  = {Dai, Yusheng and Chen, Zehua and Jiang, Yuxuan and Gao, Baolong and\n             Ke, Qiuhong and Cai, Jianfei and 
Zhu, Jun},\n  journal = {arXiv preprint arXiv:2601.02731},\n  year    = {2026}\n}\n```\n\n---\n\n## 📧 Contact\n\nIf you have any comments or questions, feel free to contact: yusheng.dai@monash.edu\n\n---\n\n## 🙏 Acknowledgements\n\n- [stable-audio-tools](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fstable-audio-tools) — codebase, DiT backbone and OOB VAE\n- [FreeAudio](https:\u002F\u002Ffreeaudio.github.io\u002FFreeAudio\u002F) — T2A pretrain + Wav VAE\n- [MMAudio](https:\u002F\u002Fgithub.com\u002Fhkchengrex\u002FMMAudio) — feature extraction pipeline\n- [Synchformer](https:\u002F\u002Fgithub.com\u002Fv-iashin\u002FSynchformer) — synchronisation features\n- [AudioX](https:\u002F\u002Fgithub.com\u002FZeyueT\u002FAudioX) — prior work on unified multimodal audio generation\n\n---\n\n## 📄 License\n\nBoth the code and the model weights are released under\n[**CC BY-NC 4.0**](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F)\n(non-commercial use only).\n","Omni2Sound is a unified framework for generating temporally aligned and semantically accurate audio from video, text, or both. It handles three tasks within a single model: video + text to audio (VT2A), video to audio (V2A), and text to audio (T2A). Technically, Omni2Sound is built on an off-the-shelf DiT architecture and improves performance through the high-quality SoundAtlas dataset and a three-stage progressive multitask training schedule, without relying on complex custom architectures. This lets Omni2Sound reach state-of-the-art performance on all three tasks and stay robust in challenging scenarios such as off-screen sound synthesis and incomplete text inputs. The tool suits applications that need high-quality, diverse audio generation, such as film production, game development, or any creative project that combines visual and audio content. Its open-source-friendly design also makes it easy for researchers and developers to pursue further research and extensions.","2026-06-11 02:49:11","CREATED_QUERY"]