[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77734":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":22,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},77734,"WavFlow","facebookresearch\u002FWavFlow","facebookresearch","MultiModal Audio Generation in Raw Waveform Space.",null,"Python",151,9,2,1,0,12,118,59,"Other",false,"main",true,[],"2026-06-12 04:01:22","\u003Cdiv align=\"center\">\n\n\u003Ch1 align=\"center\">\u003Cfont color=\"#1877F2\">WavFlow\u003C\u002Ffont>: Audio Generation in Waveform Space\u003C\u002Fh1>\n\n[Feiyan Zhou](https:\u002F\u002Fzhoufeiyn.github.io\u002F)\u003Csup>1,2\u003C\u002Fsup> ·\n[Luyuan Wang](https:\u002F\u002Fwww.luyuan.wang\u002F)\u003Csup>1\u003C\u002Fsup> ·\n[Shoufa Chen](https:\u002F\u002Fwww.shoufachen.com\u002F)\u003Csup>1,*\u003C\u002Fsup> ·\n[Zhe Wang](https:\u002F\u002Fwangzheallen.github.io\u002F)\u003Csup>1\u003C\u002Fsup> ·\n[Zhiheng Liu](https:\u002F\u002Fjohanan528.github.io\u002F)\u003Csup>1\u003C\u002Fsup> ·\n[Yuren Cong](https:\u002F\u002Fyrcong.github.io\u002F)\u003Csup>1\u003C\u002Fsup> ·\n[Xiaohui Zhang](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fxiaohui-zhang-79569539)\u003Csup>1\u003C\u002Fsup> ·\n[Fanny Yang](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Ffanny-yang-035861128)\u003Csup>1\u003C\u002Fsup> ·\n[Belinda Zeng](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fbelindazeng)\u003Csup>1\u003C\u002Fsup>\n\n\u003Csup>1\u003C\u002Fsup> 
Meta AI &nbsp;·&nbsp; \u003Csup>2\u003C\u002Fsup> Northeastern University\n\n[**🌐 Project Page**](https:\u002F\u002Ffacebookresearch.github.io\u002FWavFlow\u002F) &nbsp;·&nbsp; [**📄 arXiv**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18749) &nbsp;·&nbsp; [**🛠 Training Guide**](TRAINING.md)\n\n\u003C\u002Fdiv>\n\n---\n\n## Overview\n\n**WavFlow** introduces a paradigm for generating synchronized, high-fidelity audio from video and text inputs directly in the **raw waveform space**, bypassing latent compression entirely. Through *waveform patchifying* and *amplitude lifting*, WavFlow enables stable flow matching on raw audio via direct *x*-prediction. Evaluation on the VGGSound (VT2A) and AudioCaps (T2A) benchmarks shows that WavFlow delivers performance on par with established latent-based methods, proving that end-to-end waveform generation can match traditional frameworks in acoustic richness, fidelity, and synchronization.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Foverview.png\" width=\"78%\" alt=\"WavFlow overview\"\u002F>\n\u003C\u002Fp>\n\n## Demo\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"50%\" align=\"center\">\n\n**🌳 Forest** *(natural)*\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9828418c-28e8-4c4c-8b93-93d5b8b57358\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\" align=\"center\">\n\n**🐸 Frog** *(animal)*\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F25d1c2ed-7023-4500-bf29-223988cad4a0\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd width=\"50%\" align=\"center\">\n\n**🥁 Drum** *(music)*\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3cd0fdbf-a03a-4a15-a69f-99d0c42122b1\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\" align=\"center\">\n\n**🛹 Skateboard** *(sport)*\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F1c572eff-13b6-48aa-89b9-5f13a6924419\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n> See the **[Project 
Page](https:\u002F\u002Ffacebookresearch.github.io\u002FWavFlow\u002F)** for 24+ samples and side-by-side benchmark comparisons.\n\n## Method\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002FArchitecture.png\" width=\"92%\" alt=\"WavFlow architecture\"\u002F>\n\u003C\u002Fp>\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FWavFlow.git\ncd WavFlow\nbash scripts\u002Fsetup.sh        # creates conda env 'wavflow' and installs everything\nconda activate wavflow\n```\n\n\u003Cdetails>\n\u003Csummary>Manual setup\u003C\u002Fsummary>\n\n```bash\nconda create -n wavflow python=3.10 -y\nconda activate wavflow\npip install -r requirements.txt\npip install -e . --no-deps\nconda install -n wavflow -c conda-forge \"ffmpeg\u003C7\" -y    # for torio video decoding\n```\n\n\u003C\u002Fdetails>\n\n> All required external weights (CLIP, Synchformer, the empty-string CFG embedding) are downloaded or computed automatically on first run and cached under `~\u002F.cache\u002Fwavflow\u002F`.\n\n## Inference\n\n> ⚠️ Due to organizational policy constraints, we are currently unable to release the production-trained checkpoints. We are working on a foundation checkpoint trained on fully open-source data; in the meantime you can train your own — see the [training guide](TRAINING.md).\n\nOnce you have a trained checkpoint, run:\n\n```bash\nbash scripts\u002Flaunch\u002Fpredict.sh [--gpu N] [--config PATH]\n```\n\nThe default config is `wavflow\u002Fconfigs\u002Finfer.yaml`. 
The input CSV (`data.csv_path`) accepts video, text, or both:\n\n```csv\nvideo_path,caption,video_exist,text_exist\n\u002Fabs\u002Fpath\u002Fsample1.mp4,a whistling rocket explodes,1,1   # video + text\n\u002Fabs\u002Fpath\u002Fsample2.mp4,birds chirping in a forest,1,1    # video + text\n,a whistling rocket explodes,0,1                        # text-only\n\u002Fabs\u002Fpath\u002Fsample3.mp4,,1,0                              # video-only\n```\n\n\u003Cdetails>\n\u003Csummary>Configuration reference\u003C\u002Fsummary>\n\n#### Launcher options\n\n| Flag \u002F env | Default | Description |\n|---|---|---|\n| `--gpu N` *(or `GPU=N`)* | `0` | CUDA device index |\n| `--config PATH` *(or `CONFIG_PATH=...`)* | `wavflow\u002Fconfigs\u002Finfer.yaml` | YAML config to load |\n| `WAVFLOW_ENV` | `wavflow` | conda env name to auto-activate |\n\nAny extra positional argument is forwarded to `python -m wavflow.infer`.\n\n#### Key fields in `infer.yaml`\n\n| Field | What to set |\n|---|---|\n| `data.csv_path` | the input CSV (above) |\n| `model.name` | one of `medium_16k`, `medium_44k`, `large_16k`, `large_44k` (must match the trained ckpt) |\n| `model.ckpt_path` | a `checkpoint_*.pth` (full ckpt) or `ema_epoch_*.pth` (EMA-only) |\n| `model.use_ema` | `true` to load `model_ema1` from a full ckpt; `false` to use the live `model` weights |\n| `inference.duration_sec` \u002F `target_sample_rate` | output length and SR (must match model arch) |\n| `inference.cfg`, `num_steps`, `noise_scale`, `noise_shift`, `prediction_type`, `seed` | sampling hyperparameters |\n| `inference.batch_size` | rows per ODE batch |\n| `inference.trim_to_duration` | trim output to `duration_sec` |\n| `output.output_dir` | where wavs are written |\n| `output.loudness_norm`, `loudness_target_lufs` | optional `pyloudnorm` post-processing |\n\n#### CSV semantics\n\n- `video_exist=0` → uses learned empty CLIP\u002FSync tokens (no video decode)\n- `text_exist=0` → uses learned empty CLIP-text token (caption 
ignored)\n- Optional `id` column; otherwise the wav file name is derived from `Path(video_path).stem`, falling back to `row_\u003Cidx>` for text-only rows\n- Captions with commas must be quoted\n\n#### EMA caveat\n\nThe EMA tensor stored as `model_ema1` is updated with `ema_decay = 0.9999` per step. After only a few hundred \u002F thousand steps it still contains random-init values and produces noise during inference. Set `model.use_ema: false` (or pass an `ema_epoch_*.pth` saved after enough steps) when sampling from a short \u002F overfit run.\n\n\u003C\u002Fdetails>\n\n## Training\n\nFor feature extraction and training (single-node and multi-node), see **[TRAINING.md](TRAINING.md)**.\n\n## Citation\n\n```bibtex\n@misc{zhou2026wavflowaudiogenerationwaveform,\n      title={WavFlow: Audio Generation in Waveform Space}, \n      author={Feiyan Zhou and Luyuan Wang and Shoufa Chen and Zhe Wang and Zhiheng Liu and Yuren Cong and Xiaohui Zhang and Fanny Yang and Belinda Zeng},\n      year={2026},\n      eprint={2605.18749},\n      archivePrefix={arXiv},\n      primaryClass={cs.SD},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18749}, \n}\n```\n\n## Acknowledgements\n\nWavFlow builds on the open-source community. We gratefully acknowledge:\n\n- **[MMAudio](https:\u002F\u002Fgithub.com\u002Fhkchengrex\u002FMMAudio)** — multimodal audio generation\n- **[JiT](https:\u002F\u002Fgithub.com\u002FLTH14\u002FJiT)** — Just Image Transformer\n- **[Synchformer](https:\u002F\u002Fgithub.com\u002Fv-iashin\u002FSynchformer)** — audio-visual synchronization\n\n## License\n\nThe majority of WavFlow is licensed under [**CC-BY-NC 4.0**](LICENSE). Portions of the project are vendored from third-party open source projects under their original license terms (MIT, Apache 2.0, CC BY-NC 4.0, and Stability AI Community License). 
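See [`NOTICE.txt`](NOTICE.txt) for the full per-component breakdown and license texts.\n","WavFlow is a project for generating synchronized, high-fidelity audio from video and text inputs, working directly in raw waveform space without latent compression. Through waveform patchifying and amplitude lifting, it achieves stable flow matching based on direct x-prediction. Its core capabilities include multimodal audio generation, and it performs strongly on the VGGSound (VT2A) and AudioCaps (T2A) benchmarks, showing that end-to-end waveform generation can match traditional frameworks in acoustic richness, fidelity, and synchronization. The project suits applications that need high-quality audio generation, such as virtual reality, game development, or multimedia content creation.","2026-06-11 03:55:58","CREATED_QUERY"]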