[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80116":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":27,"discoverSource":28},80116,"WavCube","yanghaha0908\u002FWavCube","yanghaha0908","Official code for \"WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling\"","",null,"Python",61,7,3,0,2,4,6,2.71,"MIT License",false,"master",[],"2026-06-12 02:03:58","# WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"doc\u002Fwavcube_logo.png\" alt=\"WavCube Logo\" width=\"400\"\u002F>\n\u003C\u002Fp>\n\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-Repo-black?logo=github)](https:\u002F\u002Fgithub.com\u002Fyanghaha0908\u002FWavCube)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%93%84%20ArXiv-Paper-red.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06407)\n[![model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20WavCube-Models-blueviolet)](https:\u002F\u002Fhuggingface.co\u002Fyhaha\u002FWavCube)\n\n\nWavCube is a 128-dim, 50Hz continuous representation that unifies speech understanding,\nreconstruction, and generation within a single space.\nThis is the official code for the paper [WavCube: Unifying Speech Representation for Understanding and Generation via 
Semantic-Acoustic Joint Modeling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.06407) [[abs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06407)].\n\n## ✨ Key Features\n- **Unified Speech Representation** – A single continuous latent space that simultaneously supports speech understanding, reconstruction, and generation.\n- **Semantic-Acoustic Joint Modeling** – Harmonizes high-level semantic structures with low-level acoustic textures.\n- **Compact & Diffusion-Friendly** – Features a compact 128-dimensional bottleneck (8x compression from standard SSL features) enabling easier diffusion modeling.\n\u003C!-- By infusing fine-grained acoustic details into a distilled SSL semantic manifold, -->\n\n\n\n## 🛠️ Installation\n\nWe recommend creating a fresh conda environment for installation. \n### Env Setup\n```bash\nconda create -n WavCube python=3.10 -y\nconda activate WavCube\n```\n\n### Basic Requirements\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyanghaha0908\u002FWavCube.git\ncd WavCube\npip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126\nconda install -c conda-forge sox ffmpeg libsndfile\npip install -e \".[train]\"\n```\n\n## 🚀 Quick Start\n\n### Checkpoint Download\nPre-trained model checkpoints are available. 
Please use the following links to download the checkpoints:\n\n| Representation | Dimension | Sample Rate | Frame Rate |\n|----------------|-----------|-------------|------------|\n| 🤗 [WavCube](https:\u002F\u002Fhuggingface.co\u002Fyhaha\u002FWavCube\u002Ftree\u002Fmain\u002FWavCube) | 128 | 16k Hz | 50 Hz |\n| 🤗 [WavCube-pro](https:\u002F\u002Fhuggingface.co\u002Fyhaha\u002FWavCube\u002Ftree\u002Fmain\u002FWavCube-Pro) | 128 | 16k Hz | 50 Hz |\n\n\n### Extract Representation from Speech\nYou can get continuous representations from raw wav using the following code:\n\n```bash\npython wav_to_feature.py \\\n    --audio 19_198_000000_000002.wav \\\n    --config configs\u002FWavCube-stage2.yaml \\\n    --ckpt WavCube\u002Fcheckpoints\u002Fvocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt \\\n    --output 19_198_000000_000002.pt\n```\n\n### Reconstruct Speech from Representation\n\nYou can reconstruct waveform from representations using the following code:\n\n```bash\npython feature_to_wav.py \\\n    --feature 19_198_000000_000002.pt \\\n    --config configs\u002FWavCube-stage2.yaml \\\n    --ckpt WavCube\u002Fcheckpoints\u002Fvocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt\n```\n\n\u003C!-- ## 💡 Tips\n- For devices that do not support BF16, you can manually disable PyTorch's mixed precision manager.\n- If you encounter any issues or have questions, please feel free to open an issue. 
-->\n\n## 🔧 Training\n\nWavCube employs a **two-stage training** pipeline, all scripts are located in `scripts\u002Ftrain\u002F`.\n\n```bash\n# ----------------- WavCube -----------------\nbash scripts\u002Ftrain\u002Ftrain_WavCube_stage1.sh\nbash scripts\u002Ftrain\u002Ftrain_WavCube_stage2.sh\n\n# --------------- WavCube-Pro ---------------\nbash scripts\u002Ftrain\u002Ftrain_WavCube_pro_stage1.sh\nbash scripts\u002Ftrain\u002Ftrain_WavCube_pro_stage2.sh\n# Note: Update `stage1_ckpt_path` in config to your Stage 1 checkpoint before running.\n```\n\n## 🤝 Additional Resources\n\n### Evaluation Checkpoints\n\nTo make it easier to reproduce our results, we have uploaded supplementary resources to our 🤗 [WavCube](https:\u002F\u002Fhuggingface.co\u002Fyhaha\u002FWavCube\u002Ftree\u002Fmain\u002Fckpts). These include the `wavlm-large` weights and the necessary evaluation checkpoints for computing metrics such as WER, Speaker Similarity, and UTMOS.\n\n```bash\n# For offline testing or if you experience network issues, you can manually copy the checkpoints to your local cache:\ncp -r ckpts\u002Fhub ~\u002F.cache\u002Ftorch\u002F\ncp ckpts\u002Futmos22_strong_step7459_v1.pt ~\u002F.cache\u002Ftorch\u002Fhub\u002Fcheckpoints\u002F \ncp -r ckpts\u002Fs3prl ~\u002F.cache\n```\n\n### Data Preparation\n\n**Small-scale data** — uses `VocosDataModule`. Prepare a filelist of audio paths for training and validation:\n\n```bash\nfind $TRAIN_DATASET_DIR -name \"*.wav\" > filelist.train\nfind $VAL_DATASET_DIR -name \"*.wav\" > filelist.val\n```\n\nEach line is a plain audio path, for example:\n```\n\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F672\u002F122797\u002F672-122797-0026.flac\n\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F672\u002F122797\u002F672-122797-0071.flac\n\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F672\u002F122797\u002F672-122797-0037.flac\n```\n\n**Large-scale data** — uses `VocosEmiliaDataModule`. Two files are required:\n\n1. 
**Filelist** — same format as above for LibriSpeech; for LibriHeavy, each line is a JSON entry, for example:\n```json\n{\"id\": \"medium\u002F968\u002F...\u002Fvoyagesdolittle_55_lofting_64kb_38\", \"start\": 22.32, \"duration\": 19.36, \"channel\": 0, \"recording\": {\"sources\": [{\"source\": \"download\u002Flibrilight\u002Fmedium\u002F968\u002F...\u002Fvoyagesdolittle_55_lofting_64kb.flac\"}], \"sampling_rate\": 16000}, \"type\": \"MonoCut\"}\n```\n\n2. **Index file** (`.idx`) — a byte-offset index for fast random access, generated via:\n```bash\npython data\u002Fgenerate_idx.py\n```\n\nExample data manifest files for both formats are provided in the `data\u002F` directory for reference.\n\n\n## ❤️ Acknowledgements\n\nWe sincerely thank the authors of the following open-source projects, whose excellent work laid the foundation for WavCube: [Semantic-VAE](https:\u002F\u002Fgithub.com\u002FZhikangNiu\u002FSemantic-VAE), [F5-TTS](https:\u002F\u002Fgithub.com\u002Fswivid\u002Ff5-tts), [Vocos](https:\u002F\u002Fgithub.com\u002Fgemelo-ai\u002Fvocos), [MiMo-Audio-Tokenizer](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-Audio-Tokenizer), [s3prl](https:\u002F\u002Fgithub.com\u002Fs3prl\u002Fs3prl).\n\n\n\n## 📝 Citation\n\nIf you find this repo helpful, please cite our work:\n\n```bibtex\n@article{yang2026wavcube,\n  title={WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling},\n  author={Yang, Guanrou and Tan, Tian and Chen, Qian and Niu, Zhikang and Song, Yakun and Ma, Ziyang and Chen, Yushen and Xie, Zeyu and Wang, Tianrui and Yang, Yifan and others},\n  journal={arXiv preprint arXiv:2605.06407},\n  year={2026}\n}\n```\n\n## 📄 License\n\nThe code in this repository is released under the MIT license, see [LICENSE](LICENSE) for details.\n","WavCube 
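is a project that unifies speech understanding, reconstruction, and generation through semantic-acoustic joint modeling. It provides a 128-dimensional, 50 Hz continuous representation space that supports understanding, reconstruction, and generation of speech signals at the same time. Its core features are a unified speech representation, semantic-acoustic joint modeling, and a compact, diffusion-friendly design (8x compression relative to standard SSL features). WavCube suits applications that need efficient speech processing, such as automatic speech recognition and text-to-speech synthesis, and is particularly suited to researchers and developers who want multiple speech-processing capabilities within a single framework.","2026-06-11 03:59:19","CREATED_QUERY"]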