[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80698":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":12,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":13,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":14,"rankGlobal":9,"rankLanguage":9,"license":15,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":9,"pushedAt":9,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":12,"lastSyncTime":23,"discoverSource":24},80698,"wvs-code","rakanWen\u002Fwvs-code","rakanWen","Code for When Vision Speaks for Sound",null,"Python",45,2,0,41.43,"Apache License 2.0",false,"main",true,[],"2026-06-11 04:07:14","\n## 📘 Official Codebase\n\nThis is the official code repository for the paper **[When Vision Speaks for Sound](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.16403)**.\n\nIt provides the code, model release, and evaluation interface for **Thud**, an intervention-driven diagnostic framework for probing whether video-capable multimodal models truly verify audio or rely on visual-semantic shortcuts.\n\n---\n### ⚙️ Environment Setup\n\nInstall the Python dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\nSome system-level dependencies are not included in `requirements.txt`.  \nFor video\u002Faudio processing and DeepSpeed compilation, please also make sure that `ffmpeg`, the CUDA toolkit \u002F `nvcc`, and the required NVIDIA libraries are available in your environment.\n\nWe use **[LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory)** for SFT and DPO training. 
Please install LLaMA-Factory separately following its official instructions, or clone it manually:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory.git\ncd LLaMA-Factory\npip install -e .\npip install -r requirements\u002Fmetrics.txt\n```\n\n\n### 🔧 Training with LLaMA-Factory\n\nTo reproduce or adapt the training process, please first register the corresponding datasets in:\n\n```bash\nLLaMA-Factory\u002Fdata\u002Fdataset_info.json\n```\n\n---\n\n#### SFT Data Format\n\nThe SFT data follows the ShareGPT-style multimodal format. Each example contains a `messages` field, together with the corresponding video and audio paths:\n\n```json\n{\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"\u003Cvideo>\u003Caudio>Is there any noticeable audio delay or temporal manipulation in this clip?\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"The moment a child running with a blanket over their head collides with a pile of toys and falls lines up well with the thud and clatter of plastic toys, so this clip appears synchronized overall.\"\n    }\n  ],\n  \"videos\": [\n    \"\u002Fpath\u002Fto\u002Fvideo.mp4\"\n  ],\n  \"audios\": [\n    \"\u002Fpath\u002Fto\u002Faudio.wav\"\n  ]\n}\n```\n\nThe corresponding entry in `dataset_info.json` can be registered as:\n\n```json\n{\n  \"your_sft_dataset_name\": {\n    \"file_name\": \"your_sft_dataset.json\",\n    \"formatting\": \"sharegpt\",\n    \"columns\": {\n      \"messages\": \"messages\",\n      \"videos\": \"videos\",\n      \"audios\": \"audios\"\n    }\n  }\n}\n```\n\n---\n\n#### DPO Data Format\n\nThe DPO data contains a user prompt, a chosen response, a rejected response, and the corresponding video\u002Faudio paths:\n\n```json\n{\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"\u003Cvideo>\u003Caudio>What visual is displayed as the song's instrumental fades out at the very end of the video?\\nA. 
The artist's name 'ZAK DOWNTOWN'\\nB. A black screen with the word 'MOODY'\\nC. A globe logo with the text 'Downtown Worldwide'\\nD. A red background with lightning\\nPlease provide your answer by stating the letter followed by the full option, with a brief explanation grounded in the audio and visual cues.\"\n    }\n  ],\n  \"chosen\": {\n    \"role\": \"assistant\",\n    \"content\": \"In the video, From 2:44 onwards, the vocal track ends and the instrumental beat slowly fades out into silence; at the same moment in the visual, During this audio fade-out, a white globe logo with the text 'Downtown Worldwide' is displayed on a dark background. Therefore the correct answer is C. A globe logo with the text 'Downtown Worldwide'.\"\n  },\n  \"rejected\": {\n    \"role\": \"assistant\",\n    \"content\": \"In the video, From 2:44 onwards, the vocal track ends and the instrumental beat slowly fades out into silence; at the same moment in the visual, During this audio fade-out, a white globe logo with the text 'Downtown Worldwide' is displayed on a dark background. Based on this, the answer is B. A black screen with the word 'MOODY'.\"\n  },\n  \"videos\": [\n    \"\u002Fpath\u002Fto\u002Fvideo.mp4\"\n  ],\n  \"audios\": [\n    \"\u002Fpath\u002Fto\u002Faudio.wav\"\n  ]\n}\n```\n\nThe corresponding entry in `dataset_info.json` can be registered as:\n\n```json\n{\n  \"your_dpo_dataset_name\": {\n    \"file_name\": \"your_dpo_dataset.json\",\n    \"formatting\": \"sharegpt\",\n    \"ranking\": true,\n    \"columns\": {\n      \"messages\": \"messages\",\n      \"chosen\": \"chosen\",\n      \"rejected\": \"rejected\",\n      \"videos\": \"videos\",\n      \"audios\": \"audios\"\n    }\n  }\n}\n```\n\nPlease modify the dataset names, file paths, and column mappings according to your local setup.\n\n---\n\n#### Training Stages\n\nAfter registering the datasets, SFT and DPO can be launched using the standard LLaMA-Factory training interface. 
The exact command should be adjusted according to your hardware configuration, GPU memory, model size, and distributed training strategy.\n\nOur training consists of two stages:\n\n1. **Supervised Fine-Tuning (SFT)**  \n   We first perform SFT to warm up the model on intervention-derived and audio-visual grounding data.\n\n2. **Direct Preference Optimization (DPO)**  \n   We then apply DPO using preference pairs that encourage audio-verified responses over visually plausible shortcut responses.\n\nFor the detailed hyperparameters used in our experiments, including learning rate, batch size, cutoff length, LoRA settings, DeepSpeed configuration, and training schedule, please refer to **Appendix C** in our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.16403).\n\n---\n\n### 🤗 Model Weights\n\nThe trained model checkpoint is available on Hugging Face:\n\n**[wvs-thud-model](https:\u002F\u002Fhuggingface.co\u002FRakancorle1\u002Fwvs-thud-model)**\n\n---\n\n### 📁 Evaluation Data\n\nThe evaluation datasets and benchmark files used in Thud are currently being organized and will be released soon.\n\n","This project provides the official codebase for the paper \"When Vision Speaks for Sound\", which uses the Thud framework to probe whether multimodal models truly verify audio or rely on visual-semantic shortcuts. Its core functionality is an intervention-driven diagnostic toolkit that supports video and audio processing and integrates LLaMA-Factory for SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) training. The project is written in Python and depends on system-level components such as FFmpeg and CUDA, as well as the DeepSpeed acceleration library. It is suited to research on how multimodal AI models handle audio-visual synchronization, particularly in settings where genuine audio verification must be distinguished from reliance on visual cues alone.","2026-06-11 04:01:41","CREATED_QUERY"]