[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72279":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":29,"discoverSource":30},72279,"sam-audio","facebookresearch\u002Fsam-audio","facebookresearch","The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.","",null,"Python",3526,319,28,43,0,2,10,30,6,29.52,"Other",false,"main",[],"2026-06-12 02:03:01","\u003Cdiv align=\"center\">\n\n# SAM-Audio\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2512.18099-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.18099)\n![CI](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsam-audio\u002Factions\u002Fworkflows\u002Fci.yaml\u002Fbadge.svg)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Collection-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Ffacebook\u002Fsam-audio)\n\n![model_image](assets\u002Fsam_audio_main_model.png)\n\n\u003C\u002Fdiv>\n\nSegment Anything Model for Audio [[**Blog**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fsam-audio\u002F)] [[**Paper**](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fsam-audio-segment-anything-in-audio\u002F)] 
[[**Demo**](https:\u002F\u002Faidemos.meta.com\u002Fsegment-anything\u002Feditor\u002Fsegment-audio)]\n\nSAM-Audio is a foundation model for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans.\n\nSAM-Audio and the Judge model crucially rely on [Perception-Encoder Audio-Visual (PE-AV)](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fpe-av-large), which you can read more about [here](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fpushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning\u002F).\n\n## Setup\n\n**Requirements:**\n- Python >= 3.11\n- CUDA-compatible GPU (recommended)\n\nInstall dependencies:\n\n```bash\npip install .\n```\n\n## Usage\n\n⚠️ Before using SAM-Audio, please request access to the checkpoints on the SAM-Audio\nHugging Face [repo](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-large). Once your request is accepted, you\nneed to be authenticated to download the checkpoints. You can do this by following\nthese [steps](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhuggingface_hub\u002Fen\u002Fquick-start#authentication)\n(e.g. 
`hf auth login` after generating an access token).\n\n### Basic Text Prompting\n\n```python\nfrom sam_audio import SAMAudio, SAMAudioProcessor\nimport torchaudio\nimport torch\n\nmodel = SAMAudio.from_pretrained(\"facebook\u002Fsam-audio-large\")\nprocessor = SAMAudioProcessor.from_pretrained(\"facebook\u002Fsam-audio-large\")\nmodel = model.eval().cuda()\n\nfile = \"\u003Caudio file>\" # audio file path or torch tensor\ndescription = \"\u003Cdescription>\"\n\nbatch = processor(\n    audios=[file],\n    descriptions=[description],\n).to(\"cuda\")\n\nwith torch.inference_mode():\n    # NOTE: `predict_spans` and `reranking_candidates` have a large impact on performance.\n    # Setting `predict_spans=True` and `reranking_candidates=8` will give you better results at the cost of\n    # latency and memory. See the \"Span Prediction\" section below for more details.\n    result = model.separate(batch, predict_spans=False, reranking_candidates=1)\n\n# Save separated audio\nsample_rate = processor.audio_sampling_rate\ntorchaudio.save(\"target.wav\", result.target.cpu(), sample_rate)      # The isolated sound\ntorchaudio.save(\"residual.wav\", result.residual.cpu(), sample_rate)  # Everything else\n```\n\n### Prompting Methods\n\nSAM-Audio supports three types of prompts:\n\n1. **Text Prompting**: Describe the sound you want to isolate using natural language. To match training, please use lowercase noun-phrase\u002Fverb-phrase (NP\u002FVP) format for text (for example, use \"thunder\" instead of \"Thunder can be heard in the background\").\n   ```python\n   processor(audios=[audio], descriptions=[\"man speaking\"])\n   ```\n\n2. **Visual Prompting**: Use video frames and masks to isolate sounds associated with visual objects.\n   ```python\n   processor(audios=[video], descriptions=[\"\"], masked_videos=processor.mask_videos([frames], [mask]))\n   ```\n\n3. 
**Span Prompting**: Specify time ranges where the target sound occurs.\n   ```python\n   processor(audios=[audio], descriptions=[\"car honking\"], anchors=[[[\"+\", 6.3, 7.0]]])\n   ```\n\nSee the [examples](examples) directory for more detailed examples.\n\n### Span Prediction (Optional for Text Prompting)\n\nWe also provide support for automatically predicting the spans based on the text description, which is especially helpful for separating non-ambience sound events. You can enable this by adding `predict_spans=True` in your call to `separate`:\n\n```python\nwith torch.inference_mode():\n    outputs = model.separate(batch, predict_spans=True)\n\n# To further improve performance (at the expense of latency), you can add candidate re-ranking\nwith torch.inference_mode():\n    outputs = model.separate(batch, predict_spans=True, reranking_candidates=8)\n```\n\n### Re-Ranking\n\nWe provide the following models to assess the quality of the separated audio:\n\n- [CLAP](https:\u002F\u002Fgithub.com\u002FLAION-AI\u002FCLAP): measures the similarity between the target audio and the text description\n- [Judge](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-judge): measures the overall separation quality across 3 axes: precision, recall, and faithfulness (see the [model card](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-judge#output-format) for more details)\n- [ImageBind](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FImageBind): for visual prompting, measures the ImageBind embedding similarity between the separated audio and the masked input video\n\nWe also support generating multiple candidates: setting `reranking_candidates=\u003Ck>` in your call to `separate` generates `k` candidate audios and selects the best one based on the ranking models listed above.\n\n## Models\n\nThe table below lists each of the models we released along with their overall subjective evaluation scores.\n\n| Model    | General SFX | Speech | Speaker | 
Music | Instr(wild) | Instr(pro) |\n|----------|-------------|--------|---------|-------|-------------|------------|\n| [`sam-audio-small`](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-small) | 3.62        | 3.99   | 3.12    | 4.11  | 3.56        | 4.24       |\n| [`sam-audio-base`](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-base)   | 3.28        | 4.25   | 3.57    | 3.87  | 3.66        | 4.27       |\n| [`sam-audio-large`](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-large) | 3.50        | 4.03   | 3.60    | 4.22  | 3.66        | 4.49       |\n\nWe additionally release another variant (in each size) that performs better specifically on correctness of the target sound as well as on visual prompting:\n- [`sam-audio-small-tv`](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-small-tv)\n- [`sam-audio-base-tv`](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-base-tv)\n- [`sam-audio-large-tv`](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam-audio-large-tv)\n\n## Evaluation\n\nSee the [eval](eval) directory for instructions and scripts to reproduce results from the paper.\n\n## Contributing\n\nSee [contributing](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md) for more information.\n\n## License\n\nThis project is licensed under the SAM License - see the [LICENSE](LICENSE) file for details.\n\n## Citing SAM Audio\n\nIf you use SAM Audio in your research, please use the following BibTeX entry:\n\n```bibtex\n@article{shi2025samaudio,\n    title={SAM Audio: Segment Anything in Audio},\n    author={Bowen Shi and Andros Tjandra and John Hoffman and Helin Wang and Yi-Chiao Wu and Luya Gao and Julius Richter and Matt Le and Apoorv Vyas and Sanyuan Chen and Christoph Feichtenhofer and Piotr Doll{\\'a}r and Wei-Ning Hsu and Ann Lee},\n    year={2025},\n    url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.18099}\n}\n```\n","SAM-Audio 
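is a foundation model for separating specific sounds from complex audio mixtures, with support for text, visual, or temporal prompts. Its core capabilities include precisely separating audio segments based on natural-language descriptions, visual cues from video, or time spans. The model relies on Perception-Encoder Audio-Visual (PE-AV) and can deliver high-quality audio segmentation. It suits applications that need precise audio processing, such as audio editing, speech recognition optimization, and multimedia content analysis. The project is written in Python, and running it on a CUDA-compatible GPU is recommended for best performance.","2026-06-11 03:41:10","high_star"]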