[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71013":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":22,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":27,"discoverSource":28},71013,"ImageBind","facebookresearch\u002FImageBind","facebookresearch","ImageBind One Embedding Space to Bind Them All",null,"Python",9034,843,95,79,0,2,11,39.78,"Other",false,"main",true,[],"2026-06-12 02:02:46","# ImageBind: One Embedding Space To Bind Them All\n\n**[FAIR, Meta AI](https:\u002F\u002Fai.facebook.com\u002Fresearch\u002F)** \n\nRohit Girdhar*,\nAlaaeldin El-Nouby*,\nZhuang Liu,\nMannat Singh,\nKalyan Vasudev Alwala,\nArmand Joulin,\nIshan Misra*\n\nTo appear at CVPR 2023 (*Highlighted paper*)\n\n[[`Paper`](https:\u002F\u002Ffacebookresearch.github.io\u002FImageBind\u002Fpaper)] [[`Blog`](https:\u002F\u002Fai.facebook.com\u002Fblog\u002Fimagebind-six-modalities-binding-ai\u002F)] [[`Demo`](https:\u002F\u002Fimagebind.metademolab.com\u002F)] [[`Supplementary Video`](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fimagebind\u002Fimagebind_video.mp4)] [[`BibTex`](#citing-imagebind)]\n\nPyTorch implementation and pretrained models for ImageBind. For details, see the paper: **[ImageBind: One Embedding Space To Bind Them All](https:\u002F\u002Ffacebookresearch.github.io\u002FImageBind\u002Fpaper)**.\n\nImageBind learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. 
It enables novel emergent applications ‘out-of-the-box’, including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation.\n\n\n\n![ImageBind](https:\u002F\u002Fuser-images.githubusercontent.com\u002F8495451\u002F236859695-ffa13364-3e39-4d99-a8da-fbfab17f9a6b.gif)\n\n## ImageBind model\n\nEmergent zero-shot classification performance.\n\n\u003Ctable style=\"margin: auto\">\n  \u003Ctr>\n    \u003Cth>Model\u003C\u002Fth>\n    \u003Cth>\u003Cspan style=\"color:blue\">IN1k\u003C\u002Fspan>\u003C\u002Fth>\n    \u003Cth>\u003Cspan style=\"color:purple\">K400\u003C\u002Fspan>\u003C\u002Fth>\n    \u003Cth>\u003Cspan style=\"color:green\">NYU-D\u003C\u002Fspan>\u003C\u002Fth>\n    \u003Cth>\u003Cspan style=\"color:LightBlue\">ESC\u003C\u002Fspan>\u003C\u002Fth>\n    \u003Cth>\u003Cspan style=\"color:orange\">LLVIP\u003C\u002Fspan>\u003C\u002Fth>\n    \u003Cth>\u003Cspan style=\"color:purple\">Ego4D\u003C\u002Fspan>\u003C\u002Fth>\n    \u003Cth>download\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>imagebind_huge\u003C\u002Ftd>\n    \u003Ctd align=\"right\">77.7\u003C\u002Ftd>\n    \u003Ctd align=\"right\">50.0\u003C\u002Ftd>\n    \u003Ctd align=\"right\">54.0\u003C\u002Ftd>\n    \u003Ctd align=\"right\">66.9\u003C\u002Ftd>\n    \u003Ctd align=\"right\">63.4\u003C\u002Ftd>\n    \u003Ctd align=\"right\">25.0\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fimagebind\u002Fimagebind_huge.pth\">checkpoint\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \n\u003C\u002Ftable>\n\n## Usage\n\nInstall PyTorch 2.0+ and the other third-party dependencies.\n\n```shell\nconda create --name imagebind python=3.10 -y\nconda activate imagebind\n\npip install .\n```\n\nWindows users might need to install `soundfile` for reading\u002Fwriting audio files. (Thanks @congyue1977)\n\n```\npip install soundfile\n```\n\n\nExtract and compare features across modalities (e.g. 
Image, Text and Audio).\n\n```python\nfrom imagebind import data\nimport torch\nfrom imagebind.models import imagebind_model\nfrom imagebind.models.imagebind_model import ModalityType\n\ntext_list=[\"A dog.\", \"A car\", \"A bird\"]\nimage_paths=[\".assets\u002Fdog_image.jpg\", \".assets\u002Fcar_image.jpg\", \".assets\u002Fbird_image.jpg\"]\naudio_paths=[\".assets\u002Fdog_audio.wav\", \".assets\u002Fcar_audio.wav\", \".assets\u002Fbird_audio.wav\"]\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n\n# Instantiate model\nmodel = imagebind_model.imagebind_huge(pretrained=True)\nmodel.eval()\nmodel.to(device)\n\n# Load data\ninputs = {\n    ModalityType.TEXT: data.load_and_transform_text(text_list, device),\n    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),\n    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),\n}\n\nwith torch.no_grad():\n    embeddings = model(inputs)\n\nprint(\n    \"Vision x Text: \",\n    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),\n)\nprint(\n    \"Audio x Text: \",\n    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),\n)\nprint(\n    \"Vision x Audio: \",\n    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),\n)\n\n# Expected output:\n#\n# Vision x Text:\n# tensor([[9.9761e-01, 2.3694e-03, 1.8612e-05],\n#         [3.3836e-05, 9.9994e-01, 2.4118e-05],\n#         [4.7997e-05, 1.3496e-02, 9.8646e-01]])\n#\n# Audio x Text:\n# tensor([[1., 0., 0.],\n#         [0., 1., 0.],\n#         [0., 0., 1.]])\n#\n# Vision x Audio:\n# tensor([[0.8070, 0.1088, 0.0842],\n#         [0.1036, 0.7884, 0.1079],\n#         [0.0018, 0.0022, 0.9960]])\n\n```\n\n## Model card\nPlease see the [model card](model_card.md) for details.\n\n## License\n\nImageBind code and model weights are released under the CC-BY-NC 4.0 license. 
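See [LICENSE](LICENSE) for additional details.\n\n## Contributing\n\nSee [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).\n\n## Citing ImageBind\n\nIf you find this repository useful, please consider giving it a star :star: and citing the paper:\n\n```\n@inproceedings{girdhar2023imagebind,\n  title={ImageBind: One Embedding Space To Bind Them All},\n  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang\nand Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},\n  booktitle={CVPR},\n  year={2023}\n}\n```\n","ImageBind is a multimodal embedding model developed by Meta AI that handles six modalities in a single shared embedding space: images, text, audio, depth, thermal imaging, and inertial measurement unit (IMU) data. The project is implemented in PyTorch and provides pretrained models, supporting applications such as cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation. By learning the associations between these different input types, ImageBind achieves strong performance on tasks such as zero-shot classification. It suits research and development scenarios that need to integrate multiple kinds of sensory data for combined analysis or to build new interactive experiences.","2026-06-11 03:35:27","high_star"]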