[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79257":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},79257,"audio-vis-align","demtmeder\u002Faudio-vis-align","demtmeder","Training and evaluation toolkit for audio-visual contrastive representation alignment (CLIP-style, but for audio + video).",null,"Python",223,7139,6,0,149,10,"Other",false,"main",[],"2026-06-12 02:03:50","# Audio-Vis-Align: A Toolkit for Audio-Visual Representation Alignment\n\n> Two encoders, one shared space — for video and audio.\n\n![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue?style=flat-square)\n![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-green?style=flat-square)\n![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpytorch-2.0%2B-ee4c2c?style=flat-square)\n![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2025.04217-b31b1b?style=flat-square)\n\n---\n\n## Overview\n\n**Audio-Vis-Align** (`ava`) is a research toolkit for training and evaluating\ncontrastive audio-visual representations. 
The recipe is intentionally close to\nCLIP: two unimodal encoders are jointly trained with a symmetric InfoNCE loss so\nthat the embeddings of an audio clip and the synchronous video clip land near\neach other in a shared space, while embeddings of unrelated pairs are pushed\napart.\n\nThe toolkit grew out of experiments started in late 2023 for a course project at\nSJTU and has since expanded to cover the full pipeline used in our 2025 paper:\ndata preparation, distributed training with DDP, EMA, cosine schedules,\nretrieval \u002F zero-shot \u002F linear-probe evaluation, and several loss variants\nincluding hard-negative-aware InfoNCE.\n\nThe library aims to be **small** (under ~3k lines), **readable** (no\nabstractions you cannot inline in your head), and **honest** about what works:\nwe ship the exact configs we used for AudioSet pretraining and VGGSound\nfinetuning, with the result tables below reproduced by the included scripts.\n\n---\n\n## Architecture\n\n```\n                +---------------------+         +---------------------+\n   waveform --> |   Audio encoder     |  --->   |   Audio projection  | --+\n                |  (log-mel + Tx)     |         |    (MLP, L2-norm)   |   |\n                +---------------------+         +---------------------+   |\n                                                                          v\n                                                                 +-----------------+\n                                                                 |  shared latent  |\n                                                                 |     space       |\n                                                                 +-----------------+\n                                                                          ^\n                +---------------------+         +---------------------+   |\n   video    --> |  Visual encoder     |  --->   |  Visual projection  | --+\n                | (3D-patch ViT)      |         |   (MLP, 
L2-norm)    |\n                +---------------------+         +---------------------+\n\n                                  symmetric InfoNCE\n                              with temperature 1\u002Ftau\n```\n\nThe two towers are independent at training time and at inference time. A typical\ntraining mini-batch consists of `B` paired clips; each modality is embedded\nindependently, projected with a 2-layer MLP, L2-normalized, and the loss is the\nmean of the audio->video and video->audio cross-entropy on the `B x B`\nsimilarity matrix.\n\n---\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdemtmeder\u002Faudio-vis-align.git\ncd audio-vis-align\npip install -e .\n```\n\nThe `pip install -e .` step will pull in `torch`, `torchaudio`,\n`torchvision`, `transformers`, `einops`, `av` (PyAV), and `webdataset`. If your\ntorch wheel is already installed (e.g. inside an HPC environment) it will be\nrespected.\n\n> **Note:** decoding video on the fly requires a working FFmpeg + PyAV install.\n> On Ubuntu, `apt install ffmpeg libavformat-dev libavcodec-dev` is usually\n> sufficient.\n\n---\n\n## Quick start\n\n### 1. Wrap a folder of clips\n\n```python\nfrom ava.data import AVClipDataset\nfrom torch.utils.data import DataLoader\n\nds = AVClipDataset(files=[\"clip1.mp4\", \"clip2.mp4\"], clip_seconds=4.0)\nloader = DataLoader(ds, batch_size=2)\nbatch = next(iter(loader))\nprint(batch[\"wav\"].shape, batch[\"video\"].shape)\n```\n\n### 2. Encode a batch\n\n```python\nimport torch\nfrom ava.models import AVModel\n\nmodel = AVModel(embed_dim=512).eval()\nwith torch.no_grad():\n    za, zv, scale = model(batch[\"wav\"], batch[\"video\"])\nprint(za.shape, zv.shape, scale.item())\n```\n\n### 3. 
Compute retrieval metrics\n\n```python\nfrom ava.eval import cross_modal_retrieval\nmetrics = cross_modal_retrieval(za, zv)\nprint(metrics[\"a2v\"])  # {'R@1': ..., 'R@5': ..., 'R@10': ..., 'MedR': ...}\n```\n\n---\n\n## Training\n\nThe supported launcher is `torchrun`:\n\n```bash\nNGPUS=4 bash scripts\u002Ftrain.sh configs\u002Faudioset_pretrain.yaml\n```\n\nwhich expands to:\n\n```bash\ntorchrun --standalone --nproc_per_node 4 \\\n  -m ava.cli train --config configs\u002Faudioset_pretrain.yaml\n```\n\nDDP is initialised lazily by `ava.training.distributed.init_distributed`,\nwhich reads `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` from the environment as\npopulated by `torchrun`. Single-GPU runs simply skip the init step.\n\nMixed precision (bf16) is on the roadmap; today the code paths assume fp32 or\nfp16 with the standard `GradScaler`.\n\n---\n\n## Data\n\n| Dataset    | Modality   | Hours    | Used for                  | License      |\n|------------|------------|----------|---------------------------|--------------|\n| AudioSet   | A + V      | ~5800    | large-scale pretraining   | CC-BY 4.0    |\n| VGGSound   | A + V      | ~550     | finetuning & evaluation   | CC-BY 4.0    |\n| ACAV-100M  | A + V      | ~100k    | optional extra pretraining| research-use |\n\nWe ship `scripts\u002Fprepare_audioset.py` which converts a directory of `.mp4` files\ninto webdataset-style `.tar` shards. The training loader (`ava.data.make_loader`)\nthen takes glob-style URLs and streams from those shards. We recommend ~500\nsamples per shard for balanced shuffling.\n\nFor initial debugging we recommend the small-ablation config below, which\nfinishes a full epoch on a single A100 in under 10 minutes.\n\n---\n\n## Evaluation\n\n```bash\nbash scripts\u002Feval_retrieval.sh configs\u002Fbase.yaml checkpoints\u002Fbest.pt\n```\n\nThe script computes A->V and V->A retrieval (R@1\u002F5\u002F10, MedR) on the held-out\nsplit defined in the YAML. 
Zero-shot classification and linear probing are\nexposed via `ava.eval.zero_shot` and `ava.eval.probing` respectively; see\n`docs\u002Fevaluation.md` for the protocol details.\n\n---\n\n## Results\n\nRetrieval on AudioSet-eval (10k held-out clips):\n\n| Model         | A->V R@1 | A->V R@5 | A->V R@10 | V->A R@1 | V->A R@5 | V->A R@10 |\n|---------------|----------|----------|-----------|----------|----------|-----------|\n| AVA-small     | 14.2     | 32.7     |  43.1     | 13.8     | 31.9     |  42.4     |\n| AVA-base      | **19.6** | **41.5** | **52.0**  | **18.9** | **40.8** | **51.3**  |\n\nRetrieval on VGGSound-test (~15k clips):\n\n| Model         | A->V R@1 | A->V R@5 | A->V R@10 | V->A R@1 | V->A R@5 | V->A R@10 |\n|---------------|----------|----------|-----------|----------|----------|-----------|\n| AVA-small     | 21.4     | 44.9     |  56.7     | 20.6     | 44.0     |  55.8     |\n| AVA-base      | **27.8** | **53.6** | **65.4**  | **27.0** | **52.7** | **64.5**  |\n\nNumbers are from a 4xA100 run with the configs in `configs\u002F`. Across seeds,\nR@1 varies by ±0.3 for `AVA-small` and ±0.2 for `AVA-base`.\n\n---\n\n## Configuration\n\nA minimal config (`configs\u002Fbase.yaml`):\n\n```yaml\nmodel:\n  embed_dim: 512        # output dim of both projection heads\n  init_temp: 0.07       # initial temperature tau (logit scale = 1\u002Ftau)\n\ndata:\n  sample_rate: 16000\n  clip_seconds: 4.0\n  n_frames: 8           # 8 frames per ~4 s clip\n  image_size: 224\n\ntrain:\n  epochs: 30\n  batch_size: 64        # per GPU\n  lr: 1.0e-4            # AdamW base LR\n  warmup_steps: 1000    # linear warmup then cosine decay\n  use_ema: true         # track 0.999 EMA of weights\n```\n\nConfigs may inherit via `inherit: \u003Cother.yaml>`; only the deltas need to be\nspecified. See `configs\u002Fvggsound_finetune.yaml` for an example.\n\n---\n\n## Reproducing paper results\n\n```bash\n# 1. 
Build shards once\npython scripts\u002Fprepare_audioset.py --root \u002Fdata\u002Faudioset --out data\u002Faudioset\u002Fshards\u002F%06d.tar\n\n# 2. Pretrain on AudioSet (4xA100, ~36h)\nNGPUS=4 bash scripts\u002Ftrain.sh configs\u002Faudioset_pretrain.yaml\n\n# 3. Finetune on VGGSound (4xA100, ~8h)\nNGPUS=4 bash scripts\u002Ftrain.sh configs\u002Fvggsound_finetune.yaml\n\n# 4. Evaluate\nbash scripts\u002Feval_retrieval.sh configs\u002Fvggsound_finetune.yaml runs\u002Fvggsound\u002Fep019.pt\n```\n\nAll hyperparameters reported in the paper are exactly those committed to the\nconfigs above; there are no hidden flags.\n\n---\n\n## Citation\n\n```bibtex\n@article{lin2025ava,\n  title   = {Audio-Vis-Align: Symmetric Contrastive Alignment of Audio\n             and Visual Representations},\n  author  = {Lin, Yixuan and others},\n  journal = {arXiv preprint arXiv:2025.04217},\n  year    = {2025},\n}\n```\n\n---\n\n## Acknowledgments\n\nThanks to the SJTU MoE Key Lab for compute, and to the maintainers of PyAV,\nwebdataset, and the wider PyTorch ecosystem. This work was partially supported\nby the SJTU Multimodal AI initiative. The framing and several engineering\nchoices were inspired by CLIP (Radford et al., 2021), AudioCLIP\n(Guzhov et al., 2022), and CAV-MAE (Gong et al., 2023).\n\n---\n\n## License\n\nApache License 2.0. See [LICENSE](LICENSE) for details.\n","Audio-Vis-Align is a training and evaluation toolkit for audio-visual contrastive representation alignment. It jointly trains two unimodal encoders so that the embeddings of synchronized audio and video clips lie close together in a shared space, while unsynchronized clips are pushed apart. The project is written in Python, uses PyTorch, and follows a CLIP-like approach to alignment. Its core features include data preparation, distributed training (with DDP support), EMA, and cosine schedules, along with several evaluation modes such as retrieval, zero-shot classification, and linear probing. It also provides several loss variants, including a hard-negative-aware InfoNCE loss. The toolkit is aimed at researchers and developers who study or apply cross-modal learning, particularly in multimedia content analysis and audio-visual understanding.",2,"2026-06-01 03:48:15","CREATED_QUERY"]