[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79194":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},79194,"uni-mm-trainer","bandyah\u002Funi-mm-trainer","bandyah","A small library for training multimodal LLMs combining text, vision, and audio",null,"Python",224,6223,7,0,191,10,"Other",false,"main",[],"2026-06-12 02:03:49","# UniMM-Trainer\n> One training loop for text + vision + audio. Pick two of three.\n\n## Overview\n\nUniMM-Trainer is a small, opinionated library for training multimodal large models that combine **at least two of {text, vision, audio}**. It is the result of repeatedly rebuilding the same training loop across different lab projects and getting tired of forking other people's repos to remove half their assumptions.\n\nThe library makes three things easy and stays out of the way for everything else:\n\n1. **Composing encoders.** Plug a frozen audio encoder (Whisper, HuBERT, BEATs) and\u002For a frozen vision encoder (CLIP, SigLIP, DINOv2) into a language backbone (Llama, Qwen, Mistral) with one config block per modality.\n2. **Training the projection layer** (linear, Q-Former, perceiver-resampler, AnyToText-style adapter) without rebuilding your data pipeline.\n3. 
**Tracking real progress** during long runs — gradient norms, attention entropy, modality-balanced loss, with sensible defaults.\n\nThe library does **not** try to be a foundation model release or a serving framework. If you want to ship something to prod, this is the wrong tool.\n\n## Architecture\n\n```\n            ┌────────────────────────────────────────────────┐\n            │             UniMM-Trainer (orchestration)      │\n            └─────┬────────────────┬──────────────┬──────────┘\n                  │                │              │\n        ┌─────────▼─────┐  ┌───────▼──────┐  ┌────▼────────┐\n        │  Vision enc.  │  │  Audio enc.  │  │   Text LM   │\n        │  (frozen)     │  │  (frozen)    │  │ (LoRA\u002Ffull) │\n        └─────────┬─────┘  └───────┬──────┘  └────┬────────┘\n                  │                │              │\n                  └────►   Projection adapters    ◄───────┘\n                            (trained)\n```\n\nEach modality has its own adapter family. Encoders are loaded with HuggingFace `AutoModel` (or a custom loader for things like CLIP that don't fit the AutoModel API cleanly). 
Apart from the trained adapters, the language model is the only component that receives gradients by default, and even then you will usually want to freeze its base weights and train only the adapters + LoRA.\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fbandyah\u002Funi-mm-trainer.git\ncd uni-mm-trainer\npip install -e \".[train]\"\n```\n\nOptional dependencies:\n\n```bash\npip install -e \".[audio]\"     # adds whisper + hubert dependencies\npip install -e \".[vision]\"    # adds open-clip + siglip dependencies\npip install -e \".[all]\"       # everything\n```\n\n## Quick Start\n\nTrain a vision-language model on a small captioning dataset:\n\n```bash\nunimm train configs\u002Fvl_siglip_qwen.yaml --data \u002Fscratch\u002Fcoco_captions\u002F --out runs\u002Fvl_demo\n```\n\nOr programmatically:\n\n```python\nfrom unimm import Trainer, load_config\n\ncfg = load_config(\"configs\u002Fvl_siglip_qwen.yaml\")\ncfg.data.root = \"\u002Fscratch\u002Fcoco_captions\u002F\"\ntrainer = Trainer.from_config(cfg)\ntrainer.fit()\n```\n\nA config looks like:\n\n```yaml\nbackbone:\n  hf_id: Qwen\u002FQwen2.5-1.5B\n  lora_rank: 16\n\nmodalities:\n  vision:\n    encoder:\n      hf_id: google\u002Fsiglip-base-patch16-224\n      freeze: true\n    adapter:\n      type: qformer\n      n_queries: 32\n      n_layers: 6\n\n  audio: null  # disabled in this run\n\ndata:\n  dataset: webdataset\n  root: \u002Fscratch\u002Fcoco_captions\n  batch_size: 32\n  num_workers: 8\n\ntrain:\n  max_steps: 100000\n  lr: 2.0e-4\n  warmup_steps: 2000\n  save_every: 5000\n```\n\n## Supported pieces\n\n| Slot         | Choices                                                       |\n|--------------|---------------------------------------------------------------|\n| LM backbone  | Llama-2, Llama-3, Qwen2\u002F2.5, Mistral, Gemma                   |\n| Vision enc.  | CLIP ViT-B\u002F16, SigLIP, DINOv2, OpenCLIP                       |\n| Audio enc.   
| Whisper (small\u002Fmedium), HuBERT, WavLM, BEATs                  |\n| Adapter      | linear, MLP, Q-Former, Perceiver-resampler                    |\n| Tuning       | full FT, LoRA, DoRA, frozen-backbone                          |\n\n## Quick Examples\n\nTrain an audio-language model with frozen Whisper + LoRA on Qwen:\n\n```bash\nunimm train configs\u002Fal_whisper_qwen.yaml --data \u002Fscratch\u002Faudiocaps\u002F\n```\n\nThree-modality (V+A+T):\n\n```bash\nunimm train configs\u002Fvat_dual_adapter.yaml --data \u002Fscratch\u002Favs\u002F\n```\n\nEvaluation on standard VL benchmarks:\n\n```bash\nunimm eval --ckpt runs\u002Fvl_demo\u002Flast.ckpt --benchmark vqav2,gqa,textvqa\n```\n\n## Why another trainer?\n\nA few things were missing in the libraries I looked at:\n\n1. **Modality-balanced loss reporting.** When you train V+A+T, you want to see the loss broken down per modality contribution, not a single scalar.\n2. **Lightweight Q-Former \u002F Resampler implementations** that don't import 5,000 lines of BLIP-2 code.\n3. **Sensible defaults for the LR ratio between adapter and LoRA.** Most papers pick this by hand; we have a config-driven default that just works.\n4. **Frozen-encoder feature caching.** When the vision encoder is frozen, you should never re-encode the same images. We cache to disk transparently.\n\n## Benchmarks\n\nWe've used this library to train models on the following tasks. 
Numbers below are not state-of-the-art; they are sanity-check baselines on small budgets.\n\n| Setup                    | Dataset       | Metric        | Score | Train hrs (8 × A100) |\n|--------------------------|---------------|---------------|-------|----------------------|\n| SigLIP + Qwen2.5-1.5B    | COCO captions | CIDEr         | 102.4 | 6                    |\n| Whisper-S + Qwen2.5-1.5B | AudioCaps     | SPIDEr        | 0.41  | 14                   |\n| Dual (V+A) + Qwen2.5-3B  | VATEX         | CIDEr         | 53.1  | 22                   |\n\nFor comparison, the BLIP-2 and SALMONN papers report higher numbers at much higher compute. The point of these baselines is to show that the training loop is correct.\n\n## Repository layout\n\n```\nunimm\u002F\n├── trainer.py            # the Trainer class\n├── config.py             # YAML\u002Fdataclass config loading\n├── modalities\u002F\n│   ├── vision.py\n│   ├── audio.py\n│   └── text.py\n├── adapters\u002F\n│   ├── linear.py\n│   ├── mlp.py\n│   ├── qformer.py\n│   └── resampler.py\n├── data\u002F\n│   ├── webdataset.py\n│   ├── hf_dataset.py\n│   └── collate.py\n├── peft\u002F\n│   └── lora.py\n├── monitor.py            # metrics, attention entropy, grad norms\n└── cli.py\n```\n\n## Citation\n\nNot really citable, but if you find this useful:\n\n```bibtex\n@misc{chen2025unimm,\n  author = {Chen, Yichen},\n  title  = {UniMM-Trainer: A small library for training multimodal LLMs},\n  year   = {2025},\n  url    = {https:\u002F\u002Fgithub.com\u002Fbandyah\u002Funi-mm-trainer}\n}\n```\n\n## License\n\nApache-2.0.\n","UniMM-Trainer is a small library for training multimodal large models that combine text, vision, and audio. Its core features include easy composition of encoders (such as frozen audio or vision encoders with a language model), simplified training of the projection layer, and progress tracking during long runs. The library supports a range of popular pretrained models and manages the integration of different modalities flexibly through configuration files. It suits research settings that need to set up and experiment with multimodal models quickly, but not production deployment. It is written in Python and is easy to install and extend.",2,"2026-06-01 03:48:13","CREATED_QUERY"]