[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79918":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":13,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":10,"pushedAt":10,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},79918,"SWIM","HumanMLLM\u002FSWIM","HumanMLLM","Official Code for See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding (CVPR 2026)","",null,"Python",95,0,88,1,7,3,false,"main",[],"2026-06-12 02:03:55","\u003Cdiv align=\"center\">\n\n# SWIM: See What I Mean\n\n### Aligning Vision and Language Representations for Video Fine-grained Object Understanding\n\n[**Boyuan Sun**](https:\u002F\u002Fbbbbchan.github.io)\u003Csup>1,2\u003C\u002Fsup> &ensp; [**Bowen Yin**](https:\u002F\u002Fyinbow.github.io\u002F)\u003Csup>1,2\u003C\u002Fsup> &ensp; [**Yuanming Li**](https:\u002F\u002Flyman-smoker.github.io\u002F)\u003Csup>2\u003C\u002Fsup> &ensp; [**Xihan Wei**](https:\u002F\u002Fwww.zhihu.com\u002Fpeople\u002FHannahW)\u003Csup>2\u003C\u002Fsup> &ensp; [**Qibin Hou**](https:\u002F\u002Fhouqb.github.io\u002F)\u003Csup>1&dagger;\u003C\u002Fsup>\n\n\u003Csup>1\u003C\u002Fsup> VCIP, Nankai University &emsp; \u003Csup>2\u003C\u002Fsup> Tongyi Lab, Alibaba Group &emsp; \u003Csup>&dagger;\u003C\u002Fsup> Corresponding author\n\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18018\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-SWIM-red' alt='Paper 
PDF'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002FBBBBCHAN\u002FSWIM-7B'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-blue'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBBBBCHAN\u002FNL-Refer'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-NL--Refer-green'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.18018'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Paper-yellow'>\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n---\n\n## Overview\n\nSWIM enables multimodal large language models to understand **specific objects** in videos at a fine-grained level. Given a video and a **natural language reference** to a target object, SWIM can accurately describe the object's appearance, actions, and temporal dynamics &mdash; while avoiding hallucination about irrelevant objects.\n\n> **Core idea** &mdash; Apply *attention-level supervision* during training so the model learns to attend to the correct visual regions when generating descriptions of a referred object.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002FSWIM_pipeline.png\" width=\"90%\" alt=\"SWIM Pipeline\">\n\u003C\u002Fdiv>\n\n### Highlights\n\n\u003Ctable>\n\u003Ctr>\u003Ctd>1\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBBBBCHAN\u002FNL-Refer\">NL-Refer\u003C\u002Fa> Dataset\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>A natural-language referring dataset built on top of \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDAMO-NLP-SG\u002FVideoRefer-700K\">VideoRefer-700K\u003C\u002Fa>. 
Unlike the original which uses visual prompts (colored masks), NL-Refer replaces them with \u003Cb>natural language descriptions\u003C\u002Fb>, enabling a more practical and scalable referring paradigm.\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\u003Ctd>2\u003C\u002Ftd>\u003Ctd>\u003Cb>Attention Supervision\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>During SFT, the model receives additional loss on attention maps to encourage correct grounding between \u003Ccode>&lt;ins&gt;...&lt;\u002Fins&gt;\u003C\u002Fcode>-tagged entity tokens and the corresponding visual regions.\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\u003Ctd>3\u003C\u002Ftd>\u003Ctd>\u003Cb>Selective Fine-tuning\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>The vision encoder is frozen; only the language model is updated, keeping training efficient.\u003C\u002Ftd>\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n---\n\n## News\n\n- **2026-05** &mdash; Code of [SWIM](https:\u002F\u002Fgithub.com\u002FHumanMLLM\u002FSWIM) is released.\n- **2026-05** &mdash; Paper of [SWIM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18018) is released.\n\n---\n\n## Getting Started\n\n### 1. Installation\n\n```bash\ngit clone git@github.com:HumanMLLM\u002FSWIM.git\ncd SWIM\n\nconda create -n swim python=3.10\nconda activate swim\n\n# Core dependencies\ncd Q-R1\npip install -e .\npip install trl\n\n# Modified transformers (required for attention supervision)\ncd ..\u002Ftransformers\npip install -e .\n\npip install matplotlib huggingface_hub\n```\n\n> **Note** &mdash; SWIM depends on a custom fork of HuggingFace Transformers shipped in `transformers\u002F`. You must install it from this repo, **not** from PyPI.\n\n### 2. Download Model\n\n```bash\nmkdir model_zoo && cd model_zoo\n\n# (Optional) HuggingFace mirror for faster download in China\n# export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n\nhuggingface-cli download --resume-download BBBBCHAN\u002FSWIM-7B --local-dir SWIM-7B\n```\n\n### 3. 
Download Data &mdash; NL-Refer (For Training)\n\nWe introduce **NL-Refer**, a natural-language referring dataset built on top of [VideoRefer-700K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDAMO-NLP-SG\u002FVideoRefer-700K). Unlike the original dataset which uses **visual prompts** (colored masks overlaid on frames) to indicate target objects, NL-Refer replaces them with **natural language referring expressions** &mdash; enabling a more practical paradigm where users simply describe the object in words.\n\nThe dataset is constructed by using GPT-4o to rewrite `\u003Cobjectx>\u003Cregion>` placeholders into concise NL descriptions, with the core referring word tagged as `\u003Cins>...\u003C\u002Fins>` for attention supervision.\n\nDataset and construction scripts: [BBBBCHAN\u002FNL-Refer](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBBBBCHAN\u002FNL-Refer)\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Dataset Structure\u003C\u002Fb>\u003C\u002Fsummary>\n\n```\nNL-Refer\u002F\n├── train\u002F                                        # Training annotations\n│   ├── refined-format-videorefer-detailed-caption-*.json   # NL-Refer-D (~125K, 4 shards)\n│   ├── refined-format-videorefer-qa-0-10k.json             # NL-Refer-Q (~10K)\n│   └── filtered_valid_llava_video_178k_*.json              # LLaVA-Video supplementary\n├── bench\u002F                                        # Evaluation benchmarks\n│   ├── refined-VideoRefer-Bench-D.json           # Description generation (400 samples)\n│   ├── refined-VideoRefer-Bench-Q.json           # Multiple-choice QA (1000 samples)\n│   └── *-synonym.json                            # Synonym-augmented variants\n└── scripts\u002F                                      # Dataset construction pipeline\n    ├── construction\u002F                             # GPT-4o rewriting scripts\n    └── llava_video\u002F                              # LLaVA-Video processing\n```\n\nTraining annotation JSONs are hosted on 
[BBBBCHAN\u002FSWIM_data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBBBBCHAN\u002FSWIM_data) due to their size (~7GB total).\n\n\u003C\u002Fdetails>\n\n```bash\n# Download benchmarks and construction scripts\nhuggingface-cli download --resume-download BBBBCHAN\u002FNL-Refer --repo-type dataset --local-dir NL-Refer\n```\n\n---\n\n## Usage\n\n### Inference\n\nSWIM-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-7B-Instruct) and shares the same inference API.\n\n\u003Cdetails open>\n\u003Csummary>\u003Cb>Quick Start\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nimport torch\nfrom transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor\nfrom qwen_vl_utils import process_vision_info\n\n# Load model (flash_attention_2 recommended for speed and memory)\nmodel = Qwen2_5_VLForConditionalGeneration.from_pretrained(\n    \"BBBBCHAN\u002FSWIM-7B\",\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n    device_map=\"auto\",\n)\nprocessor = AutoProcessor.from_pretrained(\"BBBBCHAN\u002FSWIM-7B\")\n\n# Optionally control visual token budget:\n# processor = AutoProcessor.from_pretrained(\n#     \"BBBBCHAN\u002FSWIM-7B\", min_pixels=256*28*28, max_pixels=1280*28*28\n# )\n\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n                \"type\": \"video\",\n                \"video\": \"file:\u002F\u002F\u002Fpath\u002Fto\u002Fvideo.mp4\",\n                \"max_pixels\": 360 * 420,\n                \"fps\": 1.0,\n            },\n            {\"type\": \"text\", \"text\": \"Describe this video.\"},\n        ],\n    }\n]\n\n# Prepare inputs\ntext = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\nimage_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)\ninputs = processor(\n    text=[text],\n    images=image_inputs,\n    videos=video_inputs,\n    
padding=True,\n    return_tensors=\"pt\",\n    **video_kwargs,\n).to(\"cuda\")\n\n# Generate\ngenerated_ids = model.generate(**inputs, max_new_tokens=128)\ngenerated_ids_trimmed = [\n    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n]\noutput_text = processor.batch_decode(\n    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n)\nprint(output_text)\n```\n\n\u003C\u002Fdetails>\n\n### Training\n\nTraining uses DeepSpeed Zero-3 with 8 GPUs, BF16 mixed precision, and Flash Attention 2.\n\n```bash\ncd Q-R1\u002Fsrc\u002Fopen-r1-multimodal\n\n# Edit run_scripts\u002Frun_sft_videorefer_qwen25vl.sh to set:\n#   --model_name_or_path  (path to Qwen2.5-VL-7B-Instruct)\n#   --image_root          (path to your image\u002Fvideo data root)\n\nbash run_scripts\u002Frun_sft_videorefer_qwen25vl.sh\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Key Training Parameters\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Parameter | Value |\n|:---|:---|\n| Base model | Qwen2.5-VL-7B-Instruct |\n| Batch size | 1 per GPU &times; 4 gradient accumulation steps |\n| Learning rate | 2e-5 (cosine schedule) |\n| Epochs | 1 |\n| Precision | BF16 |\n| Distribution | DeepSpeed Zero-3 offload |\n\nTraining data is configured in `Q-R1\u002Fsrc\u002Fopen-r1-multimodal\u002Fdata_config\u002Fvideorefer.yaml`, combining ~125K [NL-Refer](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBBBBCHAN\u002FNL-Refer) samples with LLaVA-Video data.\n\n\u003C\u002Fdetails>\n\n### Evaluation\n\n#### VideoRefer-Bench\n\n```bash\ncd Q-R1\u002Fsrc\u002Fopen-r1-multimodal\u002Frun_scripts\u002Feval\u002Fvideorefer\n\n# VideoRefer-Bench-Q (multiple-choice QA)\nbash eval_videorefer-bench-q_qwen2_5vl.sh\n\n# VideoRefer-Bench-D (description generation, requires GPT-4o API key)\nbash eval_videorefer-bench-d_qwen2_5vl.sh\n```\n\nFor benchmark data and format details, see the [VideoRefer-Bench 
README](Q-R1\u002Fsrc\u002Fopen-r1-multimodal\u002Frun_scripts\u002Feval\u002Fvideorefer\u002FREADME.md).\n\n#### General Benchmarks\n\nGeneral video understanding benchmarks (MVBench, VideoMME, ActivityNetQA, etc.) are evaluated via [lmms_eval](https:\u002F\u002Fgithub.com\u002Flmms-lab\u002Flmms_eval):\n\n```bash\nMODEL_NAME=\"SWIM-7B\"\nMODEL_PATH=\"BBBBCHAN\u002FSWIM-7B\"\n\naccelerate launch --num_processes 8 --main_process_port 23553 -m lmms_eval \\\n    --model qwen2_5_vl \\\n    --model_args pretrained=$MODEL_PATH,use_flash_attention_2=true \\\n    --tasks mvbench \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix eval \\\n    --output_path .\u002Flogs\u002F$MODEL_NAME\n```\n\nReplace `--tasks mvbench` with `videomme`, `activitynetqa`, etc. for other benchmarks.\n\n---\n\n## Project Structure\n\n```\nSWIM\u002F\n├── Q-R1\u002Fsrc\u002Fopen-r1-multimodal\u002F\n│   ├── src\u002Fopen_r1\u002F\n│   │   ├── sft_videorefer_qwen25vl.py    # Training entry, data loading, collation\n│   │   ├── inference_qwen25vl.py          # Inference & attention visualization\n│   │   ├── calc_attn_mask*.py             # Attention mask analysis tools\n│   │   ├── data_process\u002F\n│   │   │   └── vision_process.py          # Video frame extraction & image processing\n│   │   ├── trainer\u002F                       # Custom trainers (GRPO, OLA-GRPO, vLLM-GRPO)\n│   │   └── utils\u002F                         # Evaluation helpers & callbacks\n│   ├── data_config\u002Fvideorefer.yaml        # Training dataset configuration\n│   ├── run_scripts\u002F\n│   │   ├── run_sft_videorefer_qwen25vl.sh # Training launch script\n│   │   └── eval\u002F                          # Evaluation scripts\n│   └── configs\u002F                           # DeepSpeed \u002F DDP configs\n├── transformers\u002Fsrc\u002Ftransformers\u002Fmodels\u002F\n│   └── qwen2_5_vl\u002Fmodeling_qwen2_5_vl.py # Modified model with attention supervision\n└── vis\u002F                            
       # Visualization utilities\n```\n\n### Core Code Guide\n\nBelow is a map of the key code paths for anyone looking to understand or extend SWIM.\n\n#### Training Pipeline\n\n> `Q-R1\u002Fsrc\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fsft_videorefer_qwen25vl.py`\n\n| What | Location | Description |\n|:---|:---|:---|\n| Dataset class | `LazySupervisedDataset` (L99) | Loads multi-dataset from YAML config with flexible sampling strategies |\n| Data format conversion | `_maybe_apply_format_convert_videorefer` | Converts VideoRefer JSON to conversation format, decodes RLE masks |\n| Instance tag extraction | `extract_ins_with_global_occurrence` (L337) | Extracts `\u003Cins>...\u003C\u002Fins>` tagged entities and their occurrence counts |\n| Batch collation | `collate_fn` (L380) | Tokenizes text, processes vision inputs, and constructs supervision labels |\n| &emsp; Attention labels | &emsp; L496 &ndash; L512 | Creates `attn_labels` marking valid token positions for attention loss |\n| &emsp; Instance labels | &emsp; L515 &ndash; L541 | Creates `ins_contents_labels` mapping entity tokens to instance indices |\n| Trainer init | `SFTTrainer(...)` (L653) | Assembles model, dataset, and collate_fn into the trainer |\n| Training loop | `trainer.train()` (L673) | Launches the training loop with optional checkpoint resume |\n\n#### Loss Computation\n\n> `transformers\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fqwen2_5_vl\u002Fmodeling_qwen2_5_vl.py`\n\n| What | Location | Description |\n|:---|:---|:---|\n| Text loss | `loss_function` (L2041) | Standard cross-entropy on language modeling logits |\n| Attention extraction | `extract_and_fuse_attentions` (L2078) | Extracts attention maps from layers [2, 7, 12, 17, 22, 27] and fuses across heads |\n| Label filtering | `build_label_and_index` (L2177) | Filters out ignored tokens (-100) to get valid supervision indices |\n| Pred-GT pair collection | `collect_pred_gt_pairs_from_fused_attn` (L2113) | Pairs predicted 
attention masks with ground-truth object masks |\n| Mask loss | `compute_bce_loss_from_pairs` (L2212) | Binary cross-entropy between predicted attention and GT masks |\n| **Combined loss** | **L2373** | **`loss = text_loss * 0.05 + loss_mask`** |\n\n#### Video \u002F Image Processing\n\n> `Q-R1\u002Fsrc\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fdata_process\u002Fvision_process.py`\n\n| What | Location | Description |\n|:---|:---|:---|\n| Video loading | `fetch_video` (L459) | Reads video via Decord, samples frames at target FPS, smart resize |\n| Image loading | `fetch_image` (L106) | Handles local \u002F HTTP \u002F base64 \u002F PIL sources, aspect-ratio-aware resize |\n| Vision info dispatch | `process_vision_info` (L678) | Extracts mask info from conversations, routes to image or video processing |\n\n#### Model Modifications\n\n> `transformers\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fqwen2_5_vl\u002Fmodeling_qwen2_5_vl.py`\n\nThe Qwen2.5-VL `forward` pass (L2235) is extended with four additional inputs for attention supervision:\n\n| Parameter | Purpose |\n|:---|:---|\n| `attn_labels` | Marks which token positions participate in the attention loss |\n| `ins_contents_labels` | Maps each entity token to its instance index |\n| `ins_masks` | Ground-truth binary segmentation masks per instance |\n| `mask_index` | Indicates which video frames carry mask annotations |\n\nThe attention supervision pipeline runs at L2339 &ndash; L2363: extract multi-layer attention &rarr; filter valid tokens &rarr; collect pred\u002FGT pairs &rarr; compute BCE loss.\n\n---\n\n## Citation\n\nIf you find this work useful, please consider citing:\n\n```bibtex\n@inproceedings{sun2026swim,\n  title     = {See What I Mean: Aligning Vision and Language Representations\n               for Video Fine-grained Object Understanding},\n  author    = {Sun, Boyuan and Yin, Bowen and Li, Yuanming and Wei, Xihan and Hou, Qibin},\n  booktitle = {IEEE\u002FCVF Conference on Computer Vision 
and Pattern Recognition (CVPR)},\n  year      = {2026}\n}\n```\n\n## License\n\nThis code is licensed under [CC BY-NC 4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F) for non-commercial use only. Commercial use requires prior written permission.\n\n## Contact\n\n| | |\n|:---|:---|\n| Technical questions | `sbysbysby123[AT]gmail.com` |\n| Commercial licensing | `andrewhoux[AT]gmail.com` |\n| Jobs \u002F internships at Tongyi Lab | `xihan.wxh@alibaba-inc.com` &ensp; (WeChat: weixihan1) |\n\n## Acknowledgement\n\nWe thank [open-r1](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1), [PixelRefer](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FPixelRefer), [Qwen2.5-VL](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-7B-Instruct), [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers), [lmms_eval](https:\u002F\u002Fgithub.com\u002Flmms-lab\u002Flmms_eval), and [LLaVA-Video-178K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flmms-lab\u002FLLaVA-Video-178K) for their excellent work.\n","The SWIM project aims to achieve fine-grained object understanding in video by aligning vision and language representations. Its core capabilities include identifying a specific object in a video from a natural language reference and accurately describing that object's appearance, actions, and temporal dynamics, while avoiding misinterpretation of irrelevant objects. Technically, SWIM uses attention-level supervision during training to ensure the model correctly associates text tokens with their corresponding visual regions; it also uses a selective fine-tuning strategy that updates only the language model to improve efficiency. The project suits applications that need to extract precise information from video content, such as video analytics and intelligent surveillance.",2,"2026-06-11 03:58:31","CREATED_QUERY"]