[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80150":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":14,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},80150,"Vision-OPD","VisionOPD\u002FVision-OPD","VisionOPD","Vision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged crop-conditioned perception to its full-image policy, enabling fine-grained visual understanding in a single forward pass without external teachers, labels, or verifiers.","",null,"Python",106,3,1,4,0,21,51,7,1.81,"Apache License 2.0",false,"main",[],"2026-06-12 02:03:58","# Vision-OPD\n\n**Vision-OPD: Learning to See Fine-Grained Details for Multimodal LLMs via On-Policy Self-Distillation**\n\n\u003Cp align=\"center\">\n📃 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18740\" target=\"_blank\">Paper\u003C\u002Fa> | \n🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyuanqianhao\u002FVision-OPD-6K\">Training Dataset\u003C\u002Fa> \n\u003C\u002Fp>\n\n## News\n\n- **[2026.5.18]** The paper is released on arXiv.\n- **[2026.5.19]** The training code and data are released.\n- **[2026.5.26]** The evaluation code is released.\n- Model release is under company review. 
Coming soon.\n\n## Overview\n\nVision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged regional perception to its full-image policy, enabling fine-grained visual understanding in a single forward pass — without external teachers, ground-truth labels, reward verifiers, or inference-time tool use.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Faverage_bar_chart.png\" alt=\"Vision-OPD Average Scores\" width=\"60%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Ci>Average scores across fine-grained visual understanding benchmarks, including V* Bench, ZoomBench,  HR-Bench 4K, HR-Bench 8K, MME-RealWorld-EN and MME-RealWorld-CN.\u003C\u002Fi>\u003C\u002Fp>\n\n## Quick Start\n\n### 1. Environment Setup\n\n```bash\nconda create -n vision-opd python=3.12\nconda activate vision-opd\npip install --upgrade pip\npip install --no-deps -r requirements.txt\npip install -e . --no-deps\npip install flash-attn --no-build-isolation\npip install causal-conv1d==1.6.1 --no-build-isolation\n```\n\n### 2. Prepare Training Data\n\nDownload and preprocess the [Vision-OPD-6K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyuanqianhao\u002FVision-OPD-6K) dataset:\n\n```bash\npython scripts\u002Fprepare_data.py --data-dir .\u002Fdata\n```\n\nThis downloads images and metadata from HuggingFace, extracts archives, and converts `train.jsonl` to the parquet format expected by the training pipeline.\n\n### 3. Training\n\nLaunch Vision-OPD training:\n\n```bash\nbash scripts\u002Frun_vision_opd.sh\n```\n\nKey hyperparameters can be edited at the top of the script. See the script for the full configuration.\n\n### 4. 
Merge Checkpoints\n\nAfter training, merge the FSDP-sharded checkpoint into a standard HuggingFace model:\n\n```bash\nbash scripts\u002Fmerge_checkpoint.sh \u003Cpath_to_checkpoint>\n```\n\nFor example:\n\n```bash\nbash scripts\u002Fmerge_checkpoint.sh .\u002Fcheckpoints\u002FVision-OPD-Qwen3.5-4B\u002Fglobal_step_65\u002F\n```\n\nThis merges the FSDP actor shards and saves the model weights, config, tokenizer, and processor into the specified directory. The merged checkpoint can then be loaded directly with `transformers` or served with vLLM.\n\n### 5. Deployment\n\nServe the merged checkpoint with vLLM, for example:\n\n```bash\nvllm serve \u003Cpath_to_merged_checkpoint> \\\n    --gpu-memory-utilization 0.85 \\\n    --tensor-parallel-size 8 \\\n    --served-model-name Vision-OPD-4B \\\n    --trust-remote-code\n```\n\nThe server listens on port 8000 by default. You can then query the model via the OpenAI-compatible API at `http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions`.\n\n### 6. Evaluation\n\nEvaluate the deployed model on fine-grained visual benchmarks:\n\n```bash\nAPI_BASE=\"http:\u002F\u002Flocalhost:8000\u002Fv1\u002F\" \\\nOPENAI_MODEL_ID=\"Vision-OPD-4B\" \\\nJUDGE_API_BASE=\"YOUR_JUDGE_API_BASE\" \\\nJUDGE_MODEL=\"YOUR_JUDGE_MODEL_NAME\" \\\nBENCHMARK=\"vstar,zoombench,hrbench-4k,hrbench-8k,mme-realworld,mme-realworld-cn\" \\\nbash eval\u002Frun_eval.sh\n```\n\nSupported benchmarks: `vstar`, `zoombench`, `hrbench-4k`, `hrbench-8k`, `mme-realworld`, `mme-realworld-cn`.\n\nThe evaluation script runs inference via the OpenAI-compatible API. 
Judge configuration can be set via `JUDGE_API_BASE` \u002F `JUDGE_MODEL_PATH` environment variables.\n\n## Project Structure\n\n```\nVision-OPD\u002F\n├── verl\u002F                    # Modified verl framework with self-distillation support\n├── scripts\u002F\n│   ├── run_vision_opd.sh    # Training launch script\n│   ├── merge_checkpoint.sh  # FSDP checkpoint merger\n│   └── prepare_data.py      # Data download & preprocessing\n├── eval\u002F\n│   ├── run_eval.sh          # Evaluation entry point\n│   ├── prepare_data.py      # Benchmark data download & preparation\n│   ├── infer.py             # OpenAI API inference\n│   ├── judge_qwenlm.py      # LLM-based judge (API or local vLLM)\n│   ├── cal_acc.py           # Accuracy calculation\n│   └── vstar_bench_utils.py # V*Bench question formatting utilities\n├── chat_templates\u002F\n│   └── perception_chat_template_qwen35.jinja\n├── figures\u002F                 # Paper figures\n├── pyproject.toml\n└── LICENSE\n```\n\n## Citation\n\nIf you find Vision-OPD useful for your research, please consider citing:\n\n```bibtex\n@article{yuan2026vision,\n  title={Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation},\n  author={Yuan, Qianhao and Lou, Jie and Yu, Xing and Lin, Hongyu and Sun, Le and Han, Xianpei and Lu, Yaojie},\n  journal={arXiv preprint arXiv:2605.18740},\n  year={2026}\n}\n```\n\n## License\n\nApache-2.0 License\n","Vision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged regional perception to its full-image policy, enabling fine-grained visual understanding in a single forward pass without external teachers, labels, or verifiers. The project is written in Python and uses self-distillation so that the model can improve its understanding of image details without relying on additional resources. It is particularly suited to applications that need to process complex visual information efficiently and accurately, such as visual understanding tasks in multimodal large language models. In addition, Vision-OPD provides detailed guides for environment setup, data preparation, training, checkpoint merging, and deployment, so that researchers and developers can get started quickly and customize it to their needs.",2,"2026-06-11 03:59:26","CREATED_QUERY"]