[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72065":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":36,"readmeContent":37,"aiSummary":38,"trendingCount":16,"starSnapshotCount":16,"syncStatus":39,"lastSyncTime":40,"discoverSource":41},72065,"VLM-R1","om-ai-lab\u002FVLM-R1","om-ai-lab","Solve Visual Understanding with Reinforced VLMs","",null,"Python",5971,380,45,164,0,4,7,19,12,38.74,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35],"deepseek-r1","grpo","llm","multimodal","multimodal-r1","qwen","r1-zero","reinforcement-learning","vlm","vlm-r1","2026-06-12 02:02:58","# VLM-R1: A stable and generalizable R1-style Large Vision-Language Model\n\n\u003Cfont size=4>\u003Cdiv align='center' > [[🤗 REC Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fomlab\u002FVLM-R1-Referral-Expression)] [[🤗 OVD Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fomlab\u002FVLM-R1-OVD)] [[🤗 REC Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fomlab\u002FVLM-R1)] [[🤗 Checkpoints](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fomlab\u002Fvlm-r1-models-67b7352db15c19d57157c348)] [[ModelScope Checkpoints](https:\u002F\u002Fmodelscope.cn\u002Fcollections\u002FOm_AI_Lab\u002FVLM-R1-models)] \u003C\u002Fdiv>\u003C\u002Ffont>\n\n\u003Cfont size=4>\u003Cdiv align='center'>[[📄 Tech Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07615)] [[📝 Blog](https:\u002F\u002Fom-ai-lab.github.io\u002Findex.html)]\u003C\u002Fdiv>\u003C\u002Ffont>\n\n\u003Cdiv 
align=\"center\">\n\u003Cimg src=\".\u002Fassets\u002Fperformance4.png\" width=\"900\"\u002F>\n\u003Cdiv>\n  \u003Cfont size=4>\n    \u003Cp>🎉  \u003Cb>Our VLM-R1 Math model reaches the top of the Open-Compass Math Leaderboard (under 4B parameters), and our OVD model achieves state-of-the-art performance on OVDEval.\u003C\u002Fb>\u003C\u002Fp>\n  \u003C\u002Ffont>\n\u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\nSince the introduction of [DeepSeek-R1](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1), numerous works have emerged that focus on reproducing and improving upon it. In this project, we propose VLM-R1, a stable and generalizable R1-style Large Vision-Language Model.\n\nSpecifically, for the task of Referring Expression Comprehension (REC), we trained [Qwen2.5-VL](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-VL) using both R1 and SFT approaches. The results reveal that, on the in-domain test data, the performance of the SFT model shows little change compared to that of the base model when the number of training steps is relatively small (100–600 steps), while the R1 model shows a steady improvement (as shown at the left of the figure below). More importantly, on the out-of-domain test data, the SFT model's performance deteriorates slightly as the number of steps increases, whereas the RL model generalizes its reasoning ability to the out-of-domain data (as shown at the right of the figure below).\n\n![image](.\u002Fassets\u002Fperformance3.png)\n\\* *We found that previous REC SFT experiments used a mismatched pixel config. We therefore re-ran the study with the correct config on more complex out-of-domain data. 
See our [findings](https:\u002F\u002Fom-ai-lab.github.io\u002F2025_03_24.html) for details.*\n\n## 🚀 Features\n\nThis repository supports:\n\n- **`Full Fine-tuning for GRPO`**: see [run_grpo_rec.sh](run_scripts\u002Frun_grpo_rec.sh)\n- **`Freeze Vision Modules`**: set `freeze_vision_modules` as `true` in the script.\n- **`LoRA Fine-tuning for GRPO`**: see [run_grpo_rec_lora.sh](run_scripts\u002Frun_grpo_rec_lora.sh)\n- **`Multi-node Training`**: see [multinode_training_demo.sh](run_scripts\u002Fmultinode_training_demo.sh)\n- **`Multi-image Input Training`**: see [run_grpo_gui.sh](run_scripts\u002Frun_grpo_gui.sh)\n- **`For your own data`**: see [here](#for-your-own-data)\n- **`Various VLMs`**: see [How to add a new model](assets\u002Fadd_new_model.md), now we support QwenVL and InternVL\n\n## 🗞️ Update\n\n- **`2025-08-29`**: 🔥🔥🔥 We have further optimized the VLM-R1 series models based on JD's latest open-source inference framework `xllm` (github is [here](https:\u002F\u002Fgithub.com\u002Fjd-opensource\u002Fxllm)). The TTFT (Time to First Token) has been reduced by 50% compared to `vllm-ascend`, and the overall throughput has increased by 127% compared to `vllm-ascend`. Please refer to [ascend_inference\u002F910B\u002Fxllm\u002FREADME.md](ascend_inference\u002F910B\u002Fxllm\u002FREADME.md) for more details.\n\n- **`2025-08-22`**: We have adapted the VLM-R1 series models to Huawei Ascend Atlas 800T A2 and Atlas 300I Duo series using the vllm-ascend framework, further expanding the deployment scenarios and hardware compatibility of the model series. 
Please refer to [ascend_inference\u002F910B\u002Fvllm_ascend\u002FREADME.md](ascend_inference\u002F910B\u002Fvllm_ascend\u002FREADME.md) and [ascend_inference\u002F300IDuo\u002FREADME.md](ascend_inference\u002F300IDuo\u002FREADME.md) for more details.\n\n- **`2025-06-26`**: We introduce a post-resize operation for the bounding box for QwenVL (both [training](src\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fvlm_modules\u002Fqwen_module.py#L124-L129) and [evaluation](src\u002Feval\u002Ftest_rec_r1.py#L92-L97)) and the results are improved slightly.\n- **`2025-04-16`**: We have updated the codebase to improve functionality and maintain unified implementation. Specifically, the REC process is now integrated into [grpo_jsonl.py](src\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fgrpo_jsonl.py) for consistency across tasks. Additionally, we introduce a new parameter, `is_reward_customized_from_vlm_module`, which enables the use of customized reward functions defined within the VLM module. When set to `true`, the reward logic is handled in either [QwenVL2Module](src\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fvlm_modules\u002Fqwen_module.py) or [InternVLModule](src\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fvlm_modules\u002Finternvl_module.py), depending on the selected model. 
Furthermore, the training log has been enhanced to provide more detailed output for easier monitoring and debugging.\n- **`2025-04-11`**: 🔥🔥🔥 We release the [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07615) of VLM-R1, summarizing our main results and insights.\n- **`2025-04-03`**: We add the `odLength`, `weighted_sum`, and `cosine` rewards used in the OVD task. Please refer to our [blog post](https:\u002F\u002Fom-ai-lab.github.io\u002F2025_03_20.html) and [findings](https:\u002F\u002Fom-ai-lab.github.io\u002F2025_03_24.html) for details of the reward usage, and see [grpo_jsonl.py](src\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fgrpo_jsonl.py) for the code implementation.\n- **`2025-03-24`**: 🔥 We release the [findings](https:\u002F\u002Fom-ai-lab.github.io\u002F2025_03_24.html) of VLM-R1-OVD.\n- **`2025-03-23`**: 🔥 We release the VLM-R1-OVD [model weights](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FVLM-R1-Qwen2.5VL-3B-OVD-0321) and [demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fomlab\u002FVLM-R1-OVD), which show state-of-the-art performance on OVDEval. Feel free to try them.\n- **`2025-03-20`**: 🔥 We achieved SOTA results on [OVDEval](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FOVDEval) with our RL-based model, outperforming SFT baselines and specialized object detection models. Read our [blog post](https:\u002F\u002Fom-ai-lab.github.io\u002F2025_03_20.html) for details on how reinforcement learning enhances object detection performance.\n- **`2025-03-17`**: Our VLM-R1 Math model reaches the top of the [Open-Compass Math Leaderboard](https:\u002F\u002Frank.opencompass.org.cn\u002Fleaderboard-multimodal-reasoning\u002F?m=REALTIME) (under 4B parameters). We have released the [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FVLM-R1-Qwen2.5VL-3B-Math-0305).\n- **`2025-03-15`**: We support multi-image input data. Check the format of multi-image input [here](#for-your-own-data). 
We also provide an example of multi-image script [run_grpo_gui.sh](run_scripts\u002Frun_grpo_gui.sh), see [here](#for-your-own-data) for details.\n- **`2025-03-13`**: We support InternVL for GRPO. See [run_grpo_rec_internvl.sh](run_scripts\u002Frun_grpo_rec_internvl.sh) for details. The annotation json files used in InternVL are [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fomlab\u002FVLM-R1\u002Fresolve\u002Fmain\u002Frec_jsons_internvl.zip). If you want to add your new model, please refer to [How to add a new model](assets\u002Fadd_new_model.md).\n- **`2025-03-02`**: We support LoRA Fine-tuning for GRPO. See [run_grpo_rec_lora.sh](run_scripts\u002Frun_grpo_rec_lora.sh) for details.\n- **`2025-02-27`**: We support the `number of iterations per batch` and `epsilon value for clipping` in the original GRPO algorithm with args: `--num_iterations` and `--epsilon`.\n- **`2025-02-25`**: We support multi-node training for GRPO. See [multinode_training_demo.sh](run_scripts\u002Fmultinode_training_demo.sh) for details.\n- **`2025-02-21`**: We release the [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FQwen2.5VL-3B-VLM-R1-REC-500steps) of the VLM-R1 REC model.\n- **`2025-02-20`**: We release the script for [general data loading](#for-your-own-data).\n- **`2025-02-19`**: We incorporate an explanation of the [SFT](#sft) method.\n- **`2025-02-17`**: We release the VLM-R1 REC [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fomlab\u002FVLM-R1-Referral-Expression) on Hugging Face Spaces.\n- **`2025-02-15`**: We release the VLM-R1 repository and [GRPO](#grpo) training script.\n\n## 🤖 Models\n\n- **[`OVD`](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FVLM-R1-Qwen2.5VL-3B-OVD-0321)**: Trained with VLM-R1, our Open-Vocabulary Detection (OVD) model achieves the state-of-the-art performance on OVDEval.\n- **[`Math`](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FVLM-R1-Qwen2.5VL-3B-Math-0305)**: Through VLM-R1 training, our math model focuses on 
multimodal reasoning tasks and has achieved Top1 on the OpenCompass Multi-modal Reasoning Leaderboard among models \u003C 4B.\n- **[`REC`](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FQwen2.5VL-3B-VLM-R1-REC-500steps)**: Trained with VLM-R1, our Referring Expression Comprehension (REC) model showcases the superior performance on out-of-domain data and a series of reasoning-grounding tasks.\n- **[`GUI`](https:\u002F\u002Fhuggingface.co\u002Fkonkazzz\u002FGT-r1)**: Trained with VLM-R1, our GUI Defect Detection model outperforms both base and SFT models by achieving the best accuracy and improved generalization across both defective and clean screens.\n\n| Version                          | Base VLM     | Checkpoint                                                                                           | Task Type                 |\n| -------------------------------- | ------------ | ---------------------------------------------------------------------------------------------------- | ------------------------- |\n| VLM-R1-Qwen2.5VL-3B-OVD-0321     | Qwen2.5VL-3B | [omlab\u002FVLM-R1-Qwen2.5VL-3B-OVD-0321](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FVLM-R1-Qwen2.5VL-3B-OVD-0321)         | Open-Vocabulary Detection |\n| VLM-R1-Qwen2.5VL-3B-Math-0305    | Qwen2.5VL-3B | [omlab\u002FVLM-R1-Qwen2.5VL-3B-Math-0305](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FVLM-R1-Qwen2.5VL-3B-Math-0305)       | Multi-Modal Math          |\n| VLM-R1-Qwen2.5VL-3B-REC-500steps | Qwen2.5VL-3B | [omlab\u002FQwen2.5VL-3B-VLM-R1-REC-500steps](https:\u002F\u002Fhuggingface.co\u002Fomlab\u002FQwen2.5VL-3B-VLM-R1-REC-500steps) | REC\u002FReasoning-Grounding   |\n\n## 🎯 ToDo\n\n- [X] Implement multi-node training.\n- [X] Implement LoRA Fine-tuning.\n- [X] Support more Multimodal LLMs.\n- [X] Support multi-image input.\n- [X] Release the VLM-R1 Math model.\n- [X] Release the blog of VLM-R1.\n- [X] Release the VLM-R1-OVD model.\n- [X] Release the technical report of VLM-R1.\n- [X] 
Adapt to Huawei Ascend Atlas 800T A2 and Atlas 300I Duo series using the vllm-ascend framework.\n- [X] Adapt to Huawei Ascend Atlas 800T A2 series using the xllm framework.\n- [ ] Study cross task generalization.\n- [ ] Enhance VLM for other tasks [welcome issue].\n\n## 🛠️ Setup\n\n```bash\nconda create -n vlm-r1 python=3.10\nconda activate vlm-r1\nbash setup.sh\n```\n\n## 💪🏻 Training\n\n### Referring Expression Comprehension (REC)\n\n#### 📚 GRPO\n\n1. Download the [COCO Train2014 image](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fomlab\u002FVLM-R1\u002Fresolve\u002Fmain\u002Ftrain2014.zip) and unzip it, and we refer to the image dir as `\u003Cyour_image_root>`.\n2. Download the [RefCOCO\u002F+\u002Fg and LISA-Grounding Annotation files](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fomlab\u002FVLM-R1\u002Fresolve\u002Fmain\u002Frec_jsons_processed.zip) and unzip it (LISA-Grounding is used for out-of-domain evaluation).\n3. Change the `data_paths` and `image_folders` in the [run_scripts\u002Frun_grpo_rec.sh](run_scripts\u002Frun_grpo_rec.sh) file.\n\n```bash\n# These jsonl files are included in the annotation files at step 2.\n# Note: please use jsonl files instead of json files.\ndata_paths=\"path\u002Fto\u002Frefcoco_train.jsonl:path\u002Fto\u002Frefcocop_train.jsonl:path\u002Fto\u002Frefcocog_train.jsonl\"\nimage_folders=\"path\u002Fto\u002Fcoco:path\u002Fto\u002Fcoco:path\u002Fto\u002Fcoco\"\n```\n\n4. 
``bash run_scripts\u002Frun_grpo_rec.sh``\n\n> [!NOTE]\n> If you encounter a 'CUDA out of memory' error, try reducing the `per_device_train_batch_size`.\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\".\u002Fassets\u002Fiou.jpg\" width=\"750\"\u002F>\n\u003C\u002Fdiv>\n\u003C!-- ![image](.\u002Fassets\u002Fwandb.jpg) -->\n\n#### 📚 Multi-Node GRPO\n\nFor multi-node training, please refer to [multinode_training_demo.sh](src\u002Fopen-r1-multimodal\u002Fmultinode_training_demo.sh).\n\n#### 📚 SFT\n\nWe use [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) to train the SFT model.\n\n1. Clone the [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) repository and install the dependencies.\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory.git\ncd LLaMA-Factory\npip install -e \".[torch,metrics]\"\n```\n\n2. Download the dataset_info.json, mllm_rec_json.json, and qwen2_5_vl_full_sft.yaml files we provide [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fomlab\u002FVLM-R1\u002Ftree\u002Fmain\u002Fsft_related). Put the json files in the `LLaMA-Factory\u002Fdata` directory and the yaml file in the `LLaMA-Factory\u002Fexamples\u002Ftrain_full` directory.\n3. Run the following command to train the SFT model.\n\n```bash\nllamafactory-cli train examples\u002Ftrain_full\u002Fqwen2_5_vl_full_sft.yaml\n```\n\n### For your own data\n\n\u003Cdiv style=\"text-align: justify;\">\n\nLoading jsonl data in the format below is supported by [`src\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fgrpo_jsonl.py`](src\u002Fopen-r1-multimodal\u002Fsrc\u002Fopen_r1\u002Fgrpo_jsonl.py). Please note that you may need to use different reward functions for your specialized tasks. 
Welcome to PR to add your own reward functions or share any other interesting findings!\n\n\u003C\u002Fdiv>\n\nThe jsonl has the format as follows:\n\n```json\n{\n  \"id\": 1,\n  \"image\": \"Clevr_CoGenT_TrainA_R1\u002Fdata\u002Fimages\u002FCLEVR_trainA_000001_16885.png\",\n  \"conversations\": [\n    {\"from\": \"human\", \"value\": \"\u003Cimage>What number of purple metallic balls are there?\"},\n    {\"from\": \"gpt\", \"value\": \"0\"}\n  ]\n}\n```\n\nIf you want to use multi-image input, you can use the following format:\n\n```json\n{\n  \"id\": 1,\n  \"image\": [\"Clevr_CoGenT_TrainA_R1\u002Fdata\u002Fimages\u002FCLEVR_trainA_000001_16885.png\", \"Clevr_CoGenT_TrainA_R1\u002Fdata\u002Fimages\u002FCLEVR_trainA_000001_16886.png\"],\n  \"conversations\": [\n    {\"from\": \"human\", \"value\": \"\u003Cimage>\u003Cimage>What number of purple metallic balls in total within the two images?\"},\n    {\"from\": \"gpt\", \"value\": \"3\"}\n  ]\n}\n```\n\n> [!NOTE]\n> The image path in the jsonl file should be relative to the image folder specified in `--image_folders`. The absolute path of the input image is constructed as `os.path.join(image_folder, data['image'])`. 
For example:\n\n- If your jsonl has `\"image\": \"folder1\u002Fimage1.jpg\"`\n- And you specify `--image_folders \"\u002Fpath\u002Fto\u002Fimages\u002F\"`\n- The full image path will be `\u002Fpath\u002Fto\u002Fimages\u002Ffolder1\u002Fimage1.jpg`\n\nMultiple data files and image folders can be specified using \":\" as a separator:\n\n```bash\n--data_file_paths \u002Fpath\u002Fto\u002Fdata1.jsonl:\u002Fpath\u002Fto\u002Fdata2.jsonl \\\n--image_folders \u002Fpath\u002Fto\u002Fimages1\u002F:\u002Fpath\u002Fto\u002Fimages2\u002F\n```\n\nThe script can be run like this:\n\n```bash\n# See run_grpo_rec.sh for a complete example.\n# --data_file_paths and --image_folders can each take multiple entries, separated by \":\".\n# (Comments are kept up here because a comment after a trailing backslash breaks the line continuation.)\ntorchrun --nproc_per_node=\"8\" \\\n    --nnodes=\"1\" \\\n    --node_rank=\"0\" \\\n    --master_addr=\"127.0.0.1\" \\\n    --master_port=\"12345\" \\\n  src\u002Fopen_r1\u002Fgrpo_jsonl.py \\\n    --output_dir output\u002F$RUN_NAME \\\n    --model_name_or_path Qwen\u002FQwen2.5-VL-3B-Instruct \\\n    --deepspeed ${REPO_HOME}\u002Fsrc\u002Fopen-r1-multimodal\u002Flocal_scripts\u002Fzero3.json \\\n    --data_file_paths \u002Fpath\u002Fto\u002Fyour\u002Fdata.jsonl \\\n    --image_folders \u002Fpath\u002Fto\u002Fyour\u002Fimage\u002Ffolder \\\n    ...\n```\n\n\u003Cdiv style=\"text-align: justify;\">\n\n### Multi-image Input\nWe provide an example multi-image script, [run_grpo_gui.sh](src\u002Fopen-r1-multimodal\u002Frun_scripts\u002Frun_grpo_gui.sh). The task requires the model to analyze two GUI screenshots, taken before and after a user action, and determine whether any UI interaction defects are present. The data comes from [GUI-Testing-Arena](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsongjah\u002FGTArena-UI-Defects). Download the [images](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fomlab\u002FVLM-R1\u002Fresolve\u002Fmain\u002Fgui_multi-image.zip) and unzip them into `\u002Fpath\u002Fto\u002Fimages\u002F`. 
Then modify the `image_folders` parameter in the script and run it.\n\n```bash\nbash run_scripts\u002Frun_grpo_gui.sh\n```\n\n\u003C\u002Fdiv>\n\n## 📊 Evaluation\n\n![image](.\u002Fassets\u002Fdata2.png)\n\n1. Download the provided [LISA-Grounding images](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fomlab\u002FVLM-R1\u002Fresolve\u002Fmain\u002Flisa-test.zip).\n\n```bash\ncd .\u002Fsrc\u002Feval\n\n# Remember to change the model path, image root, and annotation path in the script\ntorchrun --nproc_per_node=X test_rec_r1.py # for GRPO. 'X' is the number of GPUs you have.\ntorchrun --nproc_per_node=X test_rec_baseline.py # for SFT.\n```\n\n## 🔍 Ascend Inference\n\nWe have adapted the VLM-R1 series models to Huawei Ascend Atlas 800T A2 and Atlas 300I Duo series using the vllm-ascend framework. The specific adaptation and inference are as follows:\n\n- **Atlas 800T A2**: Please refer to [ascend_inference\u002F910B\u002Fvllm_ascend\u002FREADME.md](ascend_inference\u002F910B\u002Fvllm_ascend\u002FREADME.md)\n- **Atlas 300I Duo**: Please refer to [ascend_inference\u002F300IDuo\u002FREADME.md](ascend_inference\u002F300IDuo\u002FREADME.md)\n\n## 🤝 Acknowledgements\n\nWe would like to express our sincere gratitude to [DeepSeek](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1), [Open-R1](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1), [QwenVL](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-VL), [Open-R1-Multimodal](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Fopen-r1-multimodal), [R1-V](https:\u002F\u002Fgithub.com\u002FDeep-Agent\u002FR1-V), [RefCOCO](https:\u002F\u002Fgithub.com\u002Flichengunc\u002Frefer), [RefGTA](https:\u002F\u002Fgithub.com\u002Fmikittt\u002Feasy-to-understand-REG\u002Ftree\u002Fmaster\u002Fpyutils\u002Frefer2), [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory), [OVDEval](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FOVDEval), 
[GUI-Testing-Arena](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsongjah\u002FGTArena-UI-Defects), and [LISA](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLISA) for providing open-source resources that contributed to the development of this project.\n\n## ⭐️ Citation\n\nIf you find this project useful, please consider citing us.\n\n```bib\n@article{shen2025vlm,\n  title={Vlm-r1: A stable and generalizable r1-style large vision-language model},\n  author={Shen, Haozhan and Liu, Peng and Li, Jingcheng and Fang, Chunxin and Ma, Yibo and Liao, Jiajia and Shen, Qiaoli and Zhang, Zilun and Zhao, Kangjia and Zhang, Qianqian and Xu, Ruochen and Zhao, Tiancheng},\n  journal={arXiv preprint arXiv:2504.07615},\n  year={2025}\n}\n```\n","VLM-R1 is a stable and generalizable R1-style large vision-language model that aims to improve visual understanding through reinforcement learning. The project builds on the Qwen2.5-VL model, trained with both R1 and SFT approaches, and performs well on both in-domain and out-of-domain data, with notably stronger reasoning on out-of-domain data. It supports several training modes, including full fine-tuning, frozen vision modules, LoRA fine-tuning, multi-node training, and multi-image input training, and provides a workflow for custom datasets. It suits applications that need advanced visual understanding and cross-modal interaction, such as image captioning and visual question answering.",2,"2026-06-11 03:40:13","high_star"]