[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72572":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},72572,"LLaVA-CoT","PKU-YuanGroup\u002FLLaVA-CoT","PKU-YuanGroup","[ICCV 2025] LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning","",null,"Python",2137,82,62,0,1,3,27.76,"Apache License 2.0",false,"main",[],"2026-06-12 02:03:05","\u003Cdiv align=center>\n\u003Cimg src=\"figures\u002Flogo.png\" width=\"280px\">\n\u003C\u002Fdiv>\n\n\u003Ch2 align=\"center\"> \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10440\">LLaVA-CoT: Let Vision Language Models Reason Step-by-Step\u003C\u002Fa>\u003C\u002Fh2>\n\n\u003Ch5 align=\"center\"> If you like our project, please give us a star ⭐ on GitHub for the latest update.\u003C\u002Fh5>\n\n\u003Ch5 align=center>\n\n[![hf](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Hugging%20Face-blue.svg)](https:\u002F\u002Fhuggingface.co\u002FXkev\u002FLlama-3.2V-11B-cot)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2411.10440-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10440)\n[![X (formerly Twitter) 
URL](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl?url=https%3A%2F%2Fx.com%2FKevin_GuoweiXu%2Fstatus%2F1858338565463421244)](https:\u002F\u002Fx.com\u002FKevin_GuoweiXu\u002Fstatus\u002F1858338565463421244)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache2.0-yellow)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLLaVA-CoT\u002Fblob\u002Fmain\u002FLICENSE)\n[![Replicate](https:\u002F\u002Freplicate.com\u002Fchenxwh\u002Fllava-cot\u002Fbadge)](https:\u002F\u002Freplicate.com\u002Fchenxwh\u002Fllava-cot)\n[![Wisemodel](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fwisemodel-code-blue)](https:\u002F\u002Fwisemodel.cn\u002Fcodes\u002FKevinTHU\u002FLLaVA-CoT)\n\n\u003C\u002Fh5>\n\n\u003Ch5 align=\"center\">This project aims to build an open-source multimodal slow-thinking reasoning model, developed by Li Yuan's research group at Peking University Shenzhen Graduate School.\u003C\u002Fh5>\n\n## 📣 News\n\n- **[2025\u002F07\u002F01]** LLaVA-CoT has been accepted by ICCV 2025!\n- **[2025\u002F04\u002F13]** We have updated the inference script that does not rely on VLMEvalKit, based on the suggestion in [this issue](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLLaVA-CoT\u002Fissues\u002F34).\n- **[2025\u002F01\u002F08]** We released the full training code.\n- **[2025\u002F01\u002F02]** We discovered that when testing with the AI2D benchmark, we were using AI2D_TEST_NO_MASK, while the VLMEvalKit utilizes AI2D_TEST. We previously overlooked the distinction between the two, and we sincerely apologize for this oversight. 
We will make the necessary corrections.\n- **[2024\u002F11\u002F28]** We've released the dataset: [https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXkev\u002FLLaVA-CoT-100k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXkev\u002FLLaVA-CoT-100k)\n- **[2024\u002F11\u002F25]** We've released the code for dataset generation: [dataset_generation\u002Fgenerate.py](dataset_generation\u002Fgenerate.py)\n- **[2024\u002F11\u002F23]** We've released the Gradio App: [https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FXkev\u002FLlama-3.2V-11B-cot](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FXkev\u002FLlama-3.2V-11B-cot)\n- **[2024\u002F11\u002F20]** We've released the pretrained weights: [https:\u002F\u002Fhuggingface.co\u002FXkev\u002FLlama-3.2V-11B-cot](https:\u002F\u002Fhuggingface.co\u002FXkev\u002FLlama-3.2V-11B-cot)\n- **[2024\u002F11\u002F18]** We've released our paper: [https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10440](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10440)\n- **[2024\u002F11\u002F18]** Welcome to **watch** 👀 this repository for the latest updates.\n\n## 🔥 Highlights\n\n**LLaVA-CoT** is a visual language model capable of spontaneous, systematic reasoning.\n\nOur 11B model outperforms **Gemini-1.5-pro**, **GPT-4o-mini**, and **Llama-3.2-90B-Vision-Instruct** on six challenging multimodal benchmarks.\n\n\u003Cdiv align=center>\n\u003Cimg src=\"figures\u002Fresult.png\" width=\"300px\">\n\u003C\u002Fdiv>\n\n## 🚀 Demos\n\nLLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.\n\n### Reasoning Problems\n\n| **Question**                      | \u003Cimg src=\"figures\u002Freasoning.png\" width=\"400\"> \u003Cbr> Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 
6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |\n| --------------------------------- | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| **Llama-3.2-11B-Vision-Instruct** | To determine the correct answer, we need to follow the instructions provided. \u003Cbr> The first instruction is to subtract all tiny shiny balls. 
\u003Cbr> Since there are \u003Cspan style=\"color: red;\">three tiny shiny balls\u003C\u002Fspan> in the image, we would be left with no shiny balls after performing this action. \u003Cbr> The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are \u003Cspan style=\"color: red;\">no shiny purple objects\u003C\u002Fspan> present in the image. Therefore, we cannot subtract any purple objects. \u003Cbr> Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is \u003Cspan style=\"color: red;\">C. 2\u003C\u002Fspan>, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |\n| **LLaVA-CoT**                     | **_Summary stage: What's the problem? What should I do?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;SUMMARY&gt;\u003C\u002Fspan> I will examine the image to identify and count all the objects, focusing on the \u003Cspan style=\"color: green;\">tiny shiny balls and purple objects\u003C\u002Fspan>. 
Then, I will subtract these objects from the total count and determine the number of remaining objects. \u003Cspan style=\"color: gray;\">&lt;\u002FSUMMARY&gt;\u003C\u002Fspan> \u003Cbr>**_Caption stage: What can I know from the image?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;CAPTION&gt;\u003C\u002Fspan> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are \u003Cspan style=\"color: green;\">small shiny green spheres\u003C\u002Fspan> and \u003Cspan style=\"color: green;\">a purple cylinder\u003C\u002Fspan>. \u003Cspan style=\"color: gray;\">&lt;\u002FCAPTION&gt;\u003C\u002Fspan> \u003Cbr> **_Reasoning stage: How to solve the problem step-by-step?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;REASONING&gt;\u003C\u002Fspan> First, I will count all the objects in the image: there are \u003Cspan style=\"color: green;\">10 objects in total\u003C\u002Fspan>. Next, I will identify and subtract the tiny shiny balls. There is \u003Cspan style=\"color: green;\">one small shiny green sphere\u003C\u002Fspan>. Then, I will identify and subtract the purple objects. There is \u003Cspan style=\"color: green;\">one purple cylinder\u003C\u002Fspan>. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that \u003Cspan style=\"color: green;\">8 objects remain\u003C\u002Fspan>. \u003Cspan style=\"color: gray;\">&lt;\u002FREASONING&gt;\u003C\u002Fspan> \u003Cbr> **_Conclusion stage: What is the final answer?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;CONCLUSION&gt;\u003C\u002Fspan> **B** \u003Cspan style=\"color: gray;\">&lt;\u002FCONCLUSION&gt;\u003C\u002Fspan> |\n\n### Science Problems\n\n| **Question**                      | \u003Cimg src=\"figures\u002Fscience.png\" width=\"400\"> \u003Cbr> An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. 
If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force \\( P = 2 \\, \\text{kN} \\), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis. \u003Cbr> **Options:** A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |\n| 
--------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------- |\n| **Llama-3.2-11B-Vision-Instruct** | To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis: \u003Cbr> \\( G = \\frac{F}{A \\gamma} \\) \u003Cbr> where \\( G \\) is the shear modulus, \\( F \\) is the applied force, \\( A \\) is the area of the block, and \\( \\gamma \\) is the shear strain. \u003Cbr> Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain \\( \\gamma \\). Since the displacement is horizontal and \u003Cspan style=\"color: red;\">the original length of the block is 400 mm\u003C\u002Fspan>, the shear strain \\( \\gamma \\) is: \u003Cbr> \\[ \\gamma = \\frac{\\Delta x}{L} = \\frac{2 \\, \\text{mm}}{400 \\, \\text{mm}} = 0.005 \\] \u003Cbr> Next, we calculate the area \\( A \\) of the block. The width of the block is given as 100 mm, and assuming \u003Cspan style=\"color: red;\">the height of the block is also 200 mm\u003C\u002Fspan> (as given in the image analysis), the area \\( A \\) is: \u003Cbr> \\[ A = 100 \\, \\text{mm} \\times 200 \\, \\text{mm} = 20000 \\, \\text{mm}^2 \\] \u003Cbr> Now, we can substitute these values into the formula for shear modulus: \u003Cbr> \\[ G = \\frac{F}{A \\gamma} = \\frac{2000 \\, \\text{N}}{20000 \\, \\text{mm}^2 \\times 0.005} = \\frac{2000}{100} = 20 \\, \\text{MPa} \\] \u003Cbr> Therefore, the correct answer is \u003Cspan style=\"color: red;\">C: 20 MPa\u003C\u002Fspan>.                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                    |\n| **LLaVA-CoT**                     | **_Summary stage: What's the problem? What should I do?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;SUMMARY&gt;\u003C\u002Fspan> To solve the problem, I will \u003Cspan style=\"color: green;\">determine the shear modulus\u003C\u002Fspan> of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will \u003Cspan style=\"color: green;\">use the formula for shear modulus\u003C\u002Fspan> and calculate the result. \u003Cspan style=\"color: gray;\">&lt;\u002FSUMMARY&gt;\u003C\u002Fspan> \u003Cbr> **_Caption stage: What can I know from the image?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;CAPTION&gt;\u003C\u002Fspan> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force \\( P = 2 \\, \\text{kN} \\) is applied. The block's dimensions are given as \u003Cspan style=\"color: green;\">400 mm in length, 200 mm in height\u003C\u002Fspan>, and 100 mm in width. 
\u003Cspan style=\"color: gray;\">&lt;\u002FCAPTION&gt;\u003C\u002Fspan> \u003Cbr> **_Reasoning stage: How to solve the problem step-by-step?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;REASONING&gt;\u003C\u002Fspan> The shear modulus \\( G \\) of the polymer can be calculated using the formula: \u003Cbr> \\[ G = \\frac{F}{A \\cdot \\frac{\\Delta x}{L}} \\] \u003Cbr> where: \u003Cbr> - \\( F \\) is the applied force (2 kN), \u003Cbr> - \\( A \\) is the area of the block's top surface (\u003Cspan style=\"color: green;\">100 mm x 400 mm = 40,000 mm²\u003C\u002Fspan>), \u003Cbr> - \\( \\Delta x \\) is the horizontal displacement (2 mm), \u003Cbr> - \\( L \\) is the height of the block (\u003Cspan style=\"color: green;\">200 mm\u003C\u002Fspan>). \u003Cbr> First, convert the force to Newtons: \\( F = 2 \\, \\text{kN} = 2000 \\, \\text{N} \\). \u003Cbr> Next, calculate the shear modulus: \u003Cbr> \\[ G = \\frac{2000}{40,000 \\cdot \\frac{2}{200}} \\] \u003Cbr> \\[ G = \\frac{2000}{40,000 \\cdot 0.01} \\] \u003Cbr> \\[ G = \\frac{2000}{400} \\] \u003Cbr> \\[ G = 5 \\, \\text{MPa} \\] \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;\u002FREASONING&gt;\u003C\u002Fspan> \u003Cbr> **_Conclusion stage: What is the final answer?_** \u003Cbr> \u003Cspan style=\"color: gray;\">&lt;CONCLUSION&gt;\u003C\u002Fspan> \u003Cspan style=\"color: green;\">B: 5 MPa\u003C\u002Fspan> \u003Cspan style=\"color: gray;\">&lt;\u002FCONCLUSION&gt;\u003C\u002Fspan> |\n\n## 🛠️ Usage\n\n### Pretrained weights\n\nYou can download the pretrained weights from Hugging Face: [Xkev\u002FLlama-3.2V-11B-cot](https:\u002F\u002Fhuggingface.co\u002FXkev\u002FLlama-3.2V-11B-cot).\n\n### Dataset\n\nYou can download the dataset from Hugging Face: [Xkev\u002FLLaVA-CoT-100k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXkev\u002FLLaVA-CoT-100k).\n\n### Inference\n\nYou can use the same code as Llama-3.2-11B-Vision-Instruct to load the model and perform inference.\n\nIf you want to perform 
inference-time scaling, you can refer to the detailed instructions provided in [this file](inference\u002FREADME.md).\n\n### Finetuning\n\nYou may use any repository that supports Llama-3.2-11B-Vision-Instruct for finetuning.\n\nWe recommend using [llama-recipes](https:\u002F\u002Fgithub.com\u002FMeta-Llama\u002Fllama-recipes).\n\nTo reproduce our results, you can use the following command:\n\n```bash\ncd train\npip install llama-recipes\ntorchrun --nnodes 1 --nproc_per_node 8 --master_port 29500 finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 4 --model_name meta-llama\u002FLlama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder .\u002Ffinetuned_model --dist_checkpoint_folder LLaVA-CoT --use_fast_kernels --dataset \"custom_dataset\" --custom_dataset.test_split \"test\" --custom_dataset.file \"datasets\u002Fcot_dataset.py\"  --run_validation False --batching_strategy padding\n```\n\nRemember to modify the `data_path` and `image_base_path` in `train\u002Fcot_dataset.py` to your own path (the path to the training dataset).\n\n## 📝 Citation\n\nIf you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:\n\n```\n@InProceedings{Xu_2025_ICCV,\n    author    = {Xu, Guowei and Jin, Peng and Wu, Ziang and Li, Hao and Song, Yibing and Sun, Lichao and Yuan, Li},\n    title     = {LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},\n    booktitle = {Proceedings of the IEEE\u002FCVF International Conference on Computer Vision (ICCV)},\n    month     = {October},\n    year      = {2025},\n    pages     = {2087-2098}\n}\n```\n\n## 🙏 Acknowledgement\n\n- The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi\u002Fblob\u002Fmain\u002FLICENSE) file.\n- The service is a research preview intended for non-commercial use only, subject to [LLAMA 3.2 COMMUNITY LICENSE 
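AGREEMENT](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-1B\u002Fblob\u002Fmain\u002FLICENSE.txt), and [Terms of Use](https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use) of the data generated by OpenAI. Please contact us if you find any potential violations.\n- The template is modified from [Chat-Univi](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi) and [LLaVA](https:\u002F\u002Fllava-vl.github.io\u002F).\n","LLaVA-CoT is a visual language model capable of spontaneous, systematic reasoning. Its core capability is understanding and solving complex multimodal tasks through multi-step reasoning. It is developed in Python and built on an 11-billion-parameter model, and it outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks. LLaVA-CoT suits scenarios that require advanced visual understanding and text interaction, such as image caption generation, complex question answering, and image-based content analysis. The project is released under the Apache License 2.0 and supports community contributions and extensions.",2,"2026-06-11 03:42:37","high_star"]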