[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80955":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":11,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":14,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":15,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":17,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":11,"lastSyncTime":24,"discoverSource":25},80955,"LLaVA-UHD-v4","THUMAI-Lab\u002FLLaVA-UHD-v4","THUMAI-Lab",null,"Python",33,2,31,0,3,9,1.43,false,"main",true,[],"2026-06-12 02:04:09","\u003Cdiv align=\"center\">\n\n# LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.08985) [![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLLaVA--UHD%20v4-000000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002FTHUMAI-Lab\u002FLLaVA-UHD-v4) [![HF Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.08985) [![Checkpoint (Ours)](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF--Checkpoint%20(Ours)-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002FPhoenixGS\u002FLLaVA-UHD-v4-8M) [![Checkpoint 
(Post-ViT)](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF--Checkpoint%20(Post--ViT)-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002FPhoenixGS\u002FLLaVA-UHD-v4-8M-Baseline) [![MS Checkpoint (Ours)](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope--Checkpoint%20(Ours)-624AFF?style=for-the-badge)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPhoenixGS\u002FLLaVA-UHD-v4-8M) [![MS Checkpoint (Post-ViT)](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope--Checkpoint%20(Post--ViT)-624AFF?style=for-the-badge)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPhoenixGS\u002FLLaVA-UHD-v4-8M-Baseline)\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n  \u003Cp>\n    \u003Ca href=\"#-news\">🎉 News\u003C\u002Fa> •\n    \u003Ca href=\"#-introduction\">📖 Introduction\u003C\u002Fa> •\n    \u003Ca href=\"#-performance\">📊 Performance\u003C\u002Fa> •\n    \u003Ca href=\"#-architecture\">🏗️ Architecture\u003C\u002Fa> •\n    \u003Ca href=\"#-evaluation\">🧪 Evaluation\u003C\u002Fa> •\n    \u003Ca href=\"#-citation\">🎈 Citation\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## 🎉 News\n\n- **[2026, May 24]** Evaluation code and model checkpoints are available.\n- **[Coming Soon]** Training code will be released before **June 7**.\n\n## 📖 Introduction\n\nThis repository hosts the code and model weights of **LLaVA-UHD v4**, a multimodal large language model (MLLM) designed for efficient high-resolution visual encoding. LLaVA-UHD v4 rethinks the conventional global-encoding-plus-post-ViT-compression paradigm and introduces a **slice-based encoding framework with intra-ViT early compression**. 
By moving token reduction into shallow ViT layers, our model substantially reduces the computational cost of visual encoding while preserving fine-grained perception ability.\n\nAcross eight standard benchmarks covering document understanding, OCR, mathematical reasoning, and general VQA, LLaVA-UHD v4 matches or even surpasses a post-ViT compression baseline under the same 16× final compression ratio, while **reducing visual-encoding FLOPs by 55.8%**. These results demonstrate that aggressive token compression can be performed inside the vision encoder without sacrificing downstream performance, offering a practical path toward scalable high-resolution MLLMs.\n\n## 📊 Performance\n\n\u003Cp align=\"center\">\n   \u003Cimg src=\".\u002Ffigures\u002Fscaling_and_flops.png\" alt=\"Scaling behavior and FLOPs comparison\" style=\"width: 85%;\">\n\u003C\u002Fp>\n\nThe figure above highlights the core efficiency–performance trade-off of LLaVA-UHD v4. Across training scales from 4M to 64M samples, LLaVA-UHD v4 closely tracks the performance of the strong post-ViT compression baseline, indicating that intra-ViT early compression preserves the model's scaling behavior. At the same time, by moving part of the token reduction into the vision encoder, LLaVA-UHD v4 reduces visual-encoding FLOPs from **3555G to 1573G**, achieving a **55.8% reduction** in computation.\n\n\u003Cp align=\"center\">\n   \u003Cimg src=\".\u002Ffigures\u002Ftable.png\" alt=\"Benchmark results\" style=\"width: 85%;\">\n\u003C\u002Fp>\n\n## 🏗️ Architecture\n\n\u003Cp align=\"center\">\n   \u003Cimg src=\".\u002Ffigures\u002Farchitecture.png\" alt=\"LLaVA-UHD v4 architecture\" style=\"width: 85%;\">\n\u003C\u002Fp>\n\nUnlike previous high-resolution MLLMs that encode the full image globally and compress visual tokens only after the ViT, LLaVA-UHD v4 adopts **slice-based encoding** and moves part of the compression directly into the vision encoder. 
The intra-ViT compressor first performs local window attention to aggregate neighboring visual information, then applies pixel-unshuffle and MLP-based fusion to reduce the token count. As a result, the remaining ViT layers operate on a much shorter visual sequence, substantially lowering the cost of high-resolution visual encoding while maintaining strong fine-grained perception.\n\n## 🧪 Evaluation\n\n### 1) Prepare environment\n\n```bash\ncd vlmevalkit\n# Use your own virtual environment path\nsource \u002Fpath\u002Fto\u002Fvenv\u002Fbin\u002Factivate\npip install -r requirements.txt\n```\n\nIf you want `run_eval.sh` to auto-activate your environment, set:\n\n```bash\nexport VENV_PATH=\u002Fpath\u002Fto\u002Fvenv\n```\n\nNote: some benchmarks require an LLM judge; set `OPENAI_API_KEY` before evaluation.  \nIf needed, you can also set `OPENAI_API_BASE` (or `OPENAI_API_KEY_JUDGE` \u002F `OPENAI_API_BASE_JUDGE`).\n\n### 2) Run evaluation\n\n```bash\ncd vlmevalkit\n\nexport MODEL_PATH=\u002Fpath\u002Fto\u002Fmodel_or_checkpoint\nexport MODEL_NAME=MiniCPM_4_V\nexport DATASETS=\"MMMU_DEV_VAL MathVista_MINI MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench\"\nexport SAVE_NAME=llava_uhd_v4_eval\n\n# Optional settings\nexport SAVE_ROOT=\u002Fpath\u002Fto\u002Fsave\u002Froot\nexport GPU_NUM=8\n\nbash .\u002Fscripts\u002Frun_eval.sh \"$MODEL_PATH\" \"$MODEL_NAME\" \"$DATASETS\" \"$SAVE_NAME\"\n```\n\n## 🎈 Citation\n\nIf you find LLaVA-UHD v4 helpful, please cite us.\n\n```bibtex\n@misc{fang2026llavauhdv4makesefficient,\n      title={LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?}, \n      author={Kechen Fang and Yihua Qin and Chongyi Wang and Wenshuo Ma and Tianyu Yu and Yuan Yao},\n      year={2026},\n      eprint={2605.08985},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.08985}, \n}\n```\n","LLaVA-UHD v4 
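is a multimodal large language model designed for efficient high-resolution visual encoding. By introducing a slice-based encoding framework with intra-ViT early compression, the project moves token reduction into the shallow ViT layers, substantially lowering the computational cost of visual encoding while preserving fine-grained perception. According to the project description, across eight standard benchmarks covering document understanding, OCR, mathematical reasoning, and general VQA, LLaVA-UHD v4 matches or even surpasses a post-ViT compression baseline under the same 16× final compression ratio while reducing visual-encoding FLOPs by 55.8%. The project is therefore well suited to applications that process large volumes of high-resolution images, such as image-intensive AI services or research projects.","2026-06-11 04:03:00","CREATED_QUERY"]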