[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72012":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},72012,"ml-fastvlm","apple\u002Fml-fastvlm","apple","This repository contains the official implementation of \"FastVLM: Efficient Vision Encoding for Vision Language Models\" - CVPR 2025","",null,"Python",7364,553,65,49,0,3,14,31,9,39.23,"Other",false,"main",[],"2026-06-12 02:02:57","# FastVLM: Efficient Vision Encoding for Vision Language Models\n\nThis is the official repository of\n**[FastVLM: Efficient Vision Encoding for Vision Language Models](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2412.13303). (CVPR 2025)**\n\n[\u002F\u002F]: # (![FastViTHD Performance]&#40;docs\u002Facc_vs_latency_qwen-2.png&#41;)\n\u003Cp align=\"center\">\n\u003Cimg src=\"docs\u002Facc_vs_latency_qwen-2.png\" alt=\"Accuracy vs latency figure.\" width=\"400\"\u002F>\n\u003C\u002Fp>\n\n### Highlights\n* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.  
\n* Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.\n* Our larger variants using the Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.\n* We provide a demo iOS app that demonstrates the performance of our model on a mobile device.\n\n\u003Ctable>\n\u003Ctr>\n    \u003Ctd>\u003Cimg src=\"docs\u002Ffastvlm-counting.gif\" alt=\"FastVLM - Counting\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"docs\u002Ffastvlm-handwriting.gif\" alt=\"FastVLM - Handwriting\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"docs\u002Ffastvlm-emoji.gif\" alt=\"FastVLM - Emoji\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## Getting Started\nWe use the LLaVA codebase to train FastVLM variants. To train or finetune your own variants, \nplease follow the instructions provided in the [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) codebase. \nWe provide instructions for running inference with our models.   
\n\n### Setup\n```bash\nconda create -n fastvlm python=3.10\nconda activate fastvlm\npip install -e .\n```\n\n### Model Zoo\nFor detailed information on various evaluations, please refer to our [paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2412.13303).\n\n| Model        | Stage |                                            PyTorch Checkpoint (URL)                                             |\n|:-------------|:-----:|:---------------------------------------------------------------------------------------------------------------:|\n| FastVLM-0.5B |   2   | [fastvlm_0.5b_stage2](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_0.5b_stage2.zip) |\n|              |   3   | [fastvlm_0.5b_stage3](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_0.5b_stage3.zip) |\n| FastVLM-1.5B |   2   | [fastvlm_1.5b_stage2](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_1.5b_stage2.zip) |\n|              |   3   | [fastvlm_1.5b_stage3](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_1.5b_stage3.zip)  |\n| FastVLM-7B   |   2   | [fastvlm_7b_stage2](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_7b_stage2.zip)  |\n|              |   3   | [fastvlm_7b_stage3](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_7b_stage3.zip)  |\n\nTo download all the pretrained checkpoints, run the command below (note that this might take some time depending on your connection, so it might be good to grab ☕️ while you wait).\n\n```bash\nbash get_models.sh   # Files will be downloaded to the `checkpoints` directory.\n```\n\n### Usage Example\nTo run inference with a PyTorch checkpoint, use the command below:\n```bash\npython predict.py --model-path \u002Fpath\u002Fto\u002Fcheckpoint-dir \\\n                  --image-file \u002Fpath\u002Fto\u002Fimage.png \\\n                
  --prompt \"Describe the image.\"\n```\n\n### Inference on Apple Silicon\nTo run inference on Apple Silicon, PyTorch checkpoints have to be exported to a format \nsuitable for running on Apple Silicon. Detailed instructions and code can be found in the [`model_export`](model_export\u002F) subfolder.\nPlease see the README there for more details.\n\nFor convenience, we provide 3 models in an Apple Silicon compatible format: [fastvlm_0.5b_stage3](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_0.5b_stage3_llm.fp16.zip), \n[fastvlm_1.5b_stage3](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_1.5b_stage3_llm.int8.zip), \n[fastvlm_7b_stage3](https:\u002F\u002Fml-site.cdn-apple.com\u002Fdatasets\u002Ffastvlm\u002Fllava-fastvithd_7b_stage3_llm.int4.zip). \nWe encourage developers to export the model of their choice with the appropriate quantization levels following \nthe instructions in [`model_export`](model_export\u002F).\n\n### Inference on Apple Devices\nTo run inference on Apple devices like iPhone, iPad or Mac, see the [`app`](app\u002F) subfolder for more details.\n\n## Citation\nIf you find this code useful, please cite the following paper:\n```\n@InProceedings{fastvlm2025,\n  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},\n  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},\n  booktitle = {Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n  month = {June},\n  year = {2025},\n}\n```\n\n## Acknowledgements\nOur codebase is built using multiple open-source contributions; please see [ACKNOWLEDGEMENTS](ACKNOWLEDGEMENTS) for more details. 
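\n\n## License\nPlease check out the repository [LICENSE](LICENSE) before using the provided code and\n[LICENSE_MODEL](LICENSE_MODEL) for the released models.\n","FastVLM is a multimodal model with efficient vision encoding, designed to make vision language models more efficient when processing high-resolution images. Its core feature is FastViTHD, a novel hybrid vision encoder that reduces the number of output tokens and significantly shortens encoding time. Technically, it is not only far faster than comparable models (for example, the smallest variant has an 85x faster Time-to-First-Token than LLaVA-OneVision-0.5B) but also performs well in accuracy. The project also provides an iOS demo app that showcases the model's performance on mobile devices. FastVLM is particularly well suited to applications that need fast, accurate processing of combined image and text tasks, such as real-time image recognition and content generation.",2,"2026-06-11 03:39:56","high_star"]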