[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72126":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":30,"discoverSource":31},72126,"nanoVLM","huggingface\u002FnanoVLM","huggingface","The simplest, fastest repository for training\u002Ffinetuning small-sized VLMs.","",null,"Python",4894,493,28,35,0,2,12,34,6,73.48,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:03","# nanoVLM\n\n![nanoVLM](assets\u002FnanoVLM.png)\n\n\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fhuggingface\u002FnanoVLM\u002Fblob\u002Fmain\u002FnanoVLM.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\"\u002F>\n\u003C\u002Fa>\n\n---\n\n> [!TIP]\n> We have written a [tutorial on nanoVLM](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fnanovlm) which will guide you through the repository and help you get started in no time.\n\n---\n\n> [!NOTE]\n> We have pushed some more breaking changes on September 9, 2025. These are all the updates to use image splitting and train on multiple nodes. This was used for the ablations of the FineVision release. Some things in the codebase regarding support scripts (e.g., the notebook or the memory evals) are probably not working anymore. The same likely applies to the older trained versions of nanoVLM (see the note below). 
If you find something that doesn't work anymore, please let us know in the Issues or submit a PR!\n\n---\n\n> [!NOTE]\n> We have pushed some breaking changes to the repository on June 4, 2025. To enable us to do smarter packing, we refactored the way image and text embeddings are combined. To keep everything as smooth as possible, we have trained a new nanoVLM-450M with this new pipeline, while leaving the old nanoVLM-222M compatible with the old pipeline. If you clone this repository now or pull the update to your local machine, the default will be the new 450M model. If you would like a simpler understanding and a simpler codebase, you can use the v0.1 release. This works out of the box with the old 222M model.\n\n---\n\nnanoVLM is the simplest repository for training\u002Ffinetuning a small-sized Vision-Language Model with a lightweight implementation in pure PyTorch. The code itself is very readable and approachable: the model consists of a Vision Backbone (`models\u002Fvision_transformer.py` ~150 lines), Language Decoder (`models\u002Flanguage_model.py` ~250 lines), Modality Projection (`models\u002Fmodality_projection.py` ~50 lines) and the VLM itself (`models\u002Fvision_language_model.py` ~100 lines) and a simple training loop (`train.py` ~200 lines).\n\nSimilar to Andrej Karpathy's nanoGPT, we wanted to equip the community with a very simple implementation and training script for Vision Language Models. We do not claim this to be a new SOTA model, rather an educational effort that packs quite a bit of punch if you have the right hardware! You should be able to tweak and play around with the code in no time.\n\n\n## What can nanoVLM do?\n\nThe model definition and training logic of this repository fit in ~750 lines, with some more boilerplate logging and parameter loading. 
\nUsing the [`SigLIP-B\u002F16-224-85M`](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-base-patch16-224) and [`HuggingFaceTB\u002FSmolLM2-135M`](https:\u002F\u002Fhuggingface.co\u002FHuggingFaceTB\u002FSmolLM2-135M) as backbones results in a **222M** nanoVLM. Training this for ~6h on a single H100 GPU on ~1.7M samples of [the cauldron](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceM4\u002Fthe_cauldron) results in an accuracy of 35.3% on MMStar.\n\n![loss](assets\u002FnanoVLM-222M-loss.png)\n\nIt is therefore a simple yet powerful platform to get started with VLMs. It is perfect for tinkering with different setups and settings and for exploring the capabilities and efficiencies of small VLMs!\n\n## Quick Start\n\nYou can either clone the repository, set up an environment and start with the scripts, or directly [open in Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fhuggingface\u002FnanoVLM\u002Fblob\u002Fmain\u002FnanoVLM.ipynb). You can also use the [interactive notebook](.\u002FnanoVLM.ipynb) to get started!\n\n\n## Environment Setup\n\nWe really like `uv` and recommend using it as your package manager. 
But feel free to use whichever you prefer.\n\nLet's first clone the repository:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002FnanoVLM.git\ncd nanoVLM\n```\n\nIf you want to use `uv`:\n```bash\nuv init --bare --python 3.12\nuv sync --python 3.12\nsource .venv\u002Fbin\u002Factivate\nuv add torch numpy torchvision pillow datasets huggingface-hub transformers wandb\n# Optional: for lmms-eval integration you have to install it from source, see section 'Evaluation with lmms-eval'\n```\n\nIf you prefer another environment manager, simply install these packages:  \n```bash\npip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb\n# Optional: for lmms-eval integration you have to install it from source, see section 'Evaluation with lmms-eval'\n\n```\nDependencies: \n- `torch` \u003C3\n- `numpy` \u003C3\n- `torchvision` for the image processors\n- `pillow` for image loading\n- `datasets` for the training datasets\n- `huggingface-hub` & `transformers` to load the pretrained backbones\n- `wandb` for logging\n\n## Training\n\nTo train nanoVLM, you can simply use the provided training script. After training, your model gets uploaded to the Hub!\n```bash\nwandb login --relogin\nhuggingface-cli login\npython train.py\n```\nwhich will use the default `models\u002Fconfig.py`.\n\n## Generate\n\nTo try a [trained model](https:\u002F\u002Fhuggingface.co\u002Flusxvr\u002FnanoVLM-450M), you can simply use the provided generate script\n```bash\npython generate.py\n```\nor, to use your own trained model, you can simply run:\n```bash\npython generate.py --checkpoint \u002Fyour\u002Fpath\u002Fto\u002Ftrained_models\n```\n\nIf we feed the example image in `assets\u002Fimage.png` with a question into the model, we get the following output. Even after only short training, the model can recognize the cat in the picture. \n```\nInput: \nImage + 'What is this?'\n\nOutputs:\nGeneration 1:  This is a cat sitting on the ground. 
I think this is a cat sitting on the ground.\nGeneration 2:  This picture is clicked outside. In the center there is a brown color cat seems to be sitting on\nGeneration 3:  This is a cat sitting on the ground, which is of white and brown in color. This cat\nGeneration 4:  This is a cat sitting on the ground. I think this is a cat sitting on the ground.\nGeneration 5:  This is a cat sitting on the ground, which is covered with a mat. I think this is\n```\n\n### Evaluation with lmms-eval\n\nnanoVLM now supports evaluation using the comprehensive [lmms-eval](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval) toolkit:\n\n```bash\n# Install lmms-eval (has to be from source)\nuv pip install git+https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval.git\n\n# Make sure you have your environment variables set correctly and you are logged in to HF\nexport HF_HOME=\"\u003CPath to HF cache>\"\nhuggingface-cli login\n\n# Evaluate a trained model on multiple benchmarks\npython evaluation.py --model lusxvr\u002FnanoVLM-450M --tasks mmstar,mme\n\n# If you want to use it during training, simply import the module and call it just as you would from the command line.\n# You can pass all the arguments you can also pass in the command line.\n# The evaluation during training works in the full DDP setup.\n# Note: the lines below are Python (for your training code), not shell commands.\nimport argparse\nfrom evaluation import cli_evaluate\nargs = argparse.Namespace(\n    model='lusxvr\u002FnanoVLM-450M', # This can be either a checkpoint path or the model itself\n    tasks='mmstar,mmmu,ocrbench',\n    batch_size=128 # Adapt this to your GPU, needs to be passed to avoid an OOM Error\n)\nresults = cli_evaluate(args)\n```\n\n## Hub integration\n\n**nanoVLM** comes with handy methods to load the model from and save it to the Hugging Face Hub.\n\n### Pretrained weights\n\nHere is how to load from a repo on the Hugging Face Hub. 
This is the recommended way to start working with the pretrained weights.\n\n```python\n# Load pretrained weights from Hub\nfrom models.vision_language_model import VisionLanguageModel\n\nmodel = VisionLanguageModel.from_pretrained(\"lusxvr\u002FnanoVLM-450M\")\n```\n\n### Push to hub\n\nOnce you've trained a **nanoVLM** model, you might want to share it on the Hugging Face Hub. You can easily do that with:\n\n```python\n... # Load and train your model\n\n# Push it to `username\u002Fmy-awesome-nanovlm-model` repo\nmodel.push_to_hub(\"my-awesome-nanovlm-model\")\n```\n\nThe model will be saved on the Hub as a config file `config.json` and a weights file `model.safetensors`. A model card `README.md` will also be generated for you with some high-level information. Feel free to update it manually to explain your work.\n\nIf the repo does not exist, it will be created for you. By default the repo will be public. You can pass `private=True` if you don't want to share publicly.\n\n\n### Local save\u002Fload\n\nIf you don't want to host your model on the Hugging Face Hub, it is still possible to save it locally:\n\n```python\n... # Load and train your model\n\n# Save it to a local folder\nmodel.save_pretrained(\"path\u002Fto\u002Flocal\u002Fmodel\")\n```\n\nYou can then reload it from the local path:\n\n```python\n# Load pretrained weights from local path\nfrom models.vision_language_model import VisionLanguageModel\n\nmodel = VisionLanguageModel.from_pretrained(\"path\u002Fto\u002Flocal\u002Fmodel\")\n```\n\n## VRAM Usage\n\nUnderstanding the VRAM requirements for training is crucial for selecting the right hardware and batch sizes. We've benchmarked the default `nanoVLM` model (222M parameters) on a single NVIDIA H100 GPU. 
Below is a summary of the peak VRAM usage observed for different batch sizes during training (including model, gradients, and optimizer states):\n\n\u003Cimg src=\"assets\u002FVRAM_Usage_vs_Batch_Size_nanoVLM.png\" width=\"600\" alt=\"VRAM Usage vs Batch Size\">\n\nHere's a breakdown of the approximate peak VRAM usage:\n\n```\nVRAM allocated after loading model to device: 871.44 MB\n--- Summary of VRAM Usage ---\nBatch Size 1: 4448.58 MB\nBatch Size 2: 4465.39 MB\nBatch Size 4: 4532.29 MB\nBatch Size 8: 5373.46 MB\nBatch Size 16: 7604.36 MB\nBatch Size 32: 12074.31 MB\nBatch Size 64: 20995.06 MB\nBatch Size 128: 38834.19 MB\nBatch Size 256: 74561.08 MB\nBatch Size 512: OOM (Peak before OOM: 80247.67 MB)\n```\n\nNote that the VRAM measurement was performed on a small setup using 'SmolLM2-135M' with a maximum input sequence length of 128 tokens. This may differ from the current default configuration in the project.\n\n**Key Takeaways:**\n- You'll need at least ~4.5 GB of VRAM to train the default model even with a batch size of 1.\n- With approximately 8 GB of VRAM, you should be able to train with a batch size of up to 16.\n\n**Measure for Your Setup:**\n\nThe values above are for the default model configuration. If you modify the model architecture (e.g., change backbones, hidden sizes) or use different sequence lengths, your VRAM requirements will change. \n\nWe provide a script `measure_vram.py` that allows you to test VRAM requirements on your specific machine and for your chosen model configuration and batch sizes. \n\nTo use it:\n1. Ensure you have a CUDA-enabled GPU and PyTorch installed.\n2. Run the script with your desired batch sizes. 
You can also specify a model checkpoint if you have one, or let it initialize a new model based on the default `VLMConfig`.\n\n```bash\n# Example: Test batch sizes 1, 2, 4, 8 with a new default model\npython measure_vram.py --batch_sizes \"1 2 4 8\"\n\n# Example: Test with a specific checkpoint and different batch sizes\npython measure_vram.py --vlm_checkpoint_path path\u002Fto\u002Fyour\u002Fmodel.pth --batch_sizes \"16 32 64\"\n\n```\n\nThis script will output the peak VRAM allocated for each batch size tested, helping you determine feasible training configurations for your hardware.\n\n\n## Contributing\n\nWe welcome contributions to nanoVLM! However, to maintain the repository's focus on simplicity and pure PyTorch, we have a few guidelines:\n\n*   **Pure PyTorch:** We aim to keep nanoVLM as a lightweight implementation in pure PyTorch. Contributions that introduce dependencies like `transformers.Trainer`, `accelerate`, or `deepspeed` will not be accepted.\n*   **New Features:** If you have an idea for a new feature, please open an issue first to discuss the scope and implementation details. This helps ensure that your contribution aligns with the project's goals.\n*   **Bug Fixes:** Feel free to submit pull requests for bug fixes.\n\n### Roadmap\n\nHere are some areas we're looking to work on in the near future. 
Contributions in these areas are particularly welcome:\n\n*   **Evaluations:** Implementing more evaluations or improving our MMStar implementation (highly valued)\n*   **Data Packing:** Implementing a way to create packs of a given size from the input data to optimize training.\n*   **Multi-GPU training:** Training on several GPUs\n*   **Multi-image support:** Training with several images\n*   **Image-splitting:** Enabling higher resolutions through image-splitting as done in SmolVLM.\n*   **VLMEvalKit:** Integration into [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit) to enable further benchmarks\n\n## Citation\n\nIf you like the project and want to use it somewhere, please use this citation:\n```\n@misc{wiedmann2025nanovlm,\n  author = {Luis Wiedmann and Aritra Roy Gosthipaty and Andrés Marafioti},\n  title = {nanoVLM},\n  year = {2025},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fhuggingface\u002FnanoVLM}}\n}\n```\n","nanoVLM is a simple, fast repository for training and fine-tuning small vision-language models. The project is implemented in pure PyTorch, and its code is readable and easy to understand. The core components are the vision backbone, the language decoder, the modality projection, and the vision-language model itself, and the entire training logic takes only about 750 lines of code. By building on predefined vision and language models, users can easily train or fine-tune the model on a single H100 GPU. The project is well suited to researchers and developers who need to experiment quickly with vision-language tasks, especially those looking for an efficient solution under limited resources.","2026-06-11 03:40:29","high_star"]