[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72224":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72224,"VILA","NVlabs\u002FVILA","NVlabs","VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.","",null,"Python",3817,322,43,67,0,5,12,24,15,29.53,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:00","\n\n# VILA: Optimized Vision Language Models\n\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg)](CODE_LICENSE)\n[![Model License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMODEL%20License-CC%20By%20NC%204.0-red.svg)](MODEL_LICENSE)\n[![Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002Frelease\u002Fpython-3100\u002F)\n\n[arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04468) \u002F [Demo](https:\u002F\u002Fvila.hanlab.ai\u002F) \u002F [Models](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FEfficient-Large-Model\u002Fnvila-674f8163543890b35a91b428) \u002F [Subscribe](https:\u002F\u002Fforms.gle\u002F6nf1QdPYdvC2vgxM8)\n\n## 💡 Introduction\n\nVILA is a family of open VLMs designed to optimize both efficiency and accuracy for video understanding and multi-image understanding. 
\n\n## 💡 News\n- \\[2025\u002F7\\] We release [OmniVinci](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FOmniVinci), a state-of-the-art visual-audio joint understanding omni-modal LLM built upon the VILA codebase!\n- \\[2025\u002F7\\] We release [Long-RL](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLong-RL), which supports RL training on VILA\u002FLongVILA\u002FNVILA models with long videos.\n- \\[2025\u002F6\\] We release [PS3 and VILA-HD](https:\u002F\u002Fnvlabs.github.io\u002FPS3\u002F). PS3 is a vision encoder that scales up vision pre-training to 4K resolution. VILA-HD is VILA with PS3 as the vision encoder and shows superior performance and efficiency in understanding high-resolution detail-rich images.\n- \\[2025\u002F1\\] As of January 6, 2025, VILA is part of the new Cosmos Nemotron vision language models.\n- \\[2024\u002F12\\] We release [NVILA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04468) (a.k.a. VILA2.0), which explores the full-stack efficiency of multi-modal design, achieving cheaper training, faster deployment and better performance.\n- \\[2024\u002F12\\] We release [LongVILA](.\u002Flongvila\u002FREADME.md), which supports long video understanding with a long-context VLM (more than 1M context length) and a multi-modal sequence parallel system.\n- \\[2024\u002F10\\] VILA-M3, a SOTA medical VLM fine-tuned on VILA1.5, is released! VILA-M3 significantly outperforms LLaVA-Med, is on par with Med-Gemini, and is fully open-sourced! 
[code](https:\u002F\u002Fgithub.com\u002FProject-MONAI\u002FVLM#-news) [model](https:\u002F\u002Fhuggingface.co\u002FMONAI)\n- \\[2024\u002F10\\] We release [VILA-U](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fvila-u): a Unified foundation model that integrates Video, Image, Language understanding and generation.\n- \\[2024\u002F07\\] VILA1.5 also ranks 1st place (OSS model) on [MLVU test leaderboard](https:\u002F\u002Fgithub.com\u002FJUNJIE99\u002FMLVU).\n- \\[2024\u002F06\\] VILA1.5 is now the best open sourced VLM on [MMMU leaderboard](https:\u002F\u002Fmmmu-benchmark.github.io\u002F#leaderboard) and [Video-MME](https:\u002F\u002Fvideo-mme.github.io\u002Fhome_page.html#leaderboard) leaderboard!\n- \\[2024\u002F05\\] We release VILA-1.5, which offers **video understanding capability**. VILA-1.5 comes with four model sizes: 3B\u002F8B\u002F13B\u002F40B.\n\n\u003Cdetails>\n\u003Csummary>Click to show more news\u003C\u002Fsummary>\n\n- \\[2024\u002F05\\] We release [AWQ](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00978.pdf)-quantized 4bit VILA-1.5 models. VILA-1.5 is efficiently deployable on diverse NVIDIA GPUs (A100, 4090, 4070 Laptop, Orin, Orin Nano) by [TinyChat](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftree\u002Fmain\u002Ftinychat) and [TensorRT-LLM](demo_trt_llm) backends.\n- \\[2024\u002F03\\] VILA has been accepted by CVPR 2024!\n- \\[2024\u002F02\\] We release [AWQ](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00978.pdf)-quantized 4bit VILA models, deployable on Jetson Orin and laptops through [TinyChat](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftree\u002Fmain\u002Ftinychat) and [TinyChatEngine](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002FTinyChatEngine).\n- \\[2024\u002F02\\] VILA is released. We propose interleaved image-text pretraining that enables **multi-image** VLM. VILA comes with impressive in-context learning capabilities. 
We open-source everything, including training code, evaluation code, datasets, and model checkpoints.\n- \\[2023\u002F12\\] [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.07533) is on arXiv!\n\n\u003C\u002Fdetails>\n\n## Performance\n\n### Image Benchmarks\n\n![](https:\u002F\u002Fnvlabs.github.io\u002FVILA\u002Fasset\u002Fimage_results.png)\n\n### Video Benchmarks\n\n![](https:\u002F\u002Fnvlabs.github.io\u002FVILA\u002Fasset\u002Fvideo_results.png)\n\n### Efficient Deployments\n\n![](https:\u002F\u002Fnvlabs.github.io\u002FVILA\u002Fasset\u002Fdeployment_viz.png)\n\n\u003Csup>NOTE: Measured using the [TinyChat](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftinychat) backend at batch size = 1.\u003C\u002Fsup>\n\n### Inference Performance\n\n#### Decoding Throughput (tokens\u002Fsec)\n\n| Model                       |  A100  | 4090  | Orin |\n| --------------------------- |  ----- | ----- | ---- |\n| NVILA-3B-Baseline           |  140.6 | 190.5 | 42.7 |\n| NVILA-3B-TinyChat           |  184.3 | 230.5 | 45.0 |\n| NVILA-Lite-3B-Baseline      |  142.3 | 190.0 | 41.3 |\n| NVILA-Lite-3B-TinyChat      |  186.0 | 233.9 | 44.9 |\n| NVILA-8B-Baseline           |  82.1  | 61.9  | 11.6 |\n| NVILA-8B-TinyChat           |  186.8 | 162.7 | 28.1 |\n| NVILA-Lite-8B-Baseline      |  84.0  | 62.0  | 11.6 |\n| NVILA-Lite-8B-TinyChat      |  181.8 | 167.5 | 32.8 |\n| NVILA-Video-8B-Baseline *   |  73.2  | 58.4  | 10.9 |\n| NVILA-Video-8B-TinyChat *   |  151.8 | 145.0 | 32.3 |\n\n#### TTFT (Time-To-First-Token) (sec)\n\n| Model                       |   A100  |  4090  |  Orin  |\n| --------------------------- |  ------ | ------ | ------ |\n| NVILA-3B-Baseline           |  0.0329 | 0.0269 | 0.1173 |\n| NVILA-3B-TinyChat           |  0.0260 | 0.0188 | 0.1359 |\n| NVILA-Lite-3B-Baseline      |  0.0318 | 0.0274 | 0.1195 |\n| NVILA-Lite-3B-TinyChat      |  0.0314 | 0.0191 | 0.1241 |\n| NVILA-8B-Baseline           |  0.0434 | 0.0573 | 0.4222 |\n| NVILA-8B-TinyChat          
 |  0.0452 | 0.0356 | 0.2748 |\n| NVILA-Lite-8B-Baseline      |  0.0446 | 0.0458 | 0.2507 |\n| NVILA-Lite-8B-TinyChat      |  0.0391 | 0.0297 | 0.2097 |\n| NVILA-Video-8B-Baseline *   |  0.7190 | 0.8840 | 5.8236 |\n| NVILA-Video-8B-TinyChat *   |  0.6692 | 0.6815 | 5.8425 |\n\n\u003Csup>NOTE: Measured using the [TinyChat](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftinychat) backend at batch size = 1, dynamic_s2 disabled, and num_video_frames = 64. We use W4A16 LLM and W8A8 Vision Tower for Tinychat and the baseline precision is FP16.\u003C\u002Fsup>\n\u003Csup>\\*: Measured with video captioning task. Otherwise, measured with image captioning task.\u003C\u002Fsup>\n\n## VILA Examples\n\n### Video captioning\n\nhttps:\u002F\u002Fgithub.com\u002FEfficient-Large-Model\u002FVILA\u002Fassets\u002F156256291\u002Fc9520943-2478-4f97-bc95-121d625018a6\n\nPrompt: Elaborate on the visual and narrative elements of the video in detail.\n\nCaption: The video shows a person's hands working on a white surface. They are folding a piece of fabric with a checkered pattern in shades of blue and white. The fabric is being folded into a smaller, more compact shape. The person's fingernails are painted red, and they are wearing a black and red garment. 
There are also a ruler and a pencil on the surface, suggesting that measurements and precision are involved in the process.\n\n### In context learning\n\n\u003Cimg src=\"demo_images\u002Fdemo_img_1.png\" height=\"239\">\n\u003Cimg src=\"demo_images\u002Fdemo_img_2.png\" height=\"250\">\n\n### Multi-image reasoning\n\n\u003Cimg src=\"demo_images\u002Fdemo_img_3.png\" height=\"193\">\n\n### VILA on Jetson Orin\n\nhttps:\u002F\u002Fgithub.com\u002FEfficient-Large-Model\u002FVILA\u002Fassets\u002F7783214\u002F6079374c-0787-4bc4-b9c6-e1524b4c9dc4\n\n### VILA on RTX 4090\n\nhttps:\u002F\u002Fgithub.com\u002FEfficient-Large-Model\u002FVILA\u002Fassets\u002F7783214\u002F80c47742-e873-4080-ad7d-d17c4700539f\n\n## Installation\n\n1.  Install [Anaconda Distribution](https:\u002F\u002Fwww.anaconda.com\u002Fdownload).\n2.  Install the necessary Python packages in the environment.\n\n    ```bash\n    .\u002Fenvironment_setup.sh vila\n    ```\n\n3.  (Optional) If you are an NVIDIA employee with a wandb account, install\n    onelogger and enable it by setting `training_args.use_one_logger` to `True`\n    in `llava\u002Ftrain\u002Fargs.py`.\n\n    ```bash\n    pip install --index-url=https:\u002F\u002Fsc-hw-artf.nvidia.com\u002Fartifactory\u002Fapi\u002Fpypi\u002Fhwinf-mlwfo-pypi\u002Fsimple --upgrade one-logger-utils\n    ```\n\n4.  
Activate the conda environment.\n\n    ```bash\n    conda activate vila\n    ```\n\n## Training\n\nVILA training contains three steps. For specific hyperparameters, please check out the [scripts\u002FNVILA-Lite](scripts\u002FNVILA-Lite) folder:\n\n### Step-1: Alignment\n\nWe use the LLaVA-CC3M-Pretrain-595K dataset to align the textual and visual modalities.\n\nThe stage 1 script takes two parameters and can run on a single 8xA100 node.\n\n```bash\nbash scripts\u002FNVILA-Lite\u002Falign.sh Efficient-Large-Model\u002FQwen2-VL-7B-Instruct \u003Calias to data>\n```\n\nThe trained models will be saved to `runs\u002Ftrain\u002Fnvila-8b-align`.\n\n### Step-1.5:\n\n```bash\nbash scripts\u002FNVILA-Lite\u002Fstage15.sh runs\u002Ftrain\u002Fnvila-8b-align\u002Fmodel \u003Calias to data>\n```\n\nThe trained models will be saved to `runs\u002Ftrain\u002Fnvila-8b-align-1.5`.\n\n### Step-2: Pretraining\n\nWe use the MMC4 and COYO datasets to train the VLM on interleaved image-text pairs.\n\n```bash\nbash scripts\u002FNVILA-Lite\u002Fpretrain.sh runs\u002Ftrain\u002Fnvila-8b-align-1.5 \u003Calias to data>\n```\n\nThe trained models will be saved to `runs\u002Ftrain\u002Fnvila-8b-pretraining`.\n\n### Step-3: Supervised fine-tuning\n\nThis is the last stage of VILA training, in which we tune the model to follow multimodal instructions on a subset of M3IT, FLAN and ShareGPT4V. This stage runs on an 8xA100 node.\n\n```bash\nbash scripts\u002FNVILA-Lite\u002Fsft.sh runs\u002Ftrain\u002Fnvila-8b-pretraining \u003Calias to data>\n```\n\nThe trained models will be saved to `runs\u002Ftrain\u002Fnvila-8b-SFT`.\n\n## Evaluations\n\nWe have introduced the `vila-eval` command to simplify evaluation. 
Once the data is prepared, the evaluation can be launched via:\n\n```bash\nMODEL_NAME=NVILA-15B\nMODEL_ID=Efficient-Large-Model\u002F$MODEL_NAME\nhuggingface-cli download $MODEL_ID\n\nvila-eval \\\n    --model-name $MODEL_NAME \\\n    --model-path $MODEL_ID \\\n    --conv-mode auto \\\n    --tags-include local\n```\n\nThis launches all evaluations and returns a summarized result.\n\n## Inference\n\nWe provide `vila-infer` for quick inference with user prompts and images.\n\n```bash\n# image description\nvila-infer \\\n    --model-path Efficient-Large-Model\u002FNVILA-15B \\\n    --conv-mode auto \\\n    --text \"Please describe the image\" \\\n    --media demo_images\u002Fdemo_img.png\n\n# video description\nvila-infer \\\n    --model-path Efficient-Large-Model\u002FNVILA-15B \\\n    --conv-mode auto \\\n    --text \"Please describe the video\" \\\n    --media https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEfficient-Large-Model\u002FVILA-inference-demos\u002Fresolve\u002Fmain\u002FOAI-sora-tokyo-walk.mp4\n```\n\n`vila-infer` is also compatible with VILA-1.5 models. 
For example:\n\n```bash\nvila-infer \\\n    --model-path Efficient-Large-Model\u002FVILA1.5-3b \\\n    --conv-mode vicuna_v1 \\\n    --text \"Please describe the image\" \\\n    --media demo_images\u002Fdemo_img.png\n\nvila-infer \\\n    --model-path Efficient-Large-Model\u002FVILA1.5-3b \\\n    --conv-mode vicuna_v1 \\\n    --text \"Please describe the video\" \\\n    --media https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEfficient-Large-Model\u002FVILA-inference-demos\u002Fresolve\u002Fmain\u002FOAI-sora-tokyo-walk.mp4\n\n\nvila-infer \\\n    --model-path Efficient-Large-Model\u002FNVILA-15B \\\n    --conv-mode auto \\\n    --text \"Please describe the video\" \\\n    --media https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEfficient-Large-Model\u002FVILA-inference-demos\u002Fresolve\u002Fmain\u002FOAI-sora-tokyo-walk.mp4\n```\n\n## Quantization and Deployment\n\nOur VILA models are quantized by [AWQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00978) into 4 bits for efficient inference on the edge. We provide a push-the-button [script](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Fblob\u002Fmain\u002Fscripts\u002Fnvila_example.sh) to quantize VILA with AWQ, along with [pre-quantized weights](https:\u002F\u002Fhuggingface.co\u002FEfficient-Large-Model\u002FNVILA-AWQ) so you can try them out directly.\n\n### Running VILA on desktop GPUs and edge GPUs\n\nWe support AWQ-quantized 4bit VILA on GPU platforms via [TinyChat](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftree\u002Fmain\u002Ftinychat). We provide a [tutorial](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftree\u002Fmain\u002Ftinychat#support-vlm-models-vila--llava) to run the model with TinyChat after quantization. 
We also provide [instructions](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftree\u002Fmain\u002Ftinychat\u002Fserve) for launching a Gradio server (powered by TinyChat and AWQ) that serves 4-bit quantized VILA models.\n\n### Running VILA on laptops\n\nWe further support our AWQ-quantized 4-bit VILA models on CPU platforms with both x86 and ARM architectures through [TinyChatEngine](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002FTinyChatEngine). We also provide a detailed [tutorial](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002FTinyChatEngine\u002Ftree\u002Fmain?tab=readme-ov-file#deploy-vision-language-model-vlm-chatbot-with-tinychatengine) to help users deploy VILA on different CPUs.\n\n### Running VILA API server\n\nA simple API server is provided to serve VILA models. It is built on top of [FastAPI](https:\u002F\u002Ffastapi.tiangolo.com\u002F) and [Huggingface Transformers](https:\u002F\u002Fhuggingface.co\u002Ftransformers\u002F), and can be run in either of the following ways:\n\n#### With CLI\n\n```bash\npython -W ignore server.py \\\n    --port 8000 \\\n    --model-path Efficient-Large-Model\u002FNVILA-15B \\\n    --conv-mode auto\n```\n\n#### With Docker\n\n```bash\ndocker build -t vila-server:latest .\ndocker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \\\n    -v .\u002Fhub:\u002Froot\u002F.cache\u002Fhuggingface\u002Fhub \\\n    -it --rm -p 8000:8000 \\\n    -e VILA_MODEL_PATH=Efficient-Large-Model\u002FNVILA-15B \\\n    -e VILA_CONV_MODE=auto \\\n    vila-server:latest\n```\n\nThen you can call the endpoint with the OpenAI SDK as follows:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http:\u002F\u002Flocalhost:8000\",\n    api_key=\"fake-key\",\n)\nresponse = client.chat.completions.create(\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": \"What’s in this 
image?\"},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\n                        \"url\": \"https:\u002F\u002Fblog.logomyway.com\u002Fwp-content\u002Fuploads\u002F2022\u002F01\u002FNVIDIA-logo.jpg\",\n                        # Or you can pass in a base64 encoded image\n                        # \"url\": \"data:image\u002Fpng;base64,\u003Cbase64_encoded_image>\",\n                    },\n                },\n            ],\n        }\n    ],\n    model=\"NVILA-15B\",\n)\nprint(response.choices[0].message.content)\n```\n\n\u003Csup>NOTE: This API server is intended for evaluation purposes only and has not been optimized for production use. SGLang support is on the way.\u003C\u002Fsup>\n\n## Checkpoints\n\nWe release the following models:\n\n- NVILA-8B \u002F NVILA-8B-Lite\n- NVILA-15B \u002F NVILA-15B-Lite\n\n## VILA-HD\n\nPlease refer to the `vila_hd\u002F` directory.\n\n## 🔒 License\n\n- The code is released under the Apache 2.0 license as found in the [LICENSE](.\u002FLICENSE) file.\n- The pretrained weights are released under the [CC-BY-NC-SA-4.0 license](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002Fdeed.en).\n- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:\n  - [Model License](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama\u002Fblob\u002Fmain\u002FMODEL_CARD.md) of LLaMA. 
For LLAMA3-VILA checkpoints terms of use, please refer to the [LLAMA3 License](https:\u002F\u002Fllama.meta.com\u002Fllama3\u002Flicense\u002F) for additional details.\n  - [Terms of Use](https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use) of the data generated by OpenAI\n  - [Dataset Licenses](.\u002Fdata_prepare\u002FLICENSE) for each one used during training.\n\n## Team\n\nNVILA Core contributors: [Zhijian Liu](https:\u002F\u002Fzhijianliu.com), [Ligeng Zhu](https:\u002F\u002Flzhu.me\u002F), [Baifeng Shi](https:\u002F\u002Fbfshi.github.io\u002F), [Zhuoyang Zhang](https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Zhuoyang_Zhang1), [Yuming Lou](\u003C>), [Shang Yang](https:\u002F\u002Fys-2020.github.io\u002F), [Haocheng Xi](\u003C>), [Shiyi Cao](\u003C>), [Yuxian Gu](\u003C>), [Dacheng Li](\u003C>), [Xiuyu Li](\u003C>), [Yunhao Fang](https:\u002F\u002Fseerkfang.github.io\u002F), [Yukang Chen](https:\u002F\u002Fyukangchen.com\u002F), [Cheng-Yu Hsieh](\u003C>), [De-An Huang](\u003C>), [An-Chieh Cheng](\u003C>), [Vishwesh Nath](\u003C>), [Jinyi Hu](\u003C>), [Sifei Liu](\u003C>), [Ranjay Krishna](\u003C>), [Daguang Xu](\u003C>), [Xiaolong Wang](\u003C>), [Pavlo Molchanov](https:\u002F\u002Fwww.pmolchanov.com\u002F), [Jan Kautz](https:\u002F\u002Fjankautz.com\u002F), [Hongxu Yin](https:\u002F\u002Fhongxu-yin.github.io\u002F), [Song Han](http:\u002F\u002Fsonghan.mit.edu\u002F), [Yao Lu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=OI7zFmwAAAAJ&hl=en)\n\nLongVILA contributors: [Yukang Chen](https:\u002F\u002Fyukangchen.com\u002F), [Fuzhao Xue](https:\u002F\u002Fxuefuzhao.github.io\u002F), [Dacheng Li](\u003Chttps:\u002F\u002Fdachengli1.github.io>), [Qinghao Hu](\u003Chttps:\u002F\u002Ftonyhao.xyz>), [Ligeng Zhu](https:\u002F\u002Flzhu.me\u002F), [Xiuyu Li](\u003Chttps:\u002F\u002Fxiuyuli.com>), [Yunhao Fang](https:\u002F\u002Fseerkfang.github.io\u002F), [Haotian Tang](http:\u002F\u002Fkentang.net\u002F), [Shang 
Yang](https:\u002F\u002Fys-2020.github.io\u002F), [Zhijian Liu](https:\u002F\u002Fzhijianliu.com), [Ethan He](\u003C>), [Hongxu Yin](https:\u002F\u002Fhongxu-yin.github.io\u002F), [Pavlo Molchanov](https:\u002F\u002Fwww.pmolchanov.com\u002F), [Jan Kautz](\u003Chttps:\u002F\u002Fjankautz.com>), [Linxi Fan](\u003Chttps:\u002F\u002Fjimfan.me>), [Yuke Zhu](\u003Chttps:\u002F\u002Fyukezhu.me>), [Yao Lu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=OI7zFmwAAAAJ&hl=en), [Song Han](http:\u002F\u002Fsonghan.mit.edu\u002F)\n\nVILA-HD contributors: [Baifeng Shi](https:\u002F\u002Fbfshi.github.io), [Boyi Li](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fboyilics\u002Fhome), [Han Cai](https:\u002F\u002Fhan-cai.github.io\u002F), [Yao Lu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=OI7zFmwAAAAJ&hl=en), [Sifei Liu](https:\u002F\u002Fsifeiliu.net\u002F), [Marco Pavone](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fmarco-pavone), [Jan Kautz](\u003Chttps:\u002F\u002Fjankautz.com>), [Song Han](http:\u002F\u002Fsonghan.mit.edu\u002F), [Trevor Darrell](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~trevor\u002F),  [Pavlo Molchanov](https:\u002F\u002Fwww.pmolchanov.com\u002F), [Hongxu Yin](https:\u002F\u002Fhongxu-yin.github.io\u002F)\n\n\u003Cdetails>\n\u003Csummary> VILA-1.5 contributors \u003C\u002Fsummary>\n\n[\\*Yao Lu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=OI7zFmwAAAAJ&hl=en): Nvidia, [\\*Hongxu Yin](https:\u002F\u002Fhongxu-yin.github.io\u002F): Nvidia, [\\*Ji Lin](https:\u002F\u002Fwww.linji.me\u002F): OpenAI (work done at Nvidia and MIT), [Wei Ping](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=6gKEYRgAAAAJ&hl=en): Nvidia, [Pavlo Molchanov](https:\u002F\u002Fwww.pmolchanov.com\u002F): Nvidia, [Andrew Tao](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Wel9l1wAAAAJ&hl=en): Nvidia, [Haotian Tang](http:\u002F\u002Fkentang.net\u002F): MIT, [Shang Yang](https:\u002F\u002Fys-2020.github.io\u002F): MIT, 
[Ligeng Zhu](https:\u002F\u002Flzhu.me\u002F): Nvidia, MIT, [Wei-Chen Wang](https:\u002F\u002Fweichenwang.me\u002F): MIT, [Fuzhao Xue](https:\u002F\u002Fxuefuzhao.github.io\u002F): Nvidia, NUS, [Yunhao Fang](https:\u002F\u002Fseerkfang.github.io\u002F): Nvidia, UCSD, [Yukang Chen](https:\u002F\u002Fyukangchen.com\u002F): Nvidia, [Zhuoyang Zhang](https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Zhuoyang_Zhang1): Nvidia, [Yue Shen](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fyue-james-shen\u002F): Nvidia, [Wei-Ming Chen](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=6xFvyJwAAAAJ&hl=en): Nvidia, [Huizi Mao](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=r5WezOYAAAAJ&hl=zh-CN): Nvidia, [Baifeng Shi](https:\u002F\u002Fbfshi.github.io\u002F): Nvidia, UC Berkeley, [Jan Kautz](https:\u002F\u002Fjankautz.com\u002F): Nvidia, [Mohammad Shoeybi](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=62ElavIAAAAJ&hl=en): Nvidia, [Song Han](http:\u002F\u002Fsonghan.mit.edu\u002F): Nvidia, MIT\n\n\u003C\u002Fdetails>\n\n## Citations\n\n```bibtex\n@misc{liu2024nvila,\n      title={NVILA: Efficient Frontier Visual Language Models},\n      author={Zhijian Liu and Ligeng Zhu and Baifeng Shi and Zhuoyang Zhang and Yuming Lou and Shang Yang and Haocheng Xi and Shiyi Cao and Yuxian Gu and Dacheng Li and Xiuyu Li and Yunhao Fang and Yukang Chen and Cheng-Yu Hsieh and De-An Huang and An-Chieh Cheng and Vishwesh Nath and Jinyi Hu and Sifei Liu and Ranjay Krishna and Daguang Xu and Xiaolong Wang and Pavlo Molchanov and Jan Kautz and Hongxu Yin and Song Han and Yao Lu},\n      year={2024},\n      eprint={2412.04468},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04468},\n}\n```\n```bibtex\n@article{chen2025longvila-r1,\n      title={Scaling RL to Long Videos},\n      author={Yukang Chen and Wei Huang and Baifeng Shi and Qinghao Hu and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo 
Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},\n      year={2025},\n      eprint={2507.07966},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n```bibtex\n@misc{chen2024longvila,\n      title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},\n      author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Ethan He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},\n      year={2024},\n      eprint={2408.10188},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n```bibtex\n@misc{shi2025scaling,\n      title={Scaling Vision Pre-Training to 4K Resolution}, \n      author={Baifeng Shi and Boyi Li and Han Cai and Yao Lu and Sifei Liu and Marco Pavone and Jan Kautz and Song Han and Trevor Darrell and Pavlo Molchanov and Hongxu Yin},\n      year={2025},\n      eprint={2503.19903},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19903},\n}\n```\n\n```bibtex\n@misc{lin2023vila,\n      title={VILA: On Pre-training for Visual Language Models},\n      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},\n      year={2023},\n      eprint={2312.07533},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n# Acknowledgement\n\n- [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA): the codebase we built upon. 
Thanks for their wonderful work.\n- [InternVL](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL): for open-sourcing InternViT (used in VILA1.5-40b) and the [InternVL-SFT](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat#prepare-training-datasets) data blend (inspired by LLaVA-1.6) used in all VILA1.5 models.\n- [Vicuna](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat): the amazing open-source large language model!\n- [Video-ChatGPT](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT): we borrowed the video evaluation script from this repository.\n- [MMC4](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fmmc4), [COYO-700M](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fcoyo-dataset), [M3IT](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMMInstruction\u002FM3IT), [OpenORCA\u002FFLAN](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpen-Orca\u002FFLAN), [ShareGPT4V](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer\u002Ftree\u002Fmain\u002Fprojects\u002FShareGPT4V), [WIT](google-research-datasets\u002Fwit), [GSM8K-ScRel](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002Fgsm8k-ScRel\u002Fblob\u002Fmain\u002Fdata\u002Ftrain_use.jsonl), [VisualGenome](https:\u002F\u002Fvisualgenome.org\u002Fapi\u002Fv0\u002Fapi_home.html), [VCR](https:\u002F\u002Fvisualcommonsense.com\u002Fdownload\u002F), [ScienceQA](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fderek-thomas\u002FScienceQA), [Shot2Story](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FShot2Story\u002Fblob\u002Fmaster\u002FDATA.md), [Youcook2](http:\u002F\u002Fyoucook2.eecs.umich.edu\u002F), [Vatex](https:\u002F\u002Feric-xw.github.io\u002Fvatex-website\u002Fdownload.html), [ShareGPT-Video](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FShareGPTVideo\u002Ftrain_video_and_instruction) for providing the datasets used in this 
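research.\n","VILA is a family of state-of-the-art vision language models designed for the edge, data center, and cloud, aiming to optimize efficiency and accuracy in multimodal AI tasks. The project's core capabilities include efficient video understanding and multi-image understanding, and it offers models at scales from 3B to 40B to suit different compute budgets. Technically, VILA adopts innovative vision encoders (such as PS3) to process high-resolution images, as well as designs optimized for long-video understanding (such as LongVILA). In addition, through quantization techniques such as AWQ, VILA provides good support for low-precision hardware environments. These characteristics make VILA well suited to applications that must balance high performance against deployment cost, such as intelligent surveillance, medical image analysis, and content recommendation systems.",2,"2026-06-11 03:40:57","high_star"]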