[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79102":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},79102,"LightVLM","cortsdine\u002FLightVLM","cortsdine","Efficient inference toolkit for vision-language models: KV-cache compression, INT4\u002FINT8 quantization, and visual token pruning.",null,"Python",222,6467,6,0,180,10,"Other",false,"main",[],"2026-06-12 02:03:49","# LightVLM 🔦\n\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue?style=flat)](https:\u002F\u002Fwww.python.org\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-green?style=flat)](LICENSE)\n[![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.1%2B-ee4c2c?style=flat&logo=pytorch&logoColor=white)](https:\u002F\u002Fpytorch.org\u002F)\n[![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcortsdine\u002FLightVLM?style=flat&logo=github)](https:\u002F\u002Fgithub.com\u002Fcortsdine\u002FLightVLM)\n[![CI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCI-passing-brightgreen?style=flat)](https:\u002F\u002Fgithub.com\u002Fcortsdine\u002FLightVLM)\n\nLightVLM is a small, hackable inference toolkit for vision-language models. 
It bundles\nKV-cache compression, low-bit weight quantization, and visual token pruning behind a\nsingle Python API so you can squeeze 7B-class VLMs onto a single consumer GPU without\ngiving up much quality.\n\n## Features\n\n- **KV cache compression** — H2O- and StreamingLLM-style eviction policies for long\n  multi-turn chats with images.\n- **INT8 \u002F INT4 quantization** — SmoothQuant-style activation smoothing for INT8 and a\n  GPTQ-flavored INT4 path for weights, built on top of `bitsandbytes`.\n- **Visual token pruning** — FastV-style attention-based pruning that drops redundant\n  image tokens after the first few layers.\n- **Throughput benchmarks** — reproducible scripts for tok\u002Fs and peak GPU memory across\n  precisions and batch sizes.\n- **Pluggable model wrappers** — LLaVA and Qwen-VL out of the box; add new models with a\n  small registry entry.\n\n## Why LightVLM\n\nMost open VLM checkpoints are released as full-precision PyTorch modules that assume you\nhave an A100 lying around. In practice a lot of useful work — captioning, OCR, visual\nQA — happens on a single 24 GB consumer card or a shared lab box. The optimization\npieces (quantization, KV compression, token pruning) exist in scattered research repos\nthat don't talk to each other.\n\nLightVLM is an attempt to glue those pieces together with a stable Python API and a\nsmall CLI, so that turning on INT4 weights + KV eviction + visual token pruning is one\nconfig block instead of three forks. 
It is research-quality code, not a production\nserving stack, but it should be readable end to end.\n\n## Installation\n\n```bash\npip install lightvlm\n# or with optional extras\npip install \"lightvlm[quant,bench]\"\n```\n\nFrom source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fcortsdine\u002FLightVLM.git\ncd LightVLM\npip install -e \".[quant,bench]\"\n```\n\nCUDA 12.1 and PyTorch 2.1+ are recommended.\n\n## Quick Start\n\n```python\nfrom lightvlm import load_model\nfrom lightvlm.quant import QuantConfig\nfrom lightvlm.kv_cache import CompressorConfig\n\nmodel = load_model(\n    \"llava-hf\u002Fllava-1.5-7b-hf\",\n    quant=QuantConfig(bits=4, group_size=128),\n    kv=CompressorConfig(policy=\"h2o\", budget=512),\n    device=\"cuda\",\n)\n\nout = model.generate(\n    image=\"examples\u002Fcat.jpg\",\n    prompt=\"What is unusual about this picture?\",\n    max_new_tokens=128,\n)\nprint(out.text)\n```\n\n## Supported Models\n\n| Family   | Checkpoint                              | INT8 | INT4 | KV Compress | Visual Prune |\n|----------|-----------------------------------------|------|------|-------------|--------------|\n| LLaVA    | `llava-hf\u002Fllava-1.5-7b-hf`              | yes  | yes  | yes         | yes          |\n| LLaVA    | `llava-hf\u002Fllava-1.5-13b-hf`             | yes  | yes  | yes         | yes          |\n| Qwen-VL  | `Qwen\u002FQwen-VL-Chat`                     | yes  | yes  | yes         | partial      |\n| Qwen2-VL | `Qwen\u002FQwen2-VL-7B-Instruct`             | yes  | yes  | yes         | yes          |\n\n## Benchmarks\n\nMeasured on a single RTX 4090 (24 GB), CUDA 12.1, batch size 1, 512 prompt tokens,\n128 generated tokens. 
Numbers are illustrative — your mileage will vary.\n\n| Model           | Precision | Tokens\u002Fs | Peak GPU Mem |\n|-----------------|-----------|----------|--------------|\n| LLaVA-1.5-7B    | fp16      | 38.2     | 15.1 GB      |\n| LLaVA-1.5-7B    | int8      | 51.7     | 9.6 GB       |\n| LLaVA-1.5-7B    | int4      | 63.4     | 6.8 GB       |\n| LLaVA-1.5-7B    | int4+kv   | 71.9     | 6.2 GB       |\n| Qwen2-VL-7B     | int4      | 58.1     | 7.4 GB       |\n\nReproduce with:\n\n```bash\npython benchmarks\u002Frun_throughput.py --config benchmarks\u002Fconfigs.yaml\n```\n\n## CLI\n\n```bash\n# quick generation\nlightvlm generate --model llava-hf\u002Fllava-1.5-7b-hf \\\n    --quant int4 --image cat.jpg --prompt \"describe\"\n\n# offline calibration for INT8 SmoothQuant\nlightvlm calibrate --model llava-hf\u002Fllava-1.5-7b-hf \\\n    --dataset coco --num-samples 128 --out smooth.pt\n\n# benchmark\nlightvlm bench --config benchmarks\u002Fconfigs.yaml\n```\n\n## Contributing\n\nBug reports and patches are very welcome. Please open an issue before sending a large\nchange so we can discuss the design. Run `pytest -q` and `ruff check .` before pushing.\n\n## Acknowledgments\n\nLightVLM stands on the shoulders of PyTorch, Hugging Face `transformers`, and\n`bitsandbytes`, as well as the open research on H2O, StreamingLLM, SmoothQuant, GPTQ,\nand FastV. Thanks to everyone publishing reproducible code.\n\n## License\n\nApache License 2.0 — see [LICENSE](LICENSE).\n","LightVLM is an efficient inference toolkit for vision-language models that supports KV-cache compression, INT4\u002FINT8 quantization, and visual token pruning. Its core features include H2O- and StreamingLLM-style KV-cache compression policies, low-bit quantization based on SmoothQuant and GPTQ, and FastV-style attention-driven token pruning. These optimizations are exposed through a unified Python API, allowing 7B-parameter-class vision-language models to run on a single consumer GPU without significantly sacrificing performance. The toolkit suits image captioning, OCR, visual question answering, and similar tasks, and is especially useful for researchers and developers with limited compute, such as a single 24 GB consumer graphics card.",2,"2026-06-01 03:48:04","CREATED_QUERY"]