[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72321":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},72321,"llm-compressor","vllm-project\u002Fllm-compressor","vllm-project","Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM","https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Fllm-compressor",null,"Python",3382,544,30,61,0,39,79,170,117,30.21,"Apache License 2.0",false,"main",true,[27,28],"compression","quantization","2026-06-12 02:03:01","\u003Cdiv align=\"center\">\n\n\u003Ch1>\n  \u003Cimg width=\"40\" alt=\"tool icon\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ff9b86465-aefa-4625-a09b-54e158efcf96\" \u002F>\n  \u003Cspan style=\"font-size:80px;\">LLM Compressor\u003C\u002Fspan>\n\u003C\u002Fh1>\n\n[![docs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-LLM--Compressor-blue)](https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Fllm-compressor\u002Fen\u002Flatest\u002F) [![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fllmcompressor.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fllmcompressor\u002F)\n\n\u003C\u002Fdiv>\n\n`llmcompressor` is an easy-to-use library for optimizing models for deployment with vLLM, including:\n\n* Comprehensive set of quantization algorithms and transforms for weight, activation, KV Cache, and 
attention quantization\n* Seamless integration with Hugging Face models and repositories\n* Models saved in the `compressed-tensors` format, compatible with vLLM\n* DDP and disk offloading support for compressing very large models\n\n**✨ Read the announcement blog [here](https:\u002F\u002Fneuralmagic.com\u002Fblog\u002Fllm-compressor-is-here-faster-inference-with-vllm\u002F)! ✨**\n\n\u003Cp align=\"center\">\n   \u003Cimg alt=\"LLM Compressor Flow\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fadf07594-6487-48ae-af62-d9555046d51b\" width=\"80%\" \u002F>\n\u003C\u002Fp>\n\n---\n\n📊 Help us improve by taking our [1-minute user survey](https:\u002F\u002Fred.ht\u002Fllm-compressor-user-survey)\n\n💬 Join us on the [vLLM Community Slack](https:\u002F\u002Fcommunityinviter.com\u002Fapps\u002Fvllm-dev\u002Fjoin-vllm-developers-slack) and share your questions, thoughts, or ideas in:\n\n- `#sig-quantization`\n- `#llm-compressor`\n\n---\n## 🚀 What's New!\n\nBig updates have landed in LLM Compressor! To get a more in-depth look, check out the [LLM Compressor overview](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1WNkYBKv_CsrYs69lb7bJKjh2dWt8U1HXUw7Gr4Wn3gE\u002Fedit?usp=sharing).\n\nSome of the exciting new features include:\n\n* **DeepSeek-V4-Flash and Kimi-K2.6 Quantized Checkpoints**: Quantized checkpoints for DeepSeek-V4-Flash and Kimi-K2.6 have been generated by the RedHat team and posted to the HF hub. 
Consider using:\n  - [DeepSeek-V4-Flash-NVFP4-FP8](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002FDeepSeek-V4-Flash-NVFP4-FP8): 163B DeepSeek-V4-Flash quantized to NVFP4 weights with FP8 KV cache\n  - [Kimi-K2.6-NVFP4](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002FKimi-K2.6-NVFP4): Kimi-K2.6 quantized to NVFP4 (weights and activations), targeting NVIDIA Blackwell GPUs\n  - [Kimi-K2.6-FP8-BLOCK](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002FKimi-K2.6-FP8-BLOCK): 1T parameter Kimi-K2.6 quantized to FP8 block format (weights and activations), compatible with DeepGEMM FP8 kernels\n* **Qwen3.6 NVFP4 Generated Checkpoint**: An [NVFP4 quantized checkpoint](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002FQwen3.6-35B-A3B-NVFP4) has been generated by the RedHat team and posted to the HF hub. Qwen3.6 follows the same architecture as Qwen3.5, so existing LLM Compressor examples can be used for this model by swapping out the target model string.\n* **Gemma4 Support**: Gemma 4 can now be quantized using LLM Compressor. Support is available on the `main` branch and requires transformers 5.5 or later (`uv pip install \"transformers>=5.5\"`; the quotes keep the shell from treating `>` as a redirect). For models quantized and published by the RedHat team, consider using:\n  - [gemma-4-31B-it-NVFP4](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002Fgemma-4-31B-it-NVFP4)\n  - [gemma-4-31B-it-FP8-block](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002Fgemma-4-31B-it-FP8-block)\n  - [gemma-4-31B-it-FP8-Dynamic](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002Fgemma-4-31B-it-FP8-Dynamic)\n  - [gemma-4-26B-A4B-it-FP8-Dynamic](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002Fgemma-4-26B-A4B-it-FP8-Dynamic)\n  - [gemma-4-26B-A4B-it-NVFP4](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002Fgemma-4-26B-A4B-it-NVFP4)\n* **Qwen3.5 Support**: Qwen 3.5 can now be quantized using LLM Compressor. 
You will need to update your local transformers version using `uv pip install --upgrade transformers` and install LLM Compressor from source if using `\u003C0.11`. Once updated, you should be able to run examples for the [MoE](examples\u002Fquantization_w4a4_fp4\u002Fqwen3_5_example.py) and [non-MoE](examples\u002Fquantization_w4a4_fp4\u002Fqwen3_5_example.py) variants of Qwen 3.5 end-to-end. For models quantized and published by the RedHat team, consider using the [NVFP4](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002FQwen3.5-122B-A10B-NVFP4) and FP8 checkpoints for [Qwen3.5-122B](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002FQwen3.5-122B-A10B-FP8-dynamic) and [Qwen3.5-397B](https:\u002F\u002Fhuggingface.co\u002FRedHatAI\u002FQwen3.5-397B-A17B-FP8-dynamic).\n* **Updated offloading and model loading support**: Loading transformers models that are offloaded to disk and\u002For offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](docs\u002Fguides\u002Fbig_models_and_distributed\u002Fmodel_loading.md).\n* **Distributed GPTQ Support**: GPTQ now supports Distributed Data Parallel (DDP) functionality to significantly improve calibration runtime. An example using DDP with GPTQ can be found [here](examples\u002Fquantization_w4a16\u002Fllama3_ddp_example.py).\n* **Updated FP4 Microscale Support**: GPTQ now supports FP4 quantization schemes, including both [MXFP4](examples\u002Fquantization_w4a16_fp4\u002Fmxfp4\u002Fllama3_example.py) and [NVFP4](examples\u002Fquantization_w4a4_fp4\u002Fllama3_gptq_example.py). 
MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models.\n* **New Model-Free PTQ Pathway**: A new model-free PTQ pathway, [`model_free_ptq`](src\u002Fllmcompressor\u002Fentrypoints\u002Fmodel_free\u002F__init__.py#L36), has been added to LLM Compressor. This pathway lets you quantize a model without requiring a Hugging Face model definition and is especially useful in cases where `oneshot` may fail. It currently supports data-free pathways only, i.e., FP8 quantization, and was used to quantize the [Mistral Large 3 model](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FMistral-Large-3-675B-Instruct-2512). Additional [examples](examples\u002Fmodel_free_ptq) have been added illustrating how LLM Compressor can be used for Kimi K2.\n* **MXFP8 Microscale Support**: LLM Compressor now supports MXFP8 quantization via PTQ. Both W8A8 ([MXFP8](examples\u002Fquantization_w8a8_mxfp8\u002Fqwen3_example_w8a8_mxfp8.py)) and W8A16 weight-only ([MXFP8A16](examples\u002Fquantization_w8a8_mxfp8\u002Fqwen3_example_w8a16_mxfp8.py)) modes are available.\n* **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization, as well as fine-grained KV Cache quantization. Previously, only per-tensor KV cache quantization was supported. Now, you can quantize the KV cache with `per-head` scales and run the result with vLLM. 
Examples of more generalized attention and kv cache quantization can be found in the [experimental folder](experimental\u002Fattention).\n\n\n### Supported Precisions and Types\n* Activation Quantization: W8A8 (int8 and fp8), W4AFP8, Microscale (NVFP4, MXFP4, MXFP8)\n* Mixed Precision: W4A16, W8A16, MXFP8A16, MXFP4A16, NVFP4A16\n* Attention and KV Cache Quantization: FP8, NVFP4\n\n### Supported Algorithms\n* Simple PTQ\n* GPTQ\n* AWQ\n* SmoothQuant\n* AutoRound\n* Rotation-based (SpinQuant, QuIP)\n\n### Quantizing your model, step-by-step\n\nPlease refer to our [step-by-step compression guide](https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Fllm-compressor\u002Fen\u002Flatest\u002Fsteps\u002Fchoosing-model\u002F) for detailed information about selecting quantization schemes, algorithms, and their use cases.\n\nAdditional information about LLM Compressor functionality is also available in our [User Guides](https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Fllm-compressor\u002Fen\u002Flatest\u002Fguides\u002Fentrypoints\u002F)\n\n\n## Installation\n\n```bash\npip install llmcompressor\n```\n\n## Get Started\n\n### End-to-End Examples\n\nApplying quantization with `llmcompressor`:\n\n### Weight and Activation Quantization\n* [Activation quantization to `int8`](examples\u002Fquantization_w8a8_int8\u002FREADME.md)\n* [Activation quantization to `fp8`](examples\u002Fquantization_w8a8_fp8\u002FREADME.md)\n* [Activation quantization to MXFP8](examples\u002Fquantization_w8a8_mxfp8)\n* [Activation quantization to `fp4` (NVFP4)](examples\u002Fquantization_w4a4_fp4)\n* [Activation quantization to `fp4` (MXFP4)](examples\u002Fquantization_w4a4_mxfp4)\n* [Activation quantization to `fp4` using AutoRound](examples\u002Fautoround\u002Fquantization_w4a4_fp4\u002FREADME.md)\n* [Activation quantization to `fp8` and weight quantization to `int4`](examples\u002Fquantization_w4a8_fp8)\n\n### Weight Only Quantization\n* [Weight only quantization to `fp4` (NVFP4 
format)](examples\u002Fquantization_w4a16_fp4\u002Fnvfp4)\n* [Weight only quantization to `fp4` (MXFP4 format)](examples\u002Fquantization_w4a16_fp4\u002Fmxfp4)\n* [Weight only quantization to `int4` using GPTQ](examples\u002Fquantization_w4a16\u002FREADME.md)\n* [Weight only quantization to `int4` using AWQ](examples\u002Fawq\u002FREADME.md)\n* [Weight only quantization to `int4` using AutoRound](examples\u002Fautoround\u002Fquantization_w4a16\u002FREADME.md)\n\n### Attention and KV Cache Quantization\n* [KV Cache quantization to `fp8`](examples\u002Fquantization_kv_cache\u002FREADME.md)\n* [KV Cache quantization to `fp8` using per-head](examples\u002Fquantization_kv_cache\u002Fllama3_fp8_head_kv_example.py)\n* [Attention quantization to `fp8`](examples\u002Fquantization_attention\u002FREADME.md)\n* [Attention quantization to `NVFP4` with SpinQuant (experimental)](experimental\u002Fattention\u002FREADME.md)\n\n### Architecture-Specific Quantization\n* [Quantizing MoE LLMs](examples\u002Fquantizing_moe\u002FREADME.md)\n* [Quantizing Vision-Language Models](examples\u002Fmultimodal_vision\u002FREADME.md)\n* [Quantizing Audio-Language Models](examples\u002Fmultimodal_audio\u002FREADME.md)\n\n### Non-Uniform Quantization\n* [Quantizing Models Non-uniformly](examples\u002Fquantization_non_uniform\u002FREADME.md)\n\n### Big Model Quantization Support\n* [Quantizing large models with sequential onloading](examples\u002Fbig_models_with_sequential_onloading\u002FREADME.md)\n* [Quantizing large models with disk offloading](examples\u002Fdisk_offloading\u002FREADME.md)\n\n### Model-Free Definition Quantization\n* [Quantizing models without a Hugging Face model definition](examples\u002Fmodel_free_ptq\u002FREADME.md)\n\n### DDP Quantization\n* [Distributed data parallel quantization with GPTQ](examples\u002Fquantization_w4a16\u002Fllama3_ddp_example.py)\n\n\n## Quick Tour\nLet's quantize `Qwen3-30B-A3B` with FP8 weights and activations using the `Round-to-Nearest` 
algorithm.\n\nNote that the model can be swapped for a local or remote HF-compatible checkpoint and the `recipe` may be changed to target different quantization algorithms or formats.\n\n### Apply Quantization\nQuantization is applied by selecting an algorithm and calling the `oneshot` API.\n\n```python\nfrom compressed_tensors.offload import dispatch_model\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nfrom llmcompressor import oneshot\nfrom llmcompressor.modifiers.quantization import QuantizationModifier\n\nMODEL_ID = \"Qwen\u002FQwen3-30B-A3B\"\n\n# Load model.\nmodel = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=\"auto\")\ntokenizer = AutoTokenizer.from_pretrained(MODEL_ID)\n\n# Configure the quantization algorithm and scheme.\n# In this case, we:\n#   * quantize the weights to FP8 using RTN with block_size 128\n#   * quantize the activations dynamically to FP8 during inference\nrecipe = QuantizationModifier(\n    targets=\"Linear\",\n    scheme=\"FP8_BLOCK\",\n    ignore=[\"lm_head\", \"re:.*mlp.gate$\"],\n)\n\n# Apply quantization.\noneshot(model=model, recipe=recipe)\n\n# Confirm generations of the quantized model look sane.\nprint(\"========== SAMPLE GENERATION ==============\")\ndispatch_model(model)\ninput_ids = tokenizer(\"Hello my name is\", return_tensors=\"pt\").input_ids.to(\n    model.device\n)\noutput = model.generate(input_ids, max_new_tokens=20)\nprint(tokenizer.decode(output[0]))\nprint(\"==========================================\")\n\n# Save to disk in compressed-tensors format.\nSAVE_DIR = MODEL_ID.split(\"\u002F\")[1] + \"-FP8-BLOCK\"\nmodel.save_pretrained(SAVE_DIR)\ntokenizer.save_pretrained(SAVE_DIR)\n```\n\n### Inference with vLLM\n\nThe checkpoints created by `llmcompressor` can be loaded and run in `vllm`:\n\nInstall:\n\n```bash\npip install vllm\n```\n\nRun:\n\n```python\nfrom vllm import LLM\nmodel = LLM(\"Qwen\u002FQwen3-30B-A3B-FP8-BLOCK\")\noutput = model.generate(\"My name is\")\n```\n\n## Questions 
\u002F Contribution\n\n- If you have any questions or requests, open an [issue](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fllm-compressor\u002Fissues) and we will add an example or documentation.\n- We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! [Learn how here](CONTRIBUTING.md).\n\n## Citation\n\nIf you find LLM Compressor useful in your research or projects, please consider citing it:\n\n```bibtex\n@software{llmcompressor2024,\n    title={{LLM Compressor}},\n    author={Red Hat AI and vLLM Project},\n    year={2024},\n    month={8},\n    url={https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fllm-compressor},\n}\n```\n\n\n!!! warning\n    Sparse compression (2:4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and usage.\n","LLM Compressor is a library for optimizing the deployment of large language models (LLMs), aimed in particular at vLLM environments. It provides a range of compression algorithms and transforms covering weight, activation, KV cache, and attention quantization, integrates seamlessly with Hugging Face models, and saves models in the `compressed-tensors` format to ensure compatibility with vLLM. The library also supports Distributed Data Parallel (DDP) and disk offloading for handling very large models. It is well suited to scenarios that need efficient deployment of large language models to production, such as online services and cloud platforms.",2,"2026-06-11 03:41:20","high_star"]