[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74313":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":15,"stars30d":15,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},74313,"MindPipe","MAC-AutoML\u002FMindPipe","MAC-AutoML","A powerful model compression framework for LLMs and LVLMs, adapted for NVIDIA GPUs and Huawei Ascend NPUs.","",null,"Python",1008,24,4,3,0,50.99,false,"main",[21,22,23,24,25,26,27,28,29,30,31,32,33,34],"automatic-compression","compression","deployment","evaluation","huawei-ascend-npus","large-language-models","large-vision-language-models","llama","llava","minicpm","nvidia-gpus","pruning","quantization","qwen","2026-06-12 04:01:14","\u003Cdiv align=\"center\">\n\n# 🧠 MindPipe\n\n**A Unified Compression & Evaluation Framework for LLMs and VLMs**\n\n[![Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10%2B-blue?logo=python&logoColor=white)](https:\u002F\u002Fpython.org)\n[![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.0%2B-EE4C2C?logo=pytorch&logoColor=white)](https:\u002F\u002Fpytorch.org)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-Transformers-orange)](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg)](LICENSE)\n[![NPU 
Ready](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNPU-Ready-purple)]()\n\n[English](README.md) | [中文](README_zh.md)\n\n\u003Cp align=\"center\">\n  \u003Cem>One CLI. 11 quantization methods. 7 pruning methods. GPU & NPU. Text & Vision.\u003C\u002Fem>\n\u003C\u002Fp>\n\n---\n\n**Quantize** · **Prune** · **Evaluate** · **Reproduce**\n\n\u003C\u002Fdiv>\n\n## ✨ Why MindPipe?\n\n> Most compression tools only handle one technique on one type of model.  \n> **MindPipe unifies them all under a single, reproducible pipeline.**\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"50%\">\n\n### 🎯 One Entrypoint, All Methods\nA single `main.py` drives quantization, pruning, combined workflows, and evaluation — no juggling scripts.\n\n### 🔀 GPU + NPU\nFirst-class support for both CUDA GPUs and Ascend NPUs with shared device abstraction.\n\n### 📊 Integrated Evaluation\nPPL, lm-eval-harness zero-shot, and VLMEvalKit multimodal benchmarks — all built in.\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\">\n\n### 🧩 Modular & Extensible\nClean registry-based architecture makes adding new algorithms straightforward.\n\n### 🔬 Reproducibility First\nJSON artifacts, batch scripts, and per-run metrics ensure every result is traceable.\n\n### 👁️ Vision-Language Native\nNot an afterthought — VLMs are first-class citizens with dedicated multimodal eval.\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n---\n\n## 🚀 Quick Start\n\n```bash\n# 1. Setup\nconda activate mindpipe\ngit submodule update --init --recursive\npip install -r requirements.txt\n\n# 2. Quantize a model (AWQ W4A16)\nCUDA_VISIBLE_DEVICES=0 python main.py \\\n  --quantization awq \\\n  --model_path \u002Fpath\u002Fto\u002Fmodel \\\n  --device_map auto \\\n  --dtype float16 \\\n  --calibration_dataset pileval \\\n  --calibration_samples 128 \\\n  --sequence_length 2048 \\\n  --weight_bits 4 \\\n  --group_size 128 \\\n  --eval_ppl true \\\n  --output_dir .\u002Fresults\u002Fawq\n\n# 3. 
Prune a model (Wanda 50% sparsity)\nCUDA_VISIBLE_DEVICES=0 python main.py \\\n  --pruning wanda \\\n  --model_path \u002Fpath\u002Fto\u002Fmodel \\\n  --device_map auto \\\n  --dtype float16 \\\n  --calibration_dataset c4 \\\n  --calibration_samples 128 \\\n  --sparsity_ratio 0.5 \\\n  --eval_ppl true \\\n  --output_dir .\u002Fresults\u002Fwanda\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>📋 More Examples (Click to Expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n### Full-Precision Evaluation\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python main.py \\\n  --model_path \u002Fpath\u002Fto\u002Fmodel \\\n  --device_map auto \\\n  --dtype float16 \\\n  --attn_implementation sdpa \\\n  --evaluation_dataset wikitext2 \\\n  --sequence_length 2048 \\\n  --batch_size 1 \\\n  --max_eval_chunks 64 \\\n  --eval_ppl true \\\n  --eval_zero_shot true \\\n  --zero_shot_tasks boolq piqa rte winogrande arc_easy arc_challenge openbookqa \\\n  --zero_shot_num_fewshot 0 \\\n  --zero_shot_batch_size 1 \\\n  --output_dir .\u002Fresults\u002Ffp_eval\n```\n\n### GPTQ Quantization\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python main.py \\\n  --quantization gptq \\\n  --model_path \u002Fpath\u002Fto\u002Fmodel \\\n  --device_map auto \\\n  --dtype float16 \\\n  --attn_implementation sdpa \\\n  --calibration_dataset pileval \\\n  --evaluation_dataset wikitext2 \\\n  --calibration_samples 128 \\\n  --sequence_length 2048 \\\n  --weight_bits 4 \\\n  --activation_bits 16 \\\n  --group_size 128 \\\n  --weight_group_size 128 \\\n  --eval_ppl true \\\n  --output_dir .\u002Fresults\u002Fgptq\n```\n\n### Pruning + Quantization Pipeline\n\n```bash\nCUDA_VISIBLE_DEVICES=0,1 python main.py \\\n  --pruning wanda_sp \\\n  --quantization gptq \\\n  --execution_order pruning_then_quantization \\\n  --model_path \u002Fpath\u002Fto\u002Fmodel \\\n  --device_map auto \\\n  --dtype float16 \\\n  --attn_implementation sdpa \\\n  --calibration_dataset c4 \\\n  --calibration_samples 128 \\\n  --sequence_length 2048 \\\n  
--sparsity_ratio 0.2 \\\n  --weight_bits 4 \\\n  --group_size 128 \\\n  --eval_ppl true \\\n  --output_dir .\u002Fresults\u002Fworkflow\n```\n\n### VLM Multimodal Evaluation\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python main.py \\\n  --model_path \u002Fpath\u002Fto\u002Fvlm \\\n  --device_map auto \\\n  --dtype float16 \\\n  --attn_implementation sdpa \\\n  --eval_ppl false \\\n  --eval_zero_shot false \\\n  --eval_vlm true \\\n  --vlm_datasets OCRBench TextVQA_VAL ChartQA_TEST InfoVQA_VAL \\\n  --vlm_mode all \\\n  --vlm_api_nproc 1 \\\n  --vlm_eval_kit_root \u002Fpath\u002Fto\u002FVLMEvalKit \\\n  --output_dir .\u002Fresults\u002Fvlm_eval\n```\n\n\u003C\u002Fdetails>\n\n---\n\n## 📦 Supported Algorithms\n\nUse the method identifiers in the CLI column as command-line values. Display\nnames such as QA-LoRA, LLM-Pruner, and Wanda-SP are descriptive; the actual CLI\nvalues are `qalora`, `llm_pruner`, and `wanda_sp`.\n\n### Quantization (11 Methods)\n\n| Method | CLI | Family | Technique | NPU |\n|:------:|:---:|:------:|:---------:|:---:|\n| **AWQ** | `awq` | PTQ | Weight-only with activation-aware scaling | ✅ |\n| **GPTQ** | `gptq` | PTQ | Weight-only GPTQ quantization | ✅ |\n| **MQuant** | `mquant` | PTQ | Multimodal GPTQ\u002FAWQ for language & visual branches | ⏳ |\n| **OmniQuant** | `omniquant` | PTQ | Learnable weight & activation transformation | ✅ |\n| **QuaRot** | `quarot` | PTQ | Rotation-based W\u002FA\u002FKV quantization | ⏳ |\n| **SmoothQuant** | `smoothquant` | PTQ | Activation smoothing for W\u002FA quantization | ✅ |\n| **SpinQuant** | `spinquant` | PTQ | Rotation-based W\u002FA\u002FKV with SpinQuant hooks | ⏳ |\n| **FlatQuant** | `flatquant` | QAT | Trainable transformations | ✅ |\n| **QLoRA** | `qlora` | QAT | Low-bit fake-quant adapter training | ✅ |\n| **QA-LoRA** | `qalora` | QAT | Group-pooled adapter training | 🔶 |\n| **SplitQuant** | `splitquant` | QAT | SplitQuant-style trainable transformations | ✅ |\n\n### Pruning (7 Methods)\n\n| Method | CLI 
| Type | Calibration | NPU |\n|:------:|:---:|:----:|:-----------:|:---:|\n| **ALPS** | `alps` | Unstructured \u002F n:m | `c4` | ✅ |\n| **FLAP** | `flap` | Structured | `wikitext2` | ✅ |\n| **LLM-Pruner** | `llm_pruner` | Structured | `c4` | ✅ |\n| **ShortGPT** | `shortgpt` | Layer pruning | `pg19` | ✅ |\n| **SparseGPT** | `sparsegpt` | Unstructured \u002F n:m | `c4` | ✅ |\n| **Wanda** | `wanda` | Unstructured \u002F n:m | `c4` | ✅ |\n| **Wanda-SP** | `wanda_sp` | Structured | `c4` | ✅ |\n\n> ✅ Ready &nbsp;|&nbsp; ⏳ In Progress &nbsp;|&nbsp; 🔶 CUDA Only\n\n---\n\n## 🏗️ Architecture\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│                         main.py (CLI)                           │\n├─────────────────────────────────────────────────────────────────┤\n│                    workflow\u002F (Config + Executor)                │\n├──────────────────┬──────────────────┬───────────────────────────┤\n│   Quantization   │     Pruning      │       Evaluation          │\n│  ┌────────────┐  │  ┌────────────┐  │  ┌─────────────────────┐  │\n│  │ PTQ  (7)   │  │  │Structured  │  │  │ PPL (wikitext2\u002Fc4)  │  │\n│  │ QAT  (4)   │  │  │Unstructured│  │  │ Zero-shot (lm-eval) │  │\n│  └────────────┘  │  │Layer Prune │  │  │ VLM (VLMEvalKit)    │  │\n│                  │  └────────────┘  │  └─────────────────────┘  │\n├──────────────────┴──────────────────┴───────────────────────────┤\n│              algorithm\u002Fcommon\u002F (Shared Infrastructure)          │\n│     Model Loading · Data · Device (GPU\u002FNPU) · IO · Metrics      │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n### Repository Layout\n\n```\nMindPipe\u002F\n├── main.py                         # Unified CLI entrypoint\n├── algorithm\u002F\n│   ├── common\u002F                     # Shared model, data, device, IO utilities\n│   ├── quantization\u002F\n│   │   ├── ptq\u002F                    # AWQ, GPTQ, MQuant, OmniQuant, QuaRot, 
SmoothQuant, SpinQuant\n│   │   └── qat\u002F                    # FlatQuant, QLoRA, QA-LoRA, SplitQuant\n│   └── pruning\u002F\n│       ├── structured\u002F             # FLAP, LLM-Pruner, ShortGPT, Wanda-SP\n│       └── unstructured\u002F           # ALPS, SparseGPT, Wanda\n├── workflow\u002F                       # CLI config builder and stage executor\n├── evaluation\u002F                     # PPL, lm-eval-harness, and VLMEvalKit runners\n├── configs\u002F                        # Shared and algorithm-specific configs\n├── scripts\u002F                        # Batch and reproducibility scripts\n└── third_party\u002F                    # Optional external evaluation tools\n```\n\n---\n\n## 🤖 Model Coverage\n\n\u003Ctable>\n\u003Ctr>\u003Ctd>\n\n| Model Family | Text | Vision |\n|:------------|:----:|:------:|\n| LLaMA-2 \u002F LLaMA-3 | ✅ | — |\n| Qwen2.5 | ✅ | — |\n| Qwen3 | ✅ | — |\n| Qwen3.5 | ✅ | — |\n\n\u003C\u002Ftd>\u003Ctd>\n\n| Model Family | Text | Vision |\n|:------------|:----:|:------:|\n| Qwen2-VL | ✅ | ✅ |\n| Qwen2.5-VL | ✅ | ✅ |\n| Qwen3-VL | ✅ | ✅ |\n| MiniCPM-V | ✅ | ✅ |\n| LLaVA \u002F InternVL | ✅ | 🔶 |\n\n\u003C\u002Ftd>\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n> **Note:** Model support is algorithm-dependent. 
Check `algorithm\u002Fquantization\u002F*\u002F*\u002Fmethod.py` or `algorithm\u002Fpruning\u002F*\u002F*\u002Fmethod.py` for exact coverage.\n\n---\n\n## 📖 Configuration Reference\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>⚙️ Common Arguments\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Argument | Default | Description |\n|:---------|:--------|:------------|\n| `--model_path` | Required | Local or Hugging Face model path |\n| `--device` | `auto` | Logical device used by runtime helpers |\n| `--device_map` | `None` | Required for pruning\u002Fquantization (`auto` recommended) |\n| `--dtype` | `bfloat16` | `auto`, `float16`, or `bfloat16` |\n| `--attn_implementation` | `flash_attention_2` | `flash_attention_2`, `sdpa`, or `eager` |\n| `--calibration_dataset` | Method default | `wikitext2`, `c4`, `pileval`, `pg19`, or `bookcorpus` |\n| `--evaluation_dataset` | `wikitext2` | Dataset used for PPL evaluation |\n| `--calibration_samples` | `128` | Number of calibration samples |\n| `--sequence_length` | `2048` | Sequence length for calibration and evaluation |\n| `--batch_size` | `1` | PPL batch size |\n| `--max_eval_chunks` | `64` | Optional cap for PPL chunks |\n| `--eval_ppl` | `false` | Enable perplexity evaluation |\n| `--eval_zero_shot` | `false` | Enable lm-eval-harness tasks |\n| `--eval_vlm` | `false` | Enable VLMEvalKit evaluation |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>🔢 Quantization Arguments\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Argument | Default | Description |\n|:---------|:--------|:------------|\n| `--quantization` | `None` | One of the registered quantization methods |\n| `--weight_bits` | `4` | Weight quantization bit width |\n| `--activation_bits` | `16` | Activation quantization bit width |\n| `--query_bits` | `16` | Query activation bit width |\n| `--key_bits` | `16` | Key cache bit width |\n| `--value_bits` | `16` | Value cache bit width |\n| `--group_size` | `128` | Default group size |\n| `--weight_group_size` | `None` | 
Overrides weight group size |\n| `--activation_group_size` | `None` | Overrides activation group size |\n| `--kv_group_size` | `None` | Overrides KV group size |\n| `--weight_method` | `gptq` | Weight method for methods supporting GPTQ\u002FRTN |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>✂️ Pruning Arguments\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Argument | Default | Description |\n|:---------|:--------|:------------|\n| `--pruning` | `None` | One of the registered pruning methods |\n| `--sparsity_ratio` | `0.5` | Target sparsity ratio |\n| `--structure_pattern` | `unstructured` | `unstructured`, `2:4`, or `4:8` |\n| `--block_size` | `128` | Block size for supported pruning methods |\n| `--damp_percent` | `0.01` | Hessian damping ratio for second-order methods |\n\n\u003C\u002Fdetails>\n\n---\n\n## 🔧 Installation\n\n### Prerequisites\n\n- Python 3.10+\n- PyTorch 2.0+\n- CUDA 11.8+ (for GPU) or Ascend CANN (for NPU)\n\n### Setup\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002FMAC-AutoML\u002FMindPipe.git\ncd MindPipe\n\n# Create environment\nconda create -n mindpipe python=3.10 -y\nconda activate mindpipe\n\n# Install dependencies\ngit submodule update --init --recursive\npip install -r requirements.txt\n```\n\n### Optional: VLMEvalKit\n\nFor multimodal evaluation, initialize the VLMEvalKit submodule or set `VLMEVALKIT_ROOT`:\n\n```bash\ngit submodule update --init third_party\u002FVLMEvalKit\n# or\nexport VLMEVALKIT_ROOT=\u002Fpath\u002Fto\u002Fexisting\u002FVLMEvalKit\n```\n\n---\n\n## 📈 Reproducibility\n\nThe `scripts\u002Frepro\u002F` directory contains ready-to-use benchmark launchers:\n\n```bash\n# Dry run (print commands without executing)\nDRY_RUN=true bash scripts\u002Frepro\u002Frun_qlora_adapted_models_text_suite.sh\n\n# Filter specific models\nMODEL_FILTER=qwen3 bash scripts\u002Frepro\u002Frun_mquantpp_awq_vlm_serial_suite.sh\n```\n\nAvailable scripts include:\n- 
`run_qlora_adapted_models_text_suite.sh`\n- `run_qalora_adapted_models_text_suite.sh`\n- `run_mquantpp_awq_vlm_serial_suite.sh`\n- `run_qwen2_5_vl_gptq_vlm_suite.sh`\n- `run_qwen3_vl_2b_gptq_suite.sh`\n\n---\n\n## 📂 Output Structure\n\n```\nresults\u002F\n├── \u003Cmodel>\u002F\u003Calgorithm>\u002F\u003Crun_spec>\u002F\n│   ├── metrics.json        # Evaluation results & run metadata\n│   └── artifacts.json      # Algorithm details, calibration settings, checkpoints\n└── \u003Cmodel>\u002F\u003Cexecution_order>\u002F\u003Calgorithm1>__\u003Calgorithm2>\u002F\u003Crun_spec>\u002F\n    └── metrics.json\n```\n\n---\n\n## ⚠️ Known Limitations\n\n| Limitation | Status |\n|:-----------|:------:|\n| QuaRot \u002F SpinQuant not NPU-ready | ⏳ |\n| MQuant GPU-only | ⏳ |\n| QA-LoRA CUDA-only, no AutoGPTQ export | 🔶 |\n| QLoRA W2\u002FW3 use fake-quant fallback on NPU | ℹ️ |\n| Custom runtime wrapper reload is method-dependent | ℹ️ |\n\n---\n\n## 📜 Citation & Acknowledgements\n\nMindPipe builds upon the following outstanding research. 
Please cite the original papers when using their methods:\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Click to see referenced works\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **AWQ** — Activation-aware Weight Quantization\n- **GPTQ** — Accurate Post-Training Quantization for Generative Pre-trained Transformers\n- **QuaRot** — Outlier-Free Quantization via Rotations\n- **SpinQuant** — Rotation-Based Quantization\n- **FlatQuant** — Flatness-Aware Quantization\n- **SmoothQuant** — Accurate and Efficient Post-Training Quantization\n- **OmniQuant** — Omnidirectionally Calibrated Quantization\n- **SplitQuant** — Split Quantization\n- **QLoRA** — Efficient Finetuning of Quantized LLMs\n- **QA-LoRA** — Quantization-Aware Low-Rank Adaptation\n- **Wanda** — Pruning by Weights and Activations\n- **SparseGPT** — Massive Language Models Can Be Accurately Pruned in One-Shot\n- **FLAP** — Fluctuation-based Adaptive Structured Pruning\n- **ShortGPT** — Layers in LLMs are More Redundant Than You Expect\n- **LLM-Pruner** — On the Structural Pruning of Large Language Models\n- **ALPS** — Adaptive Layer-wise Pruning and Sparsification\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Cdiv align=\"center\">\n\n**⭐ If you find MindPipe useful, please consider giving it a star!**\n\n*Built with ❤️ for the model compression community*\n\n\u003C\u002Fdiv>\n","MindPipe is a powerful model compression framework for large language models (LLMs) and vision-language models (VLMs) that supports NVIDIA GPUs and Huawei Ascend NPUs. Its core features are quantization, pruning, and evaluation, with 11 quantization methods and 7 pruning methods exposed through a unified command-line interface. The framework is built on Python and PyTorch, uses a modular design that is easy to extend, and emphasizes reproducible experimental results. It includes several built-in evaluation benchmarks, such as PPL, lm-eval-harness zero-shot tests, and VLMEvalKit multimodal evaluation. MindPipe is particularly suited to scenarios that need to reduce compute resource consumption while preserving model performance, such as edge-device deployment or data-center optimization.",2,"2026-06-11 03:49:56","high_star"]