[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71187":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":15,"starSnapshotCount":15,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},71187,"AutoGPTQ","AutoGPTQ\u002FAutoGPTQ","An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.","",null,"Python",5067,543,3,241,0,4,9,67.11,"MIT License",true,false,"main",[24,25,26,27,28,29,30,31,32],"deep-learning","inference","large-language-models","llms","nlp","pytorch","quantization","transformer","transformers","2026-06-12 04:00:59","\u003Ch1 align=\"center\"> 🚨 AutoGPTQ is unmaintained - we suggest using \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FModelCloud\u002FGPTQModel\">GPTQModel\u003C\u002Fa> for bug fixes and new models support 🚨 \n\u003C\u002Fh1>\n\n\u003Ch1 align=\"center\">AutoGPTQ\u003C\u002Fh1>\n\u003Cp align=\"center\">An easy-to-use LLM quantization package with user-friendly APIs, based on GPTQ algorithm (weight-only quantization).\u003C\u002Fp>\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ\u002Freleases\">\n        \u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002FPanQiWei\u002FAutoGPTQ.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fauto-gptq\u002F\">\n        \u003Cimg alt=\"PyPI - Downloads\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdd\u002Fauto-gptq\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n\n## News or Update\n\n- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with [Marlin](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin) int4*fp16 matrix multiplication kernel support, enabled with the argument `use_marlin=True` when loading models.\n- 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated `auto-gptq`, so running and training GPTQ models is now more accessible to everyone! See [this blog](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fgptq-integration) and its resources for more details!\n\n*For earlier news, see [here](docs\u002FNEWS_OR_UPDATE.md).*\n\n## Performance Comparison\n\n### Inference Speed\n> The results are generated using [this script](examples\u002Fbenchmark\u002Fgeneration_speed.py): the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens\u002Fs (the larger, the better).\n>\n> The quantized model is loaded using the setup that gives the fastest inference speed.\n\n| model         | GPU           | num_beams | fp16  | gptq-int4 |\n|---------------|---------------|-----------|-------|-----------|\n| llama-7b      | 1xA100-40G    | 1         | 18.87 | 25.53     |\n| llama-7b      | 1xA100-40G    | 4         | 68.79 | 91.30     |\n| moss-moon 16b | 1xA100-40G    | 1         | 12.48 | 15.25     |\n| moss-moon 16b | 1xA100-40G    | 4         | OOM   | 42.67     |\n| moss-moon 16b | 2xA100-40G    | 1         | 06.83 | 06.78     |\n| moss-moon 16b | 2xA100-40G    | 4         | 13.10 | 10.80     |\n| gpt-j 6b      | 1xRTX3060-12G | 1         | OOM   | 29.55     |\n| gpt-j 6b      | 1xRTX3060-12G | 4         | OOM   | 47.36     |\n\n\n### Perplexity\nFor a perplexity comparison, see [here](https:\u002F\u002Fgithub.com\u002Fqwopqwop200\u002FGPTQ-for-LLaMa#result) and 
[here](https:\u002F\u002Fgithub.com\u002Fqwopqwop200\u002FGPTQ-for-LLaMa#gptq-vs-bitsandbytes)\n\n## Installation\n\nAutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:\n\n| Platform version | Installation                                                                                      | Built against PyTorch |\n|-------------------|---------------------------------------------------------------------------------------------------|-----------------------|\n| CUDA 11.8         | `pip install auto-gptq --no-build-isolation --extra-index-url https:\u002F\u002Fhuggingface.github.io\u002Fautogptq-index\u002Fwhl\u002Fcu118\u002F`   | 2.2.1+cu118           |\n| CUDA 12.1         | `pip install auto-gptq --no-build-isolation`                                                                            | 2.2.1+cu121           |\n| ROCm 5.7          | `pip install auto-gptq --no-build-isolation --extra-index-url https:\u002F\u002Fhuggingface.github.io\u002Fautogptq-index\u002Fwhl\u002Frocm573\u002F` | 2.2.1+rocm5.7         |\n\nTo use the Triton backend, install AutoGPTQ with the Triton dependency: `pip install auto-gptq[triton] --no-build-isolation`. The Triton backend currently supports Linux only and does not support 3-bit quantization.\n\nFor older AutoGPTQ versions, please refer to [the previous releases installation table](docs\u002FINSTALLATION.md).\n\nOn NVIDIA systems, AutoGPTQ does not support [Maxwell or lower](https:\u002F\u002Fqiita.com\u002Fuyuni\u002Fitems\u002F733a93b975b524f89f46) GPUs.\n\n### Install from source\n\nClone the source code:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ.git && cd AutoGPTQ\n```\n\nA few packages are required in order to build from source: `pip install numpy gekko pandas`.\n\nThen, install locally from source:\n```bash\npip install -vvv --no-build-isolation -e .\n```\nYou can set `BUILD_CUDA_EXT=0` to disable PyTorch extension building, 
but this is **strongly discouraged** as AutoGPTQ then falls back on a slow Python implementation.\n\nAs a last resort, if the above command fails, you can try `python setup.py install`.\n\n#### On ROCm systems\n\nTo install from source for AMD GPUs supporting ROCm, please specify the `ROCM_VERSION` environment variable. Example:\n\n```bash\nROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .\n```\n\nThe compilation can be sped up by specifying the `PYTORCH_ROCM_ARCH` variable ([reference](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002F7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba\u002Fsetup.py#L132)) in order to build for a single target device, for example `gfx90a` for MI200 series devices.\n\nFor ROCm systems, the packages `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev` are required to build.\n\n#### On Intel® Gaudi® 2 systems\n\n>Notice: make sure you are on commit 65c2e15 or later.\n\nTo install from source for Intel Gaudi 2 HPUs, set the `BUILD_CUDA_EXT=0` environment variable to disable building the CUDA PyTorch extension. 
Example:\n\n```bash\nBUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .\n```\n\n>Notice that Intel Gaudi 2 uses an optimized kernel upon inference, and requires `BUILD_CUDA_EXT=0` on non-CUDA machines.\n\n## Quick Tour\n\n### Quantization and Inference\n> warning: this is just a showcase of the basic APIs in AutoGPTQ. It uses only one sample to quantize a very small model, so the quality of the quantized model may be poor.\n\nBelow is the simplest example of using `auto_gptq` to quantize a model and run inference after quantization:\n```python\nfrom transformers import AutoTokenizer, TextGenerationPipeline\nfrom auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig\nimport logging\n\nlogging.basicConfig(\n    format=\"%(asctime)s %(levelname)s [%(name)s] %(message)s\", level=logging.INFO, datefmt=\"%Y-%m-%d %H:%M:%S\"\n)\n\npretrained_model_dir = \"facebook\u002Fopt-125m\"\nquantized_model_dir = \"opt-125m-4bit\"\n\ntokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)\nexamples = [\n    tokenizer(\n        \"auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.\"\n    )\n]\n\nquantize_config = BaseQuantizeConfig(\n    bits=4,  # quantize model to 4-bit\n    group_size=128,  # it is recommended to set the value to 128\n    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse\n)\n\n# load the un-quantized model; by default, the model is always loaded into CPU memory\nmodel = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)\n\n# quantize the model; examples should be a list of dicts whose only keys are \"input_ids\" and \"attention_mask\"\nmodel.quantize(examples)\n\n# save quantized model\nmodel.save_quantized(quantized_model_dir)\n\n# save quantized model using safetensors\nmodel.save_quantized(quantized_model_dir, use_safetensors=True)\n\n# push quantized model to Hugging 
Face Hub.\n# to use use_auth_token=True, log in first via huggingface-cli login,\n# or pass an explicit token with: use_auth_token=\"hf_xxxxxxx\"\n# (uncomment the following three lines to enable this feature)\n# repo_id = f\"YourUserName\u002F{quantized_model_dir}\"\n# commit_message = f\"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}\"\n# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)\n\n# alternatively you can save and push at the same time\n# (uncomment the following three lines to enable this feature)\n# repo_id = f\"YourUserName\u002F{quantized_model_dir}\"\n# commit_message = f\"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}\"\n# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)\n\n# load quantized model to the first GPU\nmodel = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=\"cuda:0\")\n\n# download quantized model from Hugging Face Hub and load to the first GPU\n# model = AutoGPTQForCausalLM.from_quantized(repo_id, device=\"cuda:0\", use_safetensors=True, use_triton=False)\n\n# inference with model.generate\nprint(tokenizer.decode(model.generate(**tokenizer(\"auto_gptq is\", return_tensors=\"pt\").to(model.device))[0]))\n\n# or you can also use pipeline\npipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)\nprint(pipeline(\"auto-gptq is\")[0][\"generated_text\"])\n```\n\nFor more advanced features of model quantization, please refer to [this script](examples\u002Fquantization\u002Fquant_with_alpaca.py).\n\n### Customize Model\n\u003Cdetails>\n\n\u003Csummary>Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:\u003C\u002Fsummary>\n\n```python\nfrom auto_gptq.modeling import 
BaseGPTQForCausalLM\n\n\nclass OPTGPTQForCausalLM(BaseGPTQForCausalLM):\n    # chained attribute name of transformer layer block\n    layers_block_name = \"model.decoder.layers\"\n    # chained attribute names of other nn modules that in the same level as the transformer layer block\n    outside_layer_modules = [\n        \"model.decoder.embed_tokens\", \"model.decoder.embed_positions\", \"model.decoder.project_out\",\n        \"model.decoder.project_in\", \"model.decoder.final_layer_norm\"\n    ]\n    # chained attribute names of linear layers in transformer layer module\n    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,\n    # and the order should be the order when they are truly executed, in this case (and usually in most cases),\n    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output\n    inside_layer_modules = [\n        [\"self_attn.k_proj\", \"self_attn.v_proj\", \"self_attn.q_proj\"],\n        [\"self_attn.out_proj\"],\n        [\"fc1\"],\n        [\"fc2\"]\n    ]\n```\nAfter this, you can use `OPTGPTQForCausalLM.from_pretrained` and other methods as shown in Basic.\n\n\u003C\u002Fdetails>\n\n### Evaluation on Downstream Tasks\nYou can use tasks defined in `auto_gptq.eval_tasks` to evaluate model's performance on specific down-stream task before and after quantization.\n\nThe predefined tasks support all causal-language-models implemented in [🤗 transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) and in this project.\n\n\u003Cdetails>\n\n\u003Csummary>Below is an example to evaluate `EleutherAI\u002Fgpt-j-6b` on sequence-classification task using `cardiffnlp\u002Ftweet_sentiment_multilingual` dataset:\u003C\u002Fsummary>\n\n```python\nfrom functools import partial\n\nimport datasets\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig\n\nfrom auto_gptq import AutoGPTQForCausalLM, 
BaseQuantizeConfig\nfrom auto_gptq.eval_tasks import SequenceClassificationTask\n\n\nMODEL = \"EleutherAI\u002Fgpt-j-6b\"\nDATASET = \"cardiffnlp\u002Ftweet_sentiment_multilingual\"\nTEMPLATE = \"Question:What's the sentiment of the given text? Choices are {labels}.\\nText: {text}\\nAnswer:\"\nID2LABEL = {\n    0: \"negative\",\n    1: \"neutral\",\n    2: \"positive\"\n}\nLABELS = list(ID2LABEL.values())\n\n\ndef ds_refactor_fn(samples):\n    text_data = samples[\"text\"]\n    label_data = samples[\"label\"]\n\n    new_samples = {\"prompt\": [], \"label\": []}\n    for text, label in zip(text_data, label_data):\n        prompt = TEMPLATE.format(labels=LABELS, text=text)\n        new_samples[\"prompt\"].append(prompt)\n        new_samples[\"label\"].append(ID2LABEL[label])\n\n    return new_samples\n\n\n#  model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to(\"cuda:0\")\nmodel = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())\ntokenizer = AutoTokenizer.from_pretrained(MODEL)\n\ntask = SequenceClassificationTask(\n        model=model,\n        tokenizer=tokenizer,\n        classes=LABELS,\n        data_name_or_path=DATASET,\n        prompt_col_name=\"prompt\",\n        label_col_name=\"label\",\n        **{\n            \"num_samples\": 1000,  # how many samples will be sampled to evaluation\n            \"sample_max_len\": 1024,  # max tokens for each sample\n            \"block_max_len\": 2048,  # max tokens for each data block\n            # function to load dataset, one must only accept data_name_or_path as input\n            # and return datasets.Dataset\n            \"load_fn\": partial(datasets.load_dataset, name=\"english\"),\n            # function to preprocess dataset, which is used for datasets.Dataset.map,\n            # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]\n            \"preprocess_fn\": ds_refactor_fn,\n            # truncate label when sample's length exceed 
sample_max_len\n            \"truncate_prompt\": False\n        }\n    )\n\n# note that max_new_tokens will be automatically specified internally based on given classes\nprint(task.run())\n\n# self-consistency\nprint(\n    task.run(\n        generation_config=GenerationConfig(\n            num_beams=3,\n            num_return_sequences=3,\n            do_sample=True\n        )\n    )\n)\n```\n\n\u003C\u002Fdetails>\n\n## Learn More\n[tutorials](docs\u002Ftutorial) provide step-by-step guidance to integrate `auto_gptq` with your own project and some best practice principles.\n\n[examples](examples\u002FREADME.md) provide plenty of example scripts to use `auto_gptq` in different ways.\n\n## Supported Models\n\n> you can use `model.config.model_type` to compare with the table below to check whether the model you use is supported by `auto_gptq`.\n>\n> for example, model_type of `WizardLM`, `vicuna` and `gpt4all` are all `llama`, hence they are all supported by `auto_gptq`.\n\n| model type                         | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt                                                                            |\n|------------------------------------|--------------|-----------|-----------|---------------|-------------------------------------------------------------------------------------------------|\n| bloom                              | ✅            | ✅         | ✅         | ✅             |                                                                                                 |\n| gpt2                               | ✅            | ✅         | ✅         | ✅             |                                                                                                 |\n| gpt_neox                           | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002Fpeft\u002Ftree\u002Fmulti_modal_adaption_prompt) |\n| gptj            
                   | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002Fpeft\u002Ftree\u002Fmulti_modal_adaption_prompt) |\n| llama                              | ✅            | ✅         | ✅         | ✅             | ✅                                                                                               |\n| moss                               | ✅            | ✅         | ✅         | ✅             | ✅[requires this peft branch](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002Fpeft\u002Ftree\u002Fmulti_modal_adaption_prompt) |\n| opt                                | ✅            | ✅         | ✅         | ✅             |                                                                                                 |\n| gpt_bigcode                        | ✅            | ✅         | ✅         | ✅             |                                                                                                 |\n| codegen                            | ✅            | ✅         | ✅         | ✅             |                                                                                                 |\n| falcon(RefinedWebModel\u002FRefinedWeb) | ✅            | ✅         | ✅         | ✅             |                                                                                                 |\n\n## Supported Evaluation Tasks\nCurrently, `auto_gptq` supports: `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more tasks will come soon!\n\n## Running tests\n\nTests can be run with:\n\n```\npytest tests\u002F -s\n```\n\n## FAQ\n\n### Which kernel is used by default?\n\nAutoGPTQ defaults to the exllamav2 int4*fp16 kernel for matrix multiplication.\n\n### How to use the Marlin kernel?\n\nMarlin is an optimized int4 * fp16 kernel that was recently proposed at https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin. 
This kernel is integrated in AutoGPTQ and is enabled by loading a model with `use_marlin=True`. It is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).\n\n## Acknowledgement\n- Special thanks to **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh** for proposing the **GPTQ** algorithm and open-sourcing the [code](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fgptq), and for releasing the [Marlin kernel](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin) for mixed precision computation.\n- Special thanks to **qwopqwop200**: the code in this project that is relevant to quantization is mainly referenced from [GPTQ-for-LLaMa](https:\u002F\u002Fgithub.com\u002Fqwopqwop200\u002FGPTQ-for-LLaMa\u002Ftree\u002Fcuda).\n- Special thanks to **turboderp**, for releasing the [Exllama](https:\u002F\u002Fgithub.com\u002Fturboderp\u002Fexllama) and [Exllama v2](https:\u002F\u002Fgithub.com\u002Fturboderp\u002Fexllamav2) libraries with efficient mixed precision kernels.\n","AutoGPTQ is an easy-to-use quantization toolkit for large language models, based on the GPTQ algorithm (weight-only quantization), with user-friendly APIs. The project implements efficient quantization techniques to reduce model size and improve inference speed while maintaining high accuracy. It supports a variety of pretrained models and integrates with the Hugging Face Transformers library, making it more convenient to run and train quantized GPTQ models. AutoGPTQ suits scenarios that need to optimize resource usage or accelerate inference, such as deploying large models on devices with limited compute or reducing cloud service costs. Note, however, that the project is no longer maintained; switching to GPTQModel is recommended for the latest feature support and bug fixes.",2,"2026-06-11 03:36:28","high_star"]