[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72717":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},72717,"airllm","lyogavin\u002Fairllm","lyogavin","AirLLM 70B inference with single 4GB GPU","",null,"Jupyter Notebook",19757,2202,220,139,0,276,1323,1505,828,120,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39],"chinese-llm","chinese-nlp","finetune","generative-ai","instruct-gpt","instruction-set","llama","llm","lora","open-models","open-source","open-source-models","qlora","2026-06-12 04:01:07","![airllm_logo](https:\u002F\u002Fgithub.com\u002Flyogavin\u002Fairllm\u002Fblob\u002Fmain\u002Fassets\u002Fairllm_logo_sm.png?v=3&raw=true)\n\n[**Quickstart**](#quickstart) | \n[**Configurations**](#configurations) | \n[**MacOS**](#macos) | \n[**Example notebooks**](#example-python-notebook) | \n[**FAQ**](#faq)\n\n**AirLLM** optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation and pruning. 
And you can run **405B Llama3.1** on **8GB vram** now.\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flyogavin\u002Fairllm\u002Fstargazers\">![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flyogavin\u002Fairllm?style=social)\u003C\u002Fa>\n[![Downloads](https:\u002F\u002Fstatic.pepy.tech\u002Fpersonalized-badge\u002Fairllm?period=total&units=international_system&left_color=grey&right_color=blue&left_text=downloads)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fairllm)\n\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg)](https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Fblob\u002Fmain\u002FLICENSE)\n[![Generic badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fwechat-Anima-brightgreen?logo=wechat)](https:\u002F\u002Fstatic.aicompose.cn\u002Fstatic\u002Fwecom_barcode.png?t=1671918938)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1175437549783760896?logo=discord&color=7289da\n)](https:\u002F\u002Fdiscord.gg\u002F2xffU5sn)\n[![PyPI - AirLLM](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fformat\u002Fairllm?logo=pypi&color=3571a3)\n](https:\u002F\u002Fpypi.org\u002Fproject\u002Fairllm\u002F)\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fwebsite?up_message=blog&url=https%3A%2F%2Fmedium.com%2F%40lyo.gavin&logo=medium&color=black)](https:\u002F\u002Fmedium.com\u002F@lyo.gavin)\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGavin_Li-Blog-blue)](https:\u002F\u002Fgavinliblog.com)\n[![Support me on Patreon](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https%3A%2F%2Fshieldsio-patreon.vercel.app%2Fapi%3Fusername%3Dgavinli%26type%3Dpatrons&style=flat)](https:\u002F\u002Fpatreon.com\u002Fgavinli)\n[![GitHub Sponsors](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fsponsors\u002Flyogavin?logo=GitHub&color=lightgray)](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Flyogavin)\n\n## AI Agents Recommendation:\n\n* [Best AI 
Game Sprite Generator](https:\u002F\u002Fgodmodeai.co)\n\n* [Best AI Facial Expression Editor](https:\u002F\u002Fcrazyfaceai.com)\n\n## Updates\n[2024\u002F08\u002F20] v2.11.0: Support Qwen2.5\n\n[2024\u002F08\u002F18] v2.10.1 Support CPU inference. Support non sharded models. Thanks @NavodPeiris for the great work! \n\n[2024\u002F07\u002F30] Support Llama3.1 **405B** ([example notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Flyogavin\u002Fairllm\u002Fblob\u002Fmain\u002Fair_llm\u002Fexamples\u002Frun_llama3.1_405B.ipynb)). Support **8bit\u002F4bit quantization**.\n\n[2024\u002F04\u002F20] AirLLM supports Llama3 natively already. Run Llama3 70B on 4GB single GPU.\n\n[2023\u002F12\u002F25] v2.8.2: Support MacOS running 70B large language models.\n\n[2023\u002F12\u002F20] v2.7: Support AirLLMMixtral. \n\n[2023\u002F12\u002F20] v2.6: Added AutoModel, automatically detect model type, no need to provide model class to initialize model.\n\n[2023\u002F12\u002F18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.\n\n[2023\u002F12\u002F03] added support of **ChatGLM**, **QWen**, **Baichuan**, **Mistral**, **InternLM**!\n\n[2023\u002F12\u002F02] added support for safetensors. Now support all top 10 models in open llm leaderboard.\n\n[2023\u002F12\u002F01] airllm 2.0. 
Support compressions: **3x run time speed up!**\n\n[2023\u002F11\u002F20] airllm Initial version!\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=lyogavin\u002Fairllm&type=Timeline)](https:\u002F\u002Fstar-history.com\u002F#lyogavin\u002Fairllm&Timeline)\n\n## Table of Contents\n\n* [Quick start](#quickstart)\n* [Model Compression](#model-compression---3x-inference-speed-up)\n* [Configurations](#configurations)\n* [Run on MacOS](#macos)\n* [Example notebooks](#example-python-notebook)\n* [Supported Models](#supported-models)\n* [Acknowledgement](#acknowledgement)\n* [FAQ](#faq)\n\n## Quickstart\n\n### 1. Install package\n\nFirst, install the airllm pip package.\n\n```bash\npip install airllm\n```\n\n### 2. Inference\n\nThen, initialize the model, passing in the Hugging Face repo ID of the model being used or its local path. Inference can then be performed much like with a regular transformers model.\n\n(*You can also specify where to save the split, layer-wise model through **layer_shards_saving_path** when initializing the model.*)\n\n```python\nfrom airllm import AutoModel\n\nMAX_LENGTH = 128\n# could use hugging face model repo id:\nmodel = AutoModel.from_pretrained(\"garage-bAInd\u002FPlatypus2-70B-instruct\")\n\n# or use model's local path...\n#model = AutoModel.from_pretrained(\"\u002Fhome\u002Fubuntu\u002F.cache\u002Fhuggingface\u002Fhub\u002Fmodels--garage-bAInd--Platypus2-70B-instruct\u002Fsnapshots\u002Fb585e74bcaae02e52665d9ac6d23f4d0dbc81a0f\")\n\ninput_text = [\n        'What is the capital of United States?',\n        #'I like',\n    ]\n\ninput_tokens = model.tokenizer(input_text,\n    return_tensors=\"pt\", \n    return_attention_mask=False, \n    truncation=True, \n    max_length=MAX_LENGTH, \n    padding=False)\n           \ngeneration_output = model.generate(\n    input_tokens['input_ids'].cuda(), \n    max_new_tokens=20,\n    use_cache=True,\n    return_dict_in_generate=True)\n\noutput = 
model.tokenizer.decode(generation_output.sequences[0])\n\nprint(output)\n\n```\n \n \nNote: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.\n \n\n## Model Compression - 3x Inference Speed Up!\n\nWe just added model compression based on block-wise quantization. It can further **speed up inference** by up to **3x**, with **almost negligible accuracy loss!** (See [this paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09720) for more performance evaluation and for why we use block-wise quantization.)\n\n![speed_improvement](https:\u002F\u002Fgithub.com\u002Flyogavin\u002Fairllm\u002Fblob\u002Fmain\u002Fassets\u002Fairllm2_time_improvement.png?v=2&raw=true)\n\n#### How to enable model compression speed up:\n\n* Step 1. Make sure you have [bitsandbytes](https:\u002F\u002Fgithub.com\u002FTimDettmers\u002Fbitsandbytes) installed: `pip install -U bitsandbytes`\n* Step 2. Make sure your airllm version is later than 2.0.0: `pip install -U airllm`\n* Step 3. When initializing the model, pass the argument compression ('4bit' or '8bit'):\n\n```python\nmodel = AutoModel.from_pretrained(\"garage-bAInd\u002FPlatypus2-70B-instruct\",\n                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization \n                    )\n```\n\n#### What are the differences between model compression and quantization?\n\nQuantization normally needs to quantize both weights and activations to really speed things up, which makes it harder to maintain accuracy and avoid the impact of outliers across all kinds of inputs.\n\nIn our case, however, the bottleneck is mainly disk loading, and we only need to make the model loading size smaller. 
That means we only need to quantize the weights, which makes it easier to preserve accuracy.\n\n## Configurations\n \nWhen initializing the model, the following configurations are supported:\n\n* **compression**: supported options: 4bit or 8bit for 4-bit or 8-bit block-wise quantization; defaults to None for no compression\n* **profiling_mode**: supported options: True to output time consumption; defaults to False\n* **layer_shards_saving_path**: optionally, another path in which to save the split model\n* **hf_token**: a Hugging Face token can be provided here when downloading gated models such as *meta-llama\u002FLlama-2-7b-hf*\n* **prefetching**: prefetching to overlap model loading and compute. Turned on by default. For now, only AirLLMLlama2 supports this.\n* **delete_original**: if you are short on disk space, you can set delete_original to true to delete the originally downloaded Hugging Face model and keep only the transformed one, saving half of the disk space. \n\n## MacOS\n\nJust install airllm and run the code the same way as on Linux. 
See more in [Quick Start](#quickstart).\n\n* make sure you have installed [mlx](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx?tab=readme-ov-file#installation) and torch\n* you probably need to install a native Python build; see more [here](https:\u002F\u002Fstackoverflow.com\u002Fa\u002F65432861\u002F21230266)\n* only [Apple silicon](https:\u002F\u002Fsupport.apple.com\u002Fen-us\u002FHT211814) is supported\n\nExample [python notebook](https:\u002F\u002Fgithub.com\u002Flyogavin\u002Fairllm\u002Fblob\u002Fmain\u002Fair_llm\u002Fexamples\u002Frun_on_macos.ipynb)\n\n\n## Example Python Notebook\n\nExample colabs here:\n\n\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Flyogavin\u002Fairllm\u002Fblob\u002Fmain\u002Fair_llm\u002Fexamples\u002Frun_all_types_of_models.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\"\u002F>\n\u003C\u002Fa>\n\n#### Examples of other models (ChatGLM, QWen, Baichuan, Mistral, etc.):\n\n\u003Cdetails>\n\n\n* ChatGLM:\n\n```python\nfrom airllm import AutoModel\nMAX_LENGTH = 128\nmodel = AutoModel.from_pretrained(\"THUDM\u002Fchatglm3-6b-base\")\ninput_text = ['What is the capital of China?',]\ninput_tokens = model.tokenizer(input_text,\n    return_tensors=\"pt\", \n    return_attention_mask=False, \n    truncation=True, \n    max_length=MAX_LENGTH, \n    padding=True)\ngeneration_output = model.generate(\n    input_tokens['input_ids'].cuda(), \n    max_new_tokens=5,\n    use_cache=True,\n    return_dict_in_generate=True)\nmodel.tokenizer.decode(generation_output.sequences[0])\n```\n\n* QWen:\n\n```python\nfrom airllm import AutoModel\nMAX_LENGTH = 128\nmodel = AutoModel.from_pretrained(\"Qwen\u002FQwen-7B\")\ninput_text = ['What is the capital of China?',]\ninput_tokens = model.tokenizer(input_text,\n    return_tensors=\"pt\", \n    return_attention_mask=False, \n    truncation=True, \n    
max_length=MAX_LENGTH)\ngeneration_output = model.generate(\n    input_tokens['input_ids'].cuda(), \n    max_new_tokens=5,\n    use_cache=True,\n    return_dict_in_generate=True)\nmodel.tokenizer.decode(generation_output.sequences[0])\n```\n\n\n* Baichuan, InternLM, Mistral, etc:\n\n```python\nfrom airllm import AutoModel\nMAX_LENGTH = 128\nmodel = AutoModel.from_pretrained(\"baichuan-inc\u002FBaichuan2-7B-Base\")\n#model = AutoModel.from_pretrained(\"internlm\u002Finternlm-20b\")\n#model = AutoModel.from_pretrained(\"mistralai\u002FMistral-7B-Instruct-v0.1\")\ninput_text = ['What is the capital of China?',]\ninput_tokens = model.tokenizer(input_text,\n    return_tensors=\"pt\", \n    return_attention_mask=False, \n    truncation=True, \n    max_length=MAX_LENGTH)\ngeneration_output = model.generate(\n    input_tokens['input_ids'].cuda(), \n    max_new_tokens=5,\n    use_cache=True,\n    return_dict_in_generate=True)\nmodel.tokenizer.decode(generation_output.sequences[0])\n```\n\n\n\u003C\u002Fdetails>\n\n\n#### To request other model support: [here](https:\u002F\u002Fdocs.google.com\u002Fforms\u002Fd\u002Fe\u002F1FAIpQLSe0Io9ANMT964Zi-OQOq1TJmnvP-G3_ZgQDhP7SatN0IEdbOg\u002Fviewform?usp=sf_link)\n\n\n\n## Acknowledgement\n\nA lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:\n\n[GitHub account @SimJeg](https:\u002F\u002Fgithub.com\u002FSimJeg), \n[the code on Kaggle](https:\u002F\u002Fwww.kaggle.com\u002Fcode\u002Fsimjeg\u002Fplatypus2-70b-with-wikipedia-rag), \n[the associated discussion](https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fkaggle-llm-science-exam\u002Fdiscussion\u002F446414).\n\n\n## FAQ\n\n### 1. MetadataIncompleteBuffer\n\nsafetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer\n\nIf you run into this error, the most likely cause is that you have run out of disk space. The process of splitting the model is very disk-consuming. 
See [this](https:\u002F\u002Fhuggingface.co\u002FTheBloke\u002Fguanaco-65B-GPTQ\u002Fdiscussions\u002F12). You may need to extend your disk space, clear the huggingface [.cache](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fcache) and rerun. \n\n### 2. ValueError: max() arg is an empty sequence\n\nMost likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:\n\nFor a QWen model: \n\n```python\nfrom airllm import AutoModel #\u003C----- instead of AirLLMLlama2\nAutoModel.from_pretrained(...)\n```\n\nFor a ChatGLM model: \n\n```python\nfrom airllm import AutoModel #\u003C----- instead of AirLLMLlama2\nAutoModel.from_pretrained(...)\n```\n\n### 3. 401 Client Error....Repo model ... is gated.\n\nSome models are gated and need a huggingface API token. You can provide it through hf_token:\n\n```python\nmodel = AutoModel.from_pretrained(\"meta-llama\u002FLlama-2-7b-hf\", hf_token='HF_API_TOKEN')\n```\n\n### 4. ValueError: Asking to pad but the tokenizer does not have a padding token.\n\nSome models' tokenizers don't have a padding token, so you can either set a padding token or simply turn the padding config off:\n\n```python\ninput_tokens = model.tokenizer(input_text,\n    return_tensors=\"pt\", \n    return_attention_mask=False, \n    truncation=True, \n    max_length=MAX_LENGTH, \n    padding=False  #\u003C-----------   turn off padding \n)\n```\n\n## Citing AirLLM\n\nIf you find\nAirLLM useful in your research and wish to cite it, please use the following\nBibTeX entry:\n\n```\n@software{airllm2023,\n  author = {Gavin Li},\n  title = {AirLLM: scaling large language models on low-end commodity computers},\n  url = {https:\u002F\u002Fgithub.com\u002Flyogavin\u002Fairllm\u002F},\n  version = {0.0},\n  year = {2023},\n}\n```\n\n\n## Contribution \n\nContributions, ideas and discussions are welcome!\n\nIf you find it useful, please ⭐ or buy me a coffee! 
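🙏\n\n[![\"Buy Me A Coffee\"](https:\u002F\u002Fwww.buymeacoffee.com\u002Fassets\u002Fimg\u002Fcustom_images\u002Forange_img.png)](https:\u002F\u002Fbmc.link\u002FlyogavinQ)\n","AirLLM is a project that optimizes inference memory usage for large language models, allowing 70B-parameter language models to run on a single 4GB GPU without quantization, distillation, or pruning. Its core features include efficient memory management, support for a range of mainstream large language models (for example, running Llama3.1 405B in 8GB of VRAM), and automatic model-type detection to simplify initialization. It also supports MacOS and can further lower hardware requirements through quantization. AirLLM is well suited to researchers and developers who want to experiment with large language models on limited compute resources.",2,"2026-06-11 03:43:21","high_star"]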