[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71117":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":28,"discoverSource":29},71117,"gpt-fast","meta-pytorch\u002Fgpt-fast","meta-pytorch","Simple and efficient pytorch-native transformer text generation in \u003C1000 LOC of python.","",null,"Python",6218,573,59,76,0,2,15,6,70.78,"BSD 3-Clause \"New\" or \"Revised\" License",false,"main",[],"2026-06-12 04:00:59","# gpt-fast\nSimple and efficient pytorch-native transformer text generation.\n\nFeaturing:\n1. Very low latency\n2. \u003C1000 lines of python\n3. No dependencies other than PyTorch and sentencepiece\n4. int8\u002Fint4 quantization\n5. Speculative decoding\n6. Tensor parallelism\n7. 
Supports Nvidia and AMD GPUs\n\nThis is *NOT* intended to be a \"framework\" or \"library\" - it is intended to show off what kind of performance you can get with native PyTorch :) Please copy-paste and fork as you desire.\n\nFor an in-depth walkthrough of what's in this codebase, see this [blog post](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-generative-ai-2\u002F).\n\n## Supported Models\n\n### LLaMA family\nSee the rest of this page for benchmarks of LLaMA family models.\n\n### Mixtral 8x7B\nWe also support [Mixtral 8x7B](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fmixtral-of-experts\u002F), a high-quality sparse mixture of experts (MoE) model. The average token generation rates (tokens\u002Fs) are:\n\n|                  |   1 GPU |    2 GPU  | 4 GPU  |    8 GPU   |\n|------------------|---------|-----------|--------|------------|\n|baseline(bfloat16)|    OOM  |    96.67  | 155.35 |  227.82    |\n|        int8      |   97.92 |   155.03  | 216.87 |  279.35    |\n\nThese benchmarks were run on an 8xA100-80GB, power limited to 330W, with a hybrid cube mesh topology. All benchmarks are run at *batch size=1*, making the reported tokens\u002Fs numbers equivalent to \"tokens\u002Fs\u002Fuser\". 
In addition, they are run with a very small prompt length (just 5 tokens).\n\nFor more details about Mixtral 8x7B, please check [this page](.\u002Fmixtral-moe) or this [note](https:\u002F\u002Fthonking.substack.com\u002Fp\u002Fshort-supporting-mixtral-in-gpt-fast).\n\n## Examples\nIn the spirit of keeping the repo minimal, here are various examples of extensions you can make to gpt-fast as PRs.\n- [Google Gemma](https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fgpt-fast\u002Fpull\u002F115)\n- [xAI Grok-1](https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fgpt-fast\u002Fpull\u002F171)\n- [Databricks DBRX](https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fgpt-fast\u002Fpull\u002F174)\n\n## Community\n\nProjects inspired by gpt-fast in the community:\n\n- [gpt-blazing](https:\u002F\u002Fgithub.com\u002Farmed-gpt\u002Fgpt-blazing): applies the same performance optimization strategy to more models (e.g., baichuan2).\n- [gptfast](https:\u002F\u002Fgithub.com\u002FMDK8888\u002FGPTFast): applies a subset of the performance optimizations to all Huggingface models.\n- [gpt-accelera](https:\u002F\u002Fgithub.com\u002FEdward-Sun\u002Fgpt-accelera): extends `gpt-fast` to SFT\u002FRM\u002FPPO training and batched inference to optimize throughput.\n\n## Installation\n[Download PyTorch nightly](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F)\n\nInstall required packages:\n\n```bash\npip install -r requirements.txt\n```\n\nTo download Llama models, go to https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-2-7b and follow the steps to obtain access.\nThen log in with `huggingface-cli login`.\n\n\n\n## Downloading Weights\nModels 
tested\u002Fsupported\n```text\ntinyllamas\u002Fstories{15,42,100}\nopenlm-research\u002Fopen_llama_7b\nmeta-llama\u002FLlama-2-7b-chat-hf\nmeta-llama\u002FLlama-2-13b-chat-hf\nmeta-llama\u002FLlama-2-70b-chat-hf\ncodellama\u002FCodeLlama-7b-Python-hf\ncodellama\u002FCodeLlama-34b-Python-hf\nmistralai\u002FMistral-7B-v0.1\nmistralai\u002FMistral-7B-Instruct-v0.1\nmistralai\u002FMistral-7B-Instruct-v0.2\nmeta-llama\u002FMeta-Llama-3-8B\nmeta-llama\u002FMeta-Llama-3.1-8B\nmeta-llama\u002FMeta-Llama-3.1-70B\nmeta-llama\u002FMeta-Llama-3.1-405B\n```\n\nFor example, to convert Llama-2-7b-chat-hf\n```bash\nexport MODEL_REPO=meta-llama\u002FLlama-2-7b-chat-hf\n.\u002Fscripts\u002Fprepare.sh $MODEL_REPO\n```\n\n## Benchmarks\nBenchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh topology. Note that all benchmarks are run at *batch size=1*, making the reported tokens\u002Fs numbers equivalent to \"tokens\u002Fs\u002Fuser\". In addition, they are run with a very small prompt length (just 5 tokens).\n\n| Model    | Technique | Tokens\u002FSecond | Memory Bandwidth (GB\u002Fs) |\n| -------- | ------- | ------ | ------ |\n| Llama-2-7B  | Base    |  104.9  | 1397.31 |\n|           | 8-bit   | 155.58   | 1069.20 |\n|           | 4-bit (G=32)   | 196.80   | 862.69 |\n| Llama-2-70B | Base    | OOM     ||\n|           | 8-bit   | 19.13    | 1322.58 |\n|           | 4-bit (G=32)   | 25.25    | 1097.66 |\n| Llama-3.1-8B  | Base    |  93.89  | 1410.76 |\n|           | 8-bit   | 137.64   | 1030.89 |\n| Llama-3.1-70B | Base    | OOM     ||\n|           | 8-bit   | 18.04    | 1253.78 |\n\n### Speculative Sampling\n[Verifier: Llama-70B (int4), Draft: Llama-7B (int4)](.\u002Fscripts\u002Fspeculate_70B_int4.sh): 48.4 tok\u002Fs\n\n### Tensor Parallelism\n| Model    | Number of GPUs | Tokens\u002FSecond | Memory Bandwidth (GB\u002Fs) |\n| -------- | ------- | ------ | ------ |\n| Llama-2-7B  | 1    |  104.9  | 1397.31 |\n|           | 2   | 168.84   | 1181.99 |\n|    
       | 4   | 254.02   | 955.83 |\n|           | 8   | 328.43   | 704.10 |\n| Llama-2-70B  | 1    |  OOM  |  |\n|           | 2   | 21.32   | 1481.87 |\n|           | 4   | 38.01   | 1340.76 |\n|           | 8   | 62.50   | 1135.29 |\n| Llama-3.1-8B  | 1    |  93.83  | 1408.37 |\n|           | 2   | 149.10   | 1197.32 |\n|           | 4   | 217.21   | 986.32  |\n|           | 8   | 276.01   | 772.60 |\n| Llama-3.1-70B  | 1    |  OOM  |  |\n|           | 2   | 16.03   | 1130.81 |\n|           | 4   | 37.45   | 1360.53 |\n|           | 8   | 58.78   | 1129.61 |\n\n### Tensor Parallelism + Quantization\n| Model    | Technique | Tokens\u002FSecond | Memory Bandwidth (GB\u002Fs) |\n| -------- | ------- | ------ | ------ |\n| Llama-2-70B | Base    | 62.50     | 1135.29 |\n|           | 8-bit   | 80.44    | 752.04 |\n|           | 4-bit (G=32)   | 90.77    | 548.10 |\n| Llama-3.1-70B | Base    | 58.78     | 1129.61 |\n|           | 8-bit   | 75.58    | 726.57 |\n| Llama-3.1-405B | 8-bit | 15.60 | 815.87 |\n\n### AMD\nBenchmarks run on one GCD of a MI-250x.\n\n| Model    | Technique | Tokens\u002FSecond | Memory Bandwidth (GB\u002Fs) |\n| -------- | ------- | ------ | ------ |\n| Llama-2-7B  | Base    |  76.33  | 1028.70 |\n|           | 8-bit   | 101.86   | 700.06 |\n\n## Generate Text\n\nModel definition in `model.py`, generation code in `generate.py`.\n\n```bash\npython generate.py --compile --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel.pth --prompt \"Hello, my name is\"\n```\n\nTo squeeze out a little bit more performance, you can also compile the prefill with `--compile_prefill`. 
This will increase compilation times though.\n\n## Quantization\nChoose the device to use:\n```bash\n# Currently supported devices: cuda, cpu\nexport DEVICE=cuda\n```\n### Int8 Weight-Only Quantization\nTo generate the int8 version of the model:\n```bash\n# Spits out model at checkpoints\u002F$MODEL_REPO\u002Fmodel_int8.pth\npython quantize.py --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel.pth --mode int8\n```\nTo run with int8, just pass the int8 checkpoint to generate.py.\n```bash\npython generate.py --compile --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel_int8.pth --device $DEVICE\n```\n\n### Int4 Weight-Only Quantization\nTo generate the int4 version of the model:\n```bash\n# Spits out model at checkpoints\u002F$MODEL_REPO\u002Fmodel_int4.g32.$DEVICE.pth\npython quantize.py --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel.pth --mode int4 --groupsize 32\n```\n\nTo run with int4, just pass the int4 checkpoint to generate.py.\n```bash\npython generate.py --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel_int4.g32.pth --compile\n```\n\n## Speculative Sampling\nTo generate with speculative sampling, point DRAFT_MODEL_REPO at a model smaller than MODEL_REPO.\n\nIn this example, the \"smaller\" model is just the int8 quantized version of the model.\n```bash\nexport DRAFT_MODEL_REPO=meta-llama\u002FLlama-2-7b-chat-hf\npython generate.py --compile --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel.pth --draft_checkpoint_path checkpoints\u002F$DRAFT_MODEL_REPO\u002Fmodel_int8.pth\n```\n\nNote: These runs use an A100 80GB, albeit power-limited to 330 watts. Empirically, peak bandwidth seems to be about 1700 GB\u002Fs.\n\n\n## Tensor Parallelism\n```bash\nENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel.pth\n```\n\n## Experimental\n### Evaluation\nWe use the EleutherAI evaluation harness to evaluate our model accuracy. 
To evaluate the accuracy, make sure the evaluation harness is installed and pass your model checkpoint and desired tasks to eval.py.\n\n```bash\npython eval.py --checkpoint_path checkpoints\u002F$MODEL_REPO\u002Fmodel.pth --compile --tasks hellaswag winogrande\n```\n\nNote: Generative tasks are currently not supported for gpt-fast.\n\nInstallation instructions for the evaluation harness: https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmaster#install\n\n### GPTQ\nWe have a pure PyTorch implementation of GPTQ that utilizes torch._dynamo.export to access the model structure. You can generate a GPTQ-quantized\nint4 model by using the same quantization command with 'gptq' added to the quantization mode, i.e.\n```bash\n# Spits out model at checkpoints\u002F$MODEL_REPO\u002Fmodel_int4-gptq.g32.pth\npython quantize.py --mode int4-gptq --calibration_tasks wikitext --calibration_seq_length 2048\n```\n\nYou can then eval or generate text with this model in the same way as above.\n\n## License\n\n`gpt-fast` is released under the [BSD 3](https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fgpt-fast\u002Fmain\u002FLICENSE) license.\n\n## Acknowledgements\nThanks to:\n* Lightning AI for supporting pytorch and work in flash attention, int8 quantization, and LoRA fine-tuning.\n* GGML for driving forward fast, on device inference of LLMs\n* Karpathy for spearheading simple, interpretable and fast LLM implementations\n* MLC-LLM for pushing 4-bit quantization performance on heterogeneous hardware\n","gpt-fast is an efficient PyTorch-based text generation project that aims to implement high-performance Transformer models with concise code. Its core features include very low latency, quantization support (int8\u002Fint4), speculative decoding, and tensor parallelism, and it depends on only two libraries, PyTorch and sentencepiece. The project is especially suited to researchers and developers who need rapid prototyping or performance optimization: it achieves near-optimal text generation speed on a single machine and offers good support for both NVIDIA and AMD GPUs. It also demonstrates how to reach high performance with native PyTorch and encourages users to customize and extend it as needed.","2026-06-11 03:35:57","high_star"]