[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9707":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":18,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},9707,"text-generation-inference","huggingface\u002Ftext-generation-inference","huggingface","Large Language Model Text Generation Inference","http:\u002F\u002Fhf.co\u002Fdocs\u002Ftext-generation-inference",null,"Python",10862,1271,99,285,0,1,4,12,44.31,"Apache License 2.0",true,false,"main",[26,27,28,29,30,31,32,33,34],"bloom","deep-learning","falcon","gpt","inference","nlp","pytorch","starcoder","transformer","2026-06-12 02:02:11","> [!CAUTION]\n> text-generation-inference is now in maintenance mode. Going forward, we will accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks.\n>\n> TGI has initiated the movement for optimized inference engines to rely on `transformers` model architectures. 
This approach is now adopted by downstream inference engines, which we contribute to and recommend using going forward: [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang), as well as local engines with inter-compatibility such as llama.cpp or MLX.\n\n\u003Cdiv align=\"center\">\n\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=jlMAX2Oaht0\">\n  \u003Cimg width=560 alt=\"Making TGI deployment optimal\" src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FNarsil\u002Ftgi_assets\u002Fresolve\u002Fmain\u002Fthumbnail.png\">\n\u003C\u002Fa>\n\n# Text Generation Inference\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\">\n  \u003Cimg alt=\"GitHub Repo stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Ftext-generation-inference?style=social\">\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.github.io\u002Ftext-generation-inference\">\n  \u003Cimg alt=\"Swagger API documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAPI-Swagger-informational\">\n\u003C\u002Fa>\n\nA Rust, Python and gRPC server for text generation inference. 
Used in production at [Hugging Face](https:\u002F\u002Fhuggingface.co)\nto power Hugging Chat, the Inference API and Inference Endpoints.\n\n\u003C\u002Fdiv>\n\n## Table of contents\n\n  - [Get Started](#get-started)\n    - [Docker](#docker)\n    - [API documentation](#api-documentation)\n    - [Using a private or gated model](#using-a-private-or-gated-model)\n    - [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)\n    - [Distributed Tracing](#distributed-tracing)\n    - [Architecture](#architecture)\n    - [Local install](#local-install)\n    - [Local install (Nix)](#local-install-nix)\n  - [Optimized architectures](#optimized-architectures)\n  - [Run locally](#run-locally)\n    - [Run](#run)\n    - [Quantization](#quantization)\n  - [Develop](#develop)\n  - [Testing](#testing)\n\nText Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fsupported_models). 
TGI implements many features, such as:\n\n- Simple launcher to serve most popular LLMs\n- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)\n- Tensor Parallelism for faster inference on multiple GPUs\n- Token streaming using Server-Sent Events (SSE)\n- Continuous batching of incoming requests for increased total throughput\n- [Messages API](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fen\u002Fmessages_api) compatible with the OpenAI Chat Completion API\n- Optimized transformers code for inference using [Flash Attention](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fflash-attention) and [Paged Attention](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) on the most popular architectures\n- Quantization with:\n  - [bitsandbytes](https:\u002F\u002Fgithub.com\u002FTimDettmers\u002Fbitsandbytes)\n  - [GPT-Q](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.17323)\n  - [EETQ](https:\u002F\u002Fgithub.com\u002FNetEase-FuXi\u002FEETQ)\n  - [AWQ](https:\u002F\u002Fgithub.com\u002Fcasper-hansen\u002FAutoAWQ)\n  - [Marlin](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin)\n  - [fp8](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fnvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai\u002F)\n- [Safetensors](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fsafetensors) weight loading\n- Watermarking with [A Watermark for Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.10226)\n- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see [transformers.LogitsProcessor](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Finternal\u002Fgeneration_utils#transformers.LogitsProcessor))\n- Stop sequences\n- Log probabilities\n- [Speculation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fconceptual\u002Fspeculation) ~2x latency\n- 
[Guidance\u002FJSON](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fconceptual\u002Fguidance). Specify the output format to speed up inference and make sure the output is valid according to a given specification.\n- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output\n- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance\n\n### Hardware support\n\n- [Nvidia](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\u002Fpkgs\u002Fcontainer\u002Ftext-generation-inference)\n- [AMD](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\u002Fpkgs\u002Fcontainer\u002Ftext-generation-inference) (-rocm)\n- [Inferentia](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Foptimum-neuron\u002Ftree\u002Fmain\u002Ftext-generation-inference)\n- [Intel GPU](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\u002Fpull\u002F1475)\n- [Gaudi](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftgi-gaudi)\n- [Google TPU](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Foptimum-tpu\u002Fhowto\u002Fserving)\n\n\n## Get Started\n\n### Docker\n\nFor a detailed starting guide, please see the [Quick Tour](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fquicktour). 
The easiest way of getting started is using the official Docker container:\n\n```shell\nmodel=HuggingFaceH4\u002Fzephyr-7b-beta\n# share a volume with the Docker container to avoid downloading weights every run\nvolume=$PWD\u002Fdata\n\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:\u002Fdata \\\n    ghcr.io\u002Fhuggingface\u002Ftext-generation-inference:3.3.5 --model-id $model\n```\n\nYou can then make requests like this:\n\n```bash\ncurl 127.0.0.1:8080\u002Fgenerate_stream \\\n    -X POST \\\n    -d '{\"inputs\":\"What is Deep Learning?\",\"parameters\":{\"max_new_tokens\":20}}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\nYou can also use [TGI's Messages API](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fen\u002Fmessages_api) to obtain responses compatible with the OpenAI Chat Completion API.\n\n```bash\ncurl localhost:8080\u002Fv1\u002Fchat\u002Fcompletions \\\n    -X POST \\\n    -d '{\n  \"model\": \"tgi\",\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"You are a helpful assistant.\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"What is deep learning?\"\n    }\n  ],\n  \"stream\": true,\n  \"max_tokens\": 20\n}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\n**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`. Please note that CPU is not the intended platform for this project, so performance might be subpar.\n\n**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. 
Details can be found in the [Supported Hardware documentation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Finstallation_amd#using-tgi-with-amd-gpus). To use AMD GPUs, please use `docker run --device \u002Fdev\u002Fkfd --device \u002Fdev\u002Fdri --shm-size 1g -p 8080:80 -v $volume:\u002Fdata ghcr.io\u002Fhuggingface\u002Ftext-generation-inference:3.3.5-rocm --model-id $model` instead of the command above.\n\nTo see all options to serve your models (in the [code](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\u002Fblob\u002Fmain\u002Flauncher\u002Fsrc\u002Fmain.rs) or in the cli):\n```\ntext-generation-launcher --help\n```\n\n### API documentation\n\nYou can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `\u002Fdocs` route.\nThe Swagger UI is also available at: [https:\u002F\u002Fhuggingface.github.io\u002Ftext-generation-inference](https:\u002F\u002Fhuggingface.github.io\u002Ftext-generation-inference).\n\n### Using a private or gated model\n\nYou have the option to utilize the `HF_TOKEN` environment variable for configuring the token employed by\n`text-generation-inference`. This allows you to gain access to protected resources.\n\nFor example, if you want to serve the gated Llama V2 model variants:\n\n1. Go to https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens\n2. Copy your CLI READ token\n3. 
Export `HF_TOKEN=\u003Cyour CLI READ token>`\n\nor with Docker:\n\n```shell\nmodel=meta-llama\u002FMeta-Llama-3.1-8B-Instruct\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\ntoken=\u003Cyour cli READ token>\n\ndocker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:\u002Fdata \\\n    ghcr.io\u002Fhuggingface\u002Ftext-generation-inference:3.3.5 --model-id $model\n```\n\n### A note on Shared Memory (shm)\n\n[`NCCL`](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Fnccl\u002Fuser-guide\u002Fdocs\u002Findex.html) is a communication framework used by\n`PyTorch` to do distributed training\u002Finference. `text-generation-inference` makes\nuse of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.\n\nIn order to share data between the different devices of a `NCCL` group, `NCCL` might fall back to using the host memory if\npeer-to-peer using NVLink or PCI is not possible.\n\nTo allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the above command.\n\nIf you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by\ncreating a volume with:\n\n```yaml\n- name: shm\n  emptyDir:\n   medium: Memory\n   sizeLimit: 1Gi\n```\n\nand mounting it to `\u002Fdev\u002Fshm`.\n\nFinally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that\nthis will impact performance.\n\n### Distributed Tracing\n\n`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature\nby setting the address to an OTLP collector with the `--otlp-endpoint` argument. 
The default service name can be\noverridden with the `--otlp-service-name` argument.\n\n### Architecture\n\n![TGI architecture](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fresolve\u002Fmain\u002FTGI.png)\n\nDetailed blog post by Adyen on TGI's inner workings: [LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https:\u002F\u002Fwww.adyen.com\u002Fknowledge-hub\u002Fllm-inference-at-scale-with-tgi)\n\n### Local install\n\nYou can also opt to install `text-generation-inference` locally.\n\nFirst clone the repository and change directory into it:\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\ncd text-generation-inference\n```\n\nThen [install Rust](https:\u002F\u002Frustup.rs\u002F) and create a Python virtual environment with at least\nPython 3.9, e.g. using `conda` or `python venv`:\n\n```shell\ncurl --proto '=https' --tlsv1.2 -sSf https:\u002F\u002Fsh.rustup.rs | sh\n\n# using conda\nconda create -n text-generation-inference python=3.11\nconda activate text-generation-inference\n\n# using python venv\npython3 -m venv .venv\nsource .venv\u002Fbin\u002Factivate\n```\n\nYou may also need to install Protoc.\n\nOn Linux:\n\n```shell\nPROTOC_ZIP=protoc-21.12-linux-x86_64.zip\ncurl -OL https:\u002F\u002Fgithub.com\u002Fprotocolbuffers\u002Fprotobuf\u002Freleases\u002Fdownload\u002Fv21.12\u002F$PROTOC_ZIP\nsudo unzip -o $PROTOC_ZIP -d \u002Fusr\u002Flocal bin\u002Fprotoc\nsudo unzip -o $PROTOC_ZIP -d \u002Fusr\u002Flocal 'include\u002F*'\nrm -f $PROTOC_ZIP\n```\n\nOn macOS, using Homebrew:\n\n```shell\nbrew install protobuf\n```\n\nThen run:\n\n```shell\nBUILD_EXTENSIONS=True make install # Install repository and HF\u002Ftransformer fork with CUDA kernels\ntext-generation-launcher --model-id mistralai\u002FMistral-7B-Instruct-v0.2\n```\n\n**Note:** on some machines, you may also need the OpenSSL libraries and gcc. 
On Linux machines, run:\n\n```shell\nsudo apt-get install libssl-dev gcc -y\n```\n\n### Local install (Nix)\n\nAnother option is to install `text-generation-inference` locally using [Nix](https:\u002F\u002Fnixos.org). Currently,\nwe only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can\nbe pulled from a binary cache, removing the need to build them locally.\n\nFirst follow the instructions to [install Cachix and enable the Hugging Face cache](https:\u002F\u002Fapp.cachix.org\u002Fcache\u002Fhuggingface).\nSetting up the cache is important; otherwise Nix will build many of the dependencies\nlocally, which can take hours.\n\nAfter that you can run TGI with `nix run`:\n\n```shell\ncd text-generation-inference\nnix run --extra-experimental-features nix-command --extra-experimental-features flakes . -- --model-id meta-llama\u002FLlama-3.1-8B-Instruct\n```\n\n**Note:** when you are using Nix on a non-NixOS system, you have to [make some symlinks](https:\u002F\u002Fdanieldk.eu\u002FNix-CUDA-on-non-NixOS-systems#make-runopengl-driverlib-and-symlink-the-driver-library)\nto make the CUDA driver libraries visible to Nix packages.\n\nFor TGI development, you can use the `impure` dev shell:\n\n```shell\nnix develop .#impure\n\n# Only needed the first time the devshell is started or after updating the protobuf.\n(\ncd server\nmkdir text_generation_server\u002Fpb || true\npython -m grpc_tools.protoc -I..\u002Fproto\u002Fv3 --python_out=text_generation_server\u002Fpb \\\n       --grpc_python_out=text_generation_server\u002Fpb --mypy_out=text_generation_server\u002Fpb ..\u002Fproto\u002Fv3\u002Fgenerate.proto\nfind text_generation_server\u002Fpb\u002F -type f -name \"*.py\" -print0 -exec sed -i -e 's\u002F^\\(import.*pb2\\)\u002Ffrom . \\1\u002Fg' {} \\;\ntouch text_generation_server\u002Fpb\u002F__init__.py\n)\n```\n\nAll development dependencies (cargo, Python, Torch, etc.) 
are available in this\ndev shell.\n\n## Optimized architectures\n\nTGI serves optimized implementations of all modern models out of the box. The supported models can be found in [this list](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fsupported_models).\n\nOther architectures are supported on a best-effort basis using:\n\n`AutoModelForCausalLM.from_pretrained(\u003Cmodel>, device_map=\"auto\")`\n\nor\n\n`AutoModelForSeq2SeqLM.from_pretrained(\u003Cmodel>, device_map=\"auto\")`\n\n\n\n## Run locally\n\n### Run\n\n```shell\ntext-generation-launcher --model-id mistralai\u002FMistral-7B-Instruct-v0.2\n```\n\n### Quantization\n\nYou can also run pre-quantized weights (AWQ, GPTQ, Marlin) or quantize weights on the fly with bitsandbytes, EETQ, or fp8 to reduce the VRAM requirement:\n\n```shell\ntext-generation-launcher --model-id mistralai\u002FMistral-7B-Instruct-v0.2 --quantize\n```\n\n4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14314.pdf). 
It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.\n\nRead more about quantization in the [Quantization documentation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fen\u002Fconceptual\u002Fquantization).\n\n## Develop\n\n```shell\nmake server-dev\nmake router-dev\n```\n\n## Testing\n\n```shell\n# python\nmake python-server-tests\nmake python-client-tests\n# or both server and client tests\nmake python-tests\n# rust cargo tests\nmake rust-tests\n# integration tests\nmake integration-tests\n```\n","Text Generation Inference (TGI) is a toolkit for deploying and serving large language models (LLMs). It supports many popular open-source LLMs, including Llama, Falcon, and StarCoder, and provides efficient text generation. Its core features include a simple launcher, production-grade capabilities such as distributed tracing and Prometheus metrics, tensor parallelism to speed up multi-GPU inference, token streaming based on Server-Sent Events, and continuous batching of requests to increase overall throughput. In addition, TGI offers a Messages API compatible with the OpenAI Chat Completion API and optimizes transformers code with techniques such as Flash Attention to improve inference performance. The project suits applications that need high-performance text generation, such as online chatbots and automated content creation systems.",2,"2026-06-11 03:24:18","top_topic"]