[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-5767":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":34,"discoverSource":35},5767,"text-embeddings-inference","huggingface\u002Ftext-embeddings-inference","huggingface","A blazing fast inference solution for text embeddings models","https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-embeddings-inference\u002Fquick_tour",null,"Rust",4861,398,41,147,0,2,24,86,10,29.8,"Apache License 2.0",false,"main",true,[27,28,7,29,30],"ai","embeddings","llm","ml","2026-06-12 02:01:14","\u003Cdiv align=\"center\">\n\n# Text Embeddings Inference\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\">\n  \u003Cimg alt=\"GitHub Repo stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Ftext-embeddings-inference?style=social\">\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.github.io\u002Ftext-embeddings-inference\">\n  \u003Cimg alt=\"Swagger API documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAPI-Swagger-informational\">\n\u003C\u002Fa>\n\nA blazing fast inference solution for text embeddings models.\n\nBenchmark for [BAAI\u002Fbge-base-en-v1.5](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbge-base-en-v1.5) on an NVIDIA A10 with a sequence\nlength of 512 tokens:\n\n\u003Cp>\n  \u003Cimg src=\"assets\u002Fbs1-lat.png\" 
width=\"400\" \u002F>\n  \u003Cimg src=\"assets\u002Fbs1-tp.png\" width=\"400\" \u002F>\n\u003C\u002Fp>\n\u003Cp>\n  \u003Cimg src=\"assets\u002Fbs32-lat.png\" width=\"400\" \u002F>\n  \u003Cimg src=\"assets\u002Fbs32-tp.png\" width=\"400\" \u002F>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n## Table of contents\n\n- [Get Started](#get-started)\n    - [Supported Models](#supported-models)\n    - [Docker](#docker)\n    - [Docker Images](#docker-images)\n    - [API Documentation](#api-documentation)\n    - [Using a private or gated model](#using-a-private-or-gated-model)\n    - [Air gapped deployment](#air-gapped-deployment)\n    - [Using Re-rankers models](#using-re-rankers-models)\n    - [Using Sequence Classification models](#using-sequence-classification-models)\n    - [Using SPLADE pooling](#using-splade-pooling)\n    - [Distributed Tracing](#distributed-tracing)\n    - [gRPC](#grpc)\n- [Local Install](#local-install)\n    - [Apple Silicon (Homebrew)](#apple-silicon-homebrew)\n- [Docker Build](#docker-build)\n    - [ARM64 \u002F aarch64](#arm64--aarch64)\n- [AMD Instinct GPUs (ROCm)](#amd-instinct-gpus-rocm-experimental)\n- [Examples](#examples)\n\nText Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence\nclassification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding,\nEmber, GTE and E5. TEI implements many features such as:\n\n* No model graph compilation step\n* Metal support for local execution on Macs\n* Small docker images and fast boot times. 
Get ready for true serverless!\n* Token based dynamic batching\n* Optimized transformers code for inference using [Flash Attention](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fflash-attention),\n  [Candle](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fcandle)\n  and [cuBLASLt](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcublas\u002F#using-the-cublaslt-api)\n* [Safetensors](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fsafetensors) weight loading\n* [ONNX](https:\u002F\u002Fgithub.com\u002Fonnx\u002Fonnx) weight loading\n* Production ready (distributed tracing with Open Telemetry, Prometheus metrics)\n\n## Get Started\n\n### Supported Models\n\n#### Text Embeddings\n\nText Embeddings Inference currently supports Nomic, BERT, CamemBERT, XLM-RoBERTa models with absolute positions, JinaBERT\nmodel with Alibi positions and Mistral, Alibaba GTE, Qwen2 models with Rope positions, MPNet, ModernBERT, Qwen3, and Gemma3.\n\nBelow are some examples of the currently supported models:\n\n| MTEB Rank | Model Size             | Model Type     | Model ID                                                                                         |\n|-----------|------------------------|----------------|--------------------------------------------------------------------------------------------------|\n| 2         | 7.57B (Very Expensive) | Qwen3          | [Qwen\u002FQwen3-Embedding-8B](https:\u002F\u002Fhf.co\u002FQwen\u002FQwen3-Embedding-8B)                                 |\n| 3         | 4.02B (Very Expensive) | Qwen3          | [Qwen\u002FQwen3-Embedding-4B](https:\u002F\u002Fhf.co\u002FQwen\u002FQwen3-Embedding-4B)                                 |\n| 4         | 509M                   | Qwen3          | [Qwen\u002FQwen3-Embedding-0.6B](https:\u002F\u002Fhf.co\u002FQwen\u002FQwen3-Embedding-0.6B)                             |\n| 6         | 7.61B (Very Expensive) | Qwen2          | 
[Alibaba-NLP\u002Fgte-Qwen2-7B-instruct](https:\u002F\u002Fhf.co\u002FAlibaba-NLP\u002Fgte-Qwen2-7B-instruct)             |\n| 7         | 560M                   | XLM-RoBERTa    | [intfloat\u002Fmultilingual-e5-large-instruct](https:\u002F\u002Fhf.co\u002Fintfloat\u002Fmultilingual-e5-large-instruct) |\n| 8         | 308M                   | Gemma3         | [google\u002Fembeddinggemma-300m](https:\u002F\u002Fhf.co\u002Fgoogle\u002Fembeddinggemma-300m) (gated)                   |\n| 15        | 1.78B (Expensive)      | Qwen2          | [Alibaba-NLP\u002Fgte-Qwen2-1.5B-instruct](https:\u002F\u002Fhf.co\u002FAlibaba-NLP\u002Fgte-Qwen2-1.5B-instruct)         |\n| 18        | 7.11B (Very Expensive) | Mistral        | [Salesforce\u002FSFR-Embedding-2_R](https:\u002F\u002Fhf.co\u002FSalesforce\u002FSFR-Embedding-2_R)                       |\n| 35        | 568M                   | XLM-RoBERTa    | [Snowflake\u002Fsnowflake-arctic-embed-l-v2.0](https:\u002F\u002Fhf.co\u002FSnowflake\u002Fsnowflake-arctic-embed-l-v2.0) |\n| 41        | 305M                   | Alibaba GTE    | [Snowflake\u002Fsnowflake-arctic-embed-m-v2.0](https:\u002F\u002Fhf.co\u002FSnowflake\u002Fsnowflake-arctic-embed-m-v2.0) |\n| 52        | 335M                   | BERT           | [WhereIsAI\u002FUAE-Large-V1](https:\u002F\u002Fhf.co\u002FWhereIsAI\u002FUAE-Large-V1)                                   |\n| 58        | 137M                   | NomicBERT      | [nomic-ai\u002Fnomic-embed-text-v1](https:\u002F\u002Fhf.co\u002Fnomic-ai\u002Fnomic-embed-text-v1)                       |\n| 79        | 137M                   | NomicBERT      | [nomic-ai\u002Fnomic-embed-text-v1.5](https:\u002F\u002Fhf.co\u002Fnomic-ai\u002Fnomic-embed-text-v1.5)                   |\n| 103       | 109M                   | MPNet          | [sentence-transformers\u002Fall-mpnet-base-v2](https:\u002F\u002Fhf.co\u002Fsentence-transformers\u002Fall-mpnet-base-v2) |\n| N\u002FA       | 475M-A305M             | NomicBERT      | 
[nomic-ai\u002Fnomic-embed-text-v2-moe](https:\u002F\u002Fhf.co\u002Fnomic-ai\u002Fnomic-embed-text-v2-moe)               |\n| N\u002FA       | 434M                   | Alibaba GTE    | [Alibaba-NLP\u002Fgte-large-en-v1.5](https:\u002F\u002Fhf.co\u002FAlibaba-NLP\u002Fgte-large-en-v1.5)                     |\n| N\u002FA       | 396M                   | ModernBERT     | [answerdotai\u002FModernBERT-large](https:\u002F\u002Fhf.co\u002Fanswerdotai\u002FModernBERT-large)                       |\n| N\u002FA       | 340M                   | Qwen3          | [voyageai\u002Fvoyage-4-nano](https:\u002F\u002Fhf.co\u002Fvoyageai\u002Fvoyage-4-nano)                                   |\n| N\u002FA       | 137M                   | JinaBERT       | [jinaai\u002Fjina-embeddings-v2-base-en](https:\u002F\u002Fhf.co\u002Fjinaai\u002Fjina-embeddings-v2-base-en)             |\n| N\u002FA       | 137M                   | JinaBERT       | [jinaai\u002Fjina-embeddings-v2-base-code](https:\u002F\u002Fhf.co\u002Fjinaai\u002Fjina-embeddings-v2-base-code)         |\n\nTo explore the list of best performing text embeddings models, visit the\n[Massive Text Embedding Benchmark (MTEB) Leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmteb\u002Fleaderboard).\n\n#### Sequence Classification and Re-Ranking\n\nText Embeddings Inference currently supports CamemBERT, and XLM-RoBERTa Sequence Classification models with absolute positions.\n\nBelow are some examples of the currently supported models:\n\n| Task               | Model Type  | Model ID                                                                                                        |\n|--------------------|-------------|-----------------------------------------------------------------------------------------------------------------|\n| Re-Ranking         | XLM-RoBERTa | [BAAI\u002Fbge-reranker-large](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbge-reranker-large)                                       |\n| Re-Ranking     
    | XLM-RoBERTa | [BAAI\u002Fbge-reranker-base](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbge-reranker-base)                                         |\n| Re-Ranking         | GTE         | [Alibaba-NLP\u002Fgte-multilingual-reranker-base](https:\u002F\u002Fhuggingface.co\u002FAlibaba-NLP\u002Fgte-multilingual-reranker-base) |\n| Re-Ranking         | ModernBert  | [Alibaba-NLP\u002Fgte-reranker-modernbert-base](https:\u002F\u002Fhuggingface.co\u002FAlibaba-NLP\u002Fgte-reranker-modernbert-base) |\n| Sentiment Analysis | RoBERTa     | [SamLowe\u002Froberta-base-go_emotions](https:\u002F\u002Fhuggingface.co\u002FSamLowe\u002Froberta-base-go_emotions)                     |\n\n### Docker\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id $model\n```\n\nAnd then you can make requests like\n\n```bash\ncurl 127.0.0.1:8080\u002Fembed \\\n    -X POST \\\n    -d '{\"inputs\":\"What is Deep Learning?\"}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\n**Note:** To use GPUs, you need to install\nthe [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html).\nNVIDIA drivers on your machine need to be compatible with CUDA version 12.2 or higher.\n\nTo see all options to serve your models:\n\n```console\n$ text-embeddings-router --help\nText Embedding Webserver\n\nUsage: text-embeddings-router [OPTIONS] --model-id \u003CMODEL_ID>\n\nOptions:\n      --model-id \u003CMODEL_ID>\n          The Hugging Face model ID, can be any model listed on \u003Chttps:\u002F\u002Fhuggingface.co\u002Fmodels> with the `text-embeddings-inference` tag (meaning it's compatible with Text Embeddings Inference).\n\n          Alternatively, the specified ID 
can also be a path to a local directory containing the necessary model files saved by the `save_pretrained(...)` methods of either Transformers or Sentence Transformers.\n\n          [env: MODEL_ID=]\n\n      --revision \u003CREVISION>\n          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs\u002Fpr\u002F2`\n\n          [env: REVISION=]\n\n      --tokenization-workers \u003CTOKENIZATION_WORKERS>\n          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation. Default to the number of CPU cores on the machine\n\n          [env: TOKENIZATION_WORKERS=]\n\n      --dtype \u003CDTYPE>\n          The dtype to be forced upon the model\n\n          [env: DTYPE=]\n          [possible values: float16, float32]\n\n      --served-model-name \u003CSERVED_MODEL_NAME>\n          The name of the model that is being served. If not specified, defaults to `--model-id`. It is only used for the OpenAI-compatible endpoints via HTTP\n\n          [env: SERVED_MODEL_NAME=]\n\n      --pooling \u003CPOOLING>\n          Optionally control the pooling method for embedding models.\n\n          If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling\u002Fconfig.json` configuration.\n\n          If `pooling` is set, it will override the model pooling configuration\n\n          [env: POOLING=]\n\n          Possible values:\n          - cls:        Select the CLS token as embedding\n          - mean:       Apply Mean pooling to the model embeddings\n          - splade:     Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. 
This option is only available if the loaded model is a `ForMaskedLM` Transformer model\n          - last-token: Select the last token as embedding\n\n      --max-concurrent-requests \u003CMAX_CONCURRENT_REQUESTS>\n          The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly\n\n          [env: MAX_CONCURRENT_REQUESTS=]\n          [default: 512]\n\n      --max-batch-tokens \u003CMAX_BATCH_TOKENS>\n          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.\n\n          This represents the total amount of potential tokens within a batch.\n\n          For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.\n\n          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.\n\n          [env: MAX_BATCH_TOKENS=]\n          [default: 16384]\n\n      --max-batch-requests \u003CMAX_BATCH_REQUESTS>\n          Optionally control the maximum number of individual requests in a batch\n\n          [env: MAX_BATCH_REQUESTS=]\n\n      --max-client-batch-size \u003CMAX_CLIENT_BATCH_SIZE>\n          Control the maximum number of inputs that a client can send in a single request\n\n          [env: MAX_CLIENT_BATCH_SIZE=]\n          [default: 32]\n\n      --auto-truncate\n          Control automatic truncation of inputs that exceed the model's maximum supported size. Defaults to `true` (truncation enabled). 
Set to `false` to disable truncation; when disabled and the model's maximum input length exceeds `--max-batch-tokens`, the server will refuse to start with an error instead of silently truncating sequences.\n\n          Unused for gRPC servers\n\n          [env: AUTO_TRUNCATE=]\n\n      --default-prompt-name \u003CDEFAULT_PROMPT_NAME>\n          The name of the prompt that should be used by default for encoding. If not set, no prompt will be applied.\n\n          Must be a key in the `sentence-transformers` configuration `prompts` dictionary.\n\n          For example if ``default_prompt_name`` is \"query\" and the ``prompts`` is {\"query\": \"query: \", ...}, then the sentence \"What is the capital of France?\" will be encoded as \"query: What is the capital of France?\" because the prompt text will be prepended before any text to encode.\n\n          The argument '--default-prompt-name \u003CDEFAULT_PROMPT_NAME>' cannot be used with '--default-prompt \u003CDEFAULT_PROMPT>`\n\n          [env: DEFAULT_PROMPT_NAME=]\n\n      --default-prompt \u003CDEFAULT_PROMPT>\n          The prompt that should be used by default for encoding. If not set, no prompt will be applied.\n\n          For example if ``default_prompt`` is \"query: \" then the sentence \"What is the capital of France?\" will be encoded as \"query: What is the capital of France?\" because the prompt text will be prepended before any text to encode.\n\n          The argument '--default-prompt \u003CDEFAULT_PROMPT>' cannot be used with '--default-prompt-name \u003CDEFAULT_PROMPT_NAME>`\n\n          [env: DEFAULT_PROMPT=]\n\n      --dense-path \u003CDENSE_PATH>\n          Optionally, define the path to the Dense module required for some embedding models.\n\n          Some embedding models require an extra `Dense` module which contains a single Linear layer and an activation function. 
By default, those `Dense` modules are stored under the `2_Dense` directory, but there might be cases where different `Dense` modules are provided, to convert the pooled embeddings into different dimensions, available as `2_Dense_\u003Cdims>` e.g. https:\u002F\u002Fhuggingface.co\u002FNovaSearch\u002Fstella_en_400M_v5.\n\n          Note that this argument is optional, only required to be set if there is no `modules.json` file or when you want to override a single Dense module path, only when running with the `candle` backend.\n\n          [env: DENSE_PATH=]\n\n      --hf-token \u003CHF_TOKEN>\n          Your Hugging Face Hub token. If neither `--hf-token` nor `HF_TOKEN` are set, the token will be read from the `$HF_HOME\u002Ftoken` path, if it exists. This ensures access to private or gated models, and allows for a more permissive rate limiting\n\n          [env: HF_TOKEN=]\n\n      --hostname \u003CHOSTNAME>\n          The IP address to listen on\n\n          [env: HOSTNAME=]\n          [default: 0.0.0.0]\n\n      -p, --port \u003CPORT>\n          The port to listen on\n\n          [env: PORT=]\n          [default: 3000]\n\n      --uds-path \u003CUDS_PATH>\n          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC\n\n          [env: UDS_PATH=]\n          [default: \u002Ftmp\u002Ftext-embeddings-inference-server]\n\n      --huggingface-hub-cache \u003CHUGGINGFACE_HUB_CACHE>\n          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance\n\n          [env: HUGGINGFACE_HUB_CACHE=]\n\n      --payload-limit \u003CPAYLOAD_LIMIT>\n          Payload size limit in bytes\n\n          Default is 2MB\n\n          [env: PAYLOAD_LIMIT=]\n          [default: 2000000]\n\n      --api-key \u003CAPI_KEY>\n          Set an api key for request authorization.\n\n          By default the server responds to every request. 
With an api key set, the requests must have the Authorization header set with the api key as Bearer token.\n\n          [env: API_KEY=]\n\n      --json-output\n          Outputs the logs in JSON format (useful for telemetry)\n\n          [env: JSON_OUTPUT=]\n\n      --disable-spans\n          Whether or not to include the log trace through spans\n\n          [env: DISABLE_SPANS=]\n\n      --otlp-endpoint \u003COTLP_ENDPOINT>\n          The grpc endpoint for opentelemetry. Telemetry is sent to this endpoint as OTLP over gRPC. e.g. `http:\u002F\u002Flocalhost:4317`\n\n          [env: OTLP_ENDPOINT=]\n\n      --otlp-service-name \u003COTLP_SERVICE_NAME>\n          The service name for opentelemetry. e.g. `text-embeddings-inference.server`\n\n          [env: OTLP_SERVICE_NAME=]\n          [default: text-embeddings-inference.server]\n\n      --prometheus-port \u003CPROMETHEUS_PORT>\n          The Prometheus port to listen on\n\n          [env: PROMETHEUS_PORT=]\n          [default: 9000]\n\n      --cors-allow-origin \u003CCORS_ALLOW_ORIGIN>\n          Unused for gRPC servers\n\n          [env: CORS_ALLOW_ORIGIN=]\n\n  -h, --help\n          Print help (see a summary with '-h')\n\n  -V, --version\n          Print version\n```\n\n### Docker Images\n\nText Embeddings Inference ships with multiple Docker images that you can use to target a specific backend:\n\n| Architecture                           | Platform | Image                                                                   |\n|----------------------------------------|----------|-------------------------------------------------------------------------|\n| CPU                                    | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.9                   |\n| CPU                                    | aarch64  | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-arm64-1.9             |\n| Volta                                  | x86_64   | NOT SUPPORTED                        
                                   |\n| Turing (T4, RTX 2000 series, ...)      | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:turing-1.9 (experimental) |\n| Ampere 8.0 (A100, A30)                 | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:1.9                       |\n| Ampere 8.6 (A10, A40, ...)             | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:86-1.9                    |\n| Ada Lovelace (RTX 4000 series, ...)    | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:89-1.9                    |\n| Hopper (H100)                          | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:hopper-1.9                |\n| Blackwell 10.0 (B200, GB200, ...)      | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:100-1.9 (experimental)    |\n| Blackwell 12.0 (GeForce RTX 50X0, ...) | x86_64   | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:120-1.9 (experimental)    |\n| Blackwell 12.1 (DGX Spark GB10, ...)   | multi    | ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:121-1.9 (experimental)    |\n\n**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.\nYou can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.\n\n### API documentation\n\nYou can consult the OpenAPI documentation of the `text-embeddings-inference` REST API using the `\u002Fdocs` route.\nThe Swagger UI is also available\nat: [https:\u002F\u002Fhuggingface.github.io\u002Ftext-embeddings-inference](https:\u002F\u002Fhuggingface.github.io\u002Ftext-embeddings-inference).\n\n### Using a private or gated model\n\nYou have the option to utilize the `HF_TOKEN` environment variable for configuring the token employed by\n`text-embeddings-inference`. This allows you to gain access to protected resources.\n\nFor example:\n\n1. 
Go to https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens\n2. Copy your CLI READ token\n3. Export `HF_TOKEN=\u003Cyour CLI READ token>`\n\nOr, with Docker:\n\n```shell\nmodel=\u003Cyour private model>\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\ntoken=\u003Cyour CLI READ token>\n\ndocker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id $model\n```\n\n### Air gapped deployment\n\nTo deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside\nthe container using a volume.\n\nFor example:\n\n```shell\n# (Optional) create a `models` directory\nmkdir models\ncd models\n\n# Make sure you have git-lfs installed (https:\u002F\u002Fgit-lfs.com)\ngit lfs install\ngit clone https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-Embedding-0.6B\n\n# Set the models directory as the volume path\nvolume=$PWD\n\n# Mount the models directory inside the container with a volume and set the model ID\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id \u002Fdata\u002FQwen3-Embedding-0.6B\n```\n\n### Using Re-rankers models\n\n`text-embeddings-inference` v0.4.0 added support for CamemBERT, RoBERTa, XLM-RoBERTa, and GTE Sequence Classification models.\nRe-ranker models are Sequence Classification cross-encoder models with a single class that score the similarity\nbetween a query and a text.\n\nSee [this blog post](https:\u002F\u002Fblog.llamaindex.ai\u002Fboosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) by\nthe LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve\ndownstream performance.\n\n```shell\nmodel=BAAI\u002Fbge-reranker-large\nvolume=$PWD\u002Fdata # share a volume with the Docker container to 
avoid downloading weights every run\n\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id $model\n```\n\nAnd then you can rank the similarity between a query and a list of texts with:\n\n```bash\ncurl 127.0.0.1:8080\u002Frerank \\\n    -X POST \\\n    -d '{\"query\": \"What is Deep Learning?\", \"texts\": [\"Deep Learning is not...\", \"Deep learning is...\"]}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\n### Using Sequence Classification models\n\nYou can also use classic Sequence Classification models like `SamLowe\u002Froberta-base-go_emotions`:\n\n```shell\nmodel=SamLowe\u002Froberta-base-go_emotions\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id $model\n```\n\nOnce you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:\n\n```bash\ncurl 127.0.0.1:8080\u002Fpredict \\\n    -X POST \\\n    -d '{\"inputs\":\"I like you.\"}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\n### Using SPLADE pooling\n\nYou can choose to activate SPLADE pooling for Bert and Distilbert MaskedLM architectures:\n\n```shell\nmodel=naver\u002Fefficient-splade-VI-BT-large-query\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id $model --pooling splade\n```\n\nOnce you have deployed the model you can use the `\u002Fembed_sparse` endpoint to get the sparse embedding:\n\n```bash\ncurl 127.0.0.1:8080\u002Fembed_sparse \\\n    -X POST \\\n    -d '{\"inputs\":\"I like you.\"}' \\\n    -H 'Content-Type: 
application\u002Fjson'\n```\n\n### Distributed Tracing\n\n`text-embeddings-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature\nby setting the address to an OTLP collector with the `--otlp-endpoint` argument.\n\n### gRPC\n\n`text-embeddings-inference` offers a gRPC API as an alternative to the default HTTP API for high performance\ndeployments. The API protobuf definition can be\nfound [here](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fblob\u002Fmain\u002Fproto\u002Ftei.proto).\n\nYou can use the gRPC API by adding the `-grpc` tag to any TEI Docker image. For example:\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9-grpc --model-id $model\n```\n\n```shell\ngrpcurl -d '{\"inputs\": \"What is Deep Learning\"}' -plaintext 0.0.0.0:8080 tei.v1.Embed\u002FEmbed\n```\n\n## Local install\n\n### Apple Silicon (Homebrew)\n\nOn Apple Silicon (M1\u002FM2\u002FM3\u002FM4), you can install a prebuilt binary via Homebrew:\n\n```shell\nbrew install text-embeddings-inference\n```\n\nThen launch Text Embeddings Inference with Metal acceleration:\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\n\ntext-embeddings-router --model-id $model --port 8080\n```\n\n### CPU\n\nYou can also opt to install `text-embeddings-inference` locally.\n\nFirst [install Rust](https:\u002F\u002Frustup.rs\u002F):\n\n```shell\ncurl --proto '=https' --tlsv1.2 -sSf https:\u002F\u002Fsh.rustup.rs | sh\n```\n\nThen run:\n\n```shell\n# On x86 with ONNX backend (recommended)\ncargo install --path router -F ort\n# On x86 with Intel backend\ncargo install --path router -F mkl\n# On M1 or M2\ncargo install --path router -F metal\n```\n\nYou can now launch Text Embeddings Inference on CPU 
with:\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\n\ntext-embeddings-router --model-id $model --port 8080\n```\n\n**Note:** On some machines, you may also need the OpenSSL libraries and gcc. On Debian-based Linux machines, run:\n\n```shell\nsudo apt-get install libssl-dev gcc -y\n```\n\n### CUDA\n\nGPUs with CUDA compute capabilities \u003C 7.5 are not supported (V100, Titan V, GTX 1000 series, ...).\n\nMake sure you have CUDA and the NVIDIA drivers installed. NVIDIA drivers on your device need to be compatible with CUDA\nversion 12.2 or higher. You also need to add the NVIDIA binaries to your path:\n\n```shell\nexport PATH=$PATH:\u002Fusr\u002Flocal\u002Fcuda\u002Fbin\n```\n\nThen run the following (might take a while as it needs to compile the CUDA kernels):\n\n```shell\n# On Turing GPUs (T4, RTX 2000 series ... )\ncargo install --path router -F candle-cuda-turing\n\n# On Ampere, Ada Lovelace, Hopper and Blackwell\ncargo install --path router -F candle-cuda\n```\n\nYou can now launch Text Embeddings Inference on GPU as follows:\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\n\ntext-embeddings-router --model-id $model --port 8080\n```\n\n## Docker Build\n\nYou can build the CPU container with Docker as follows:\n\n```shell\ndocker build -f Dockerfile .\n```\n\nTo build the CUDA containers, you need to know the compute capability of the GPU you will be using\nat runtime, to build the image accordingly:\n\n```shell\n# Get submodule dependencies\ngit submodule update --init\n\n# Example for Turing (T4, RTX 2000 series, ...)\nruntime_compute_cap=75\n\n# Example for Ampere (A100, ...)\nruntime_compute_cap=80\n\n# Example for Ampere (A10, ...)\nruntime_compute_cap=86\n\n# Example for Ada Lovelace (RTX 4000 series, ...)\nruntime_compute_cap=89\n\n# Example for Hopper (H100, ...)\nruntime_compute_cap=90\n\n# Example for Blackwell (B200, GB200, ...)\nruntime_compute_cap=100\n\n# Example for Blackwell (GeForce RTX 50X0, RTX PRO 6000, ...)\nruntime_compute_cap=120\n\n# Example for Blackwell GB10 
(DGX Spark)\nruntime_compute_cap=121\n\ndocker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap\n```\n\n### ARM64 \u002F aarch64\n\n#### CPU-only (Apple Silicon, Ampere, Graviton)\n\nFor ARM64 hosts without NVIDIA GPUs, use the CPU Dockerfile. Inference runs on CPU cores\nonly (no Metal\u002FMPS support via Docker).\n\n```shell\ndocker build . -f Dockerfile-arm64 --platform=linux\u002Farm64\n```\n\n#### CUDA on ARM64 (DGX Spark, Jetson)\n\nFor ARM64 hosts with NVIDIA GPUs, build `Dockerfile-cuda` with the appropriate compute\ncapability and `--platform linux\u002Farm64`:\n\n```shell\n# DGX Spark (GB10, sm_121)\ndocker build . -f Dockerfile-cuda \\\n  --build-arg CUDA_COMPUTE_CAP=121 \\\n  --platform linux\u002Farm64\n\n# Future ARM64 + Blackwell devices (sm_120)\ndocker build . -f Dockerfile-cuda \\\n  --build-arg CUDA_COMPUTE_CAP=120 \\\n  --platform linux\u002Farm64\n```\n\n## AMD Instinct GPUs (ROCm) (experimental)\n\nTEI has experimental support for AMD Instinct GPUs (MI200, MI300 series) via ROCm. You can use the `rocm\u002Fpytorch:latest` Docker image or a bare-metal ROCm installation. TEI will auto-detect the GPU at startup.\n\nFor full setup instructions, see the **[AMD Instinct GPU guide](https:\u002F\u002Fhuggingface.github.io\u002Ftext-embeddings-inference\u002Famd_gpu)**.\n\n## Examples\n\n- [Set up an Inference Endpoint with TEI](https:\u002F\u002Fhuggingface.co\u002Flearn\u002Fcookbook\u002Fautomatic_embedding_tei_inference_endpoints)\n- [RAG containers with TEI](https:\u002F\u002Fgithub.com\u002Fplaggy\u002Frag-containers)\n","Text Embeddings Inference is a high-performance inference solution for text embedding models. Written in Rust, it supports many popular text embedding and sequence classification models, such as FlagEmbedding and Ember, and offers features including no model graph compilation step, Metal support for local execution on Macs, small Docker images with fast boot times, token-based dynamic batching, and optimized Transformer code for inference. It is particularly suited to applications that need to process large volumes of text efficiently and extract feature vectors, such as semantic search, recommendation systems, or content classification in natural language processing tasks.","2026-06-11 03:05:00","top_language"]