[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72431":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":14,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},72431,"colpali","illuin-tech\u002Fcolpali","illuin-tech","The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.","https:\u002F\u002Fhuggingface.co\u002Fvidore",null,"Python",2663,256,21,3,0,7,13,42,83.93,"MIT License",false,"main",[5,25,26,27,28,29],"colqwen2","colsmol","information-retrieval","retrieval-augmented-generation","vision-language-model","2026-06-12 04:01:05","# ColPali: Efficient Document Retrieval with Vision Language Models 👀\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2407.01449-b31b1b.svg?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01449)\n[![GitHub](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FViDoRe_Benchmark-100000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002Filluin-tech\u002Fvidore-benchmark)\n[![Hugging 
Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVidore_Hf_Space-FFD21E?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fvidore)\n[![GitHub](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCookbooks-100000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002Ftonywu71\u002Fcolpali-cookbooks)\n\n[![Test](https:\u002F\u002Fgithub.com\u002Filluin-tech\u002Fcolpali\u002Factions\u002Fworkflows\u002Ftest.yml\u002Fbadge.svg?branch=main)](https:\u002F\u002Fgithub.com\u002Filluin-tech\u002Fcolpali\u002Factions\u002Fworkflows\u002Ftest.yml)\n[![Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcolpali-engine?color=%2334D058&label=pypi%20package)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcolpali-engine\u002F)\n[![Downloads](https:\u002F\u002Fstatic.pepy.tech\u002Fbadge\u002Fcolpali-engine)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fcolpali-engine)\n\n---\n\n[[Model card]](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolpali)\n[[ViDoRe Leaderboard]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fvidore\u002Fvidore-leaderboard)\n[[Demo]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmanu\u002FColPali-demo)\n[[Blog Post]](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fmanu\u002Fcolpali)\n\n## Associated Paper\n\nThis repository contains the code used for training and running visual document retrievers introduced by the [*ColPali: Efficient Document Retrieval with Vision Language Models*](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01449) paper. It includes the original ColPali model, based on the ColBERT architecture and the PaliGemma model, along with later ColVision and bi-encoder retriever variants.\n\n## Introduction\n\nWith *ColPali*, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. 
By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.\n\nUsing ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document.\n\n![ColPali Architecture](assets\u002Fcolpali_architecture.webp)\n\n## List of ColVision models\n\n| Model                                                               | Score on [ViDoRe](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fvidore\u002Fvidore-leaderboard) 🏆 | License    | Comments                                                                                                                                                       | Currently supported |\n|---------------------------------------------------------------------|-------------------------------------------------------------------------------|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|\n| [vidore\u002Fcolpali](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolpali)             | 81.3                                                                          | Gemma      | • Based on `google\u002Fpaligemma-3b-mix-448`.\u003Cbr \u002F>• Checkpoint used in the ColPali paper.                                                                         | ❌                   |\n| [vidore\u002Fcolpali-v1.1](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolpali-v1.1)   | 81.5                                                                          | Gemma      | • Based on `google\u002Fpaligemma-3b-mix-448`.\u003Cbr \u002F>• Fix right padding for queries.            
                                                                    | ✅                   |\n| [vidore\u002Fcolpali-v1.2](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolpali-v1.2)   | 83.9                                                                          | Gemma      | • Similar to `vidore\u002Fcolpali-v1.1`.                                                                                                                            | ✅                   |\n| [vidore\u002Fcolpali-v1.3](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolpali-v1.3)   | 84.8                                                                          | Gemma      | • Similar to `vidore\u002Fcolpali-v1.2`.\u003Cbr \u002F>• Trained with a larger effective batch size of 256 for 3 epochs.                                                     | ✅                   |\n| [vidore\u002Fcolqwen2-v0.1](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolqwen2-v0.1) | 87.3                                                                          | Apache 2.0 | • Based on `Qwen\u002FQwen2-VL-2B-Instruct`.\u003Cbr \u002F>• Supports dynamic resolution.\u003Cbr \u002F>• Trained using 768 image patches per page and an effective batch size of 32. | ✅                   |\n| [vidore\u002Fcolqwen2-v1.0](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolqwen2-v1.0) | 89.3                                                                          | Apache 2.0 | • Similar to `vidore\u002Fcolqwen2-v0.1`, but trained with more powerful GPUs and with a larger effective batch size (256).                                         
| ✅                   |\n| [vidore\u002Fcolqwen2.5-v0.1](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolqwen2.5-v0.1) | 88.8                                                                          | Apache 2.0 | • Based on `Qwen\u002FQwen2.5-VL-3B-Instruct`.\u003Cbr \u002F>• Supports dynamic resolution.\u003Cbr \u002F>• Trained using 768 image patches per page and an effective batch size of 32.                                        | ✅                   |\n| [vidore\u002Fcolqwen2.5-v0.2](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002Fcolqwen2.5-v0.2) | 89.4                                                                          | Apache 2.0 | • Similar to `vidore\u002Fcolqwen2.5-v0.1`, but trained with slightly different hyperparameters.                                        | ✅                   |\n| [TomoroAI\u002Ftomoro-colqwen3-embed-4b](https:\u002F\u002Fhuggingface.co\u002FTomoroAI\u002Ftomoro-colqwen3-embed-4b) | 90.6                                                                           | Apache 2.0 | • Based on the Qwen3-VL backbone.\u003Cbr \u002F>• 320-dim ColBERT-style embeddings with dynamic resolution.\u003Cbr \u002F>• Trained for multi-vector document retrieval.          | ✅                   |\n| [athrael-soju\u002Fcolqwen3.5-4.5B-v3](https:\u002F\u002Fhuggingface.co\u002Fathrael-soju\u002Fcolqwen3.5-4.5B-v3) | 90.9                                                                           | Apache 2.0 | • Based on `Qwen\u002FQwen3.5-4B` (hybrid GatedDeltaNet + full-attention).\u003Cbr \u002F>• 320-dim ColBERT-style embeddings.\u003Cbr \u002F>• 4.5B params, LoRA-trained.          | ✅                   |\n| [vidore\u002FcolSmol-256M](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002FcolSmol-256M)   | 80.1                                                                          | Apache 2.0 | • Based on `HuggingFaceTB\u002FSmolVLM-256M-Instruct`.                                                                                             
                 | ✅                   |\n| [vidore\u002FcolSmol-500M](https:\u002F\u002Fhuggingface.co\u002Fvidore\u002FcolSmol-500M)   | 82.3                                                                          | Apache 2.0 | • Based on `HuggingFaceTB\u002FSmolVLM-500M-Instruct`.                                                                                                              | ✅                   |\n| [Cognitive-Lab\u002FColNetraEmbed](https:\u002F\u002Fhuggingface.co\u002FCognitive-Lab\u002FColNetraEmbed) | 86.4                                                                          | Gemma      | • Based on `google\u002Fgemma-3-4b-it`.\u003Cbr \u002F>• Multi-vector late interaction retrieval model.\u003Cbr \u002F>• Multilingual support across 22 languages.      | ✅                   |\n| [Cognitive-Lab\u002FNetraEmbed](https:\u002F\u002Fhuggingface.co\u002FCognitive-Lab\u002FNetraEmbed)       | 81.0                                                                          | Gemma      | • Based on `google\u002Fgemma-3-4b-it`.\u003Cbr \u002F>• Bi-encoder retrieval model.\u003Cbr \u002F>• Supports Matryoshka embeddings (768, 1536, 2560).\u003Cbr \u002F>• Multilingual support across 22 languages. | ✅                   |\n\n## Setup\n\nThe codebase is compatible with Python >=3.10,\u003C3.15 and recent PyTorch versions. To install the package, run:\n\n```bash\npip install colpali-engine # from PyPI\npip install git+https:\u002F\u002Fgithub.com\u002Filluin-tech\u002Fcolpali # from source\n```\n\nMac users using MPS with the ColQwen models have reported errors with torch 2.6.0. 
These errors are fixed by downgrading to torch 2.5.1.\n\n> [!WARNING]\n> For ColPali versions above v1.0, make sure to install the `colpali-engine` package from source or with a version above v0.2.0.\n\n## Development docs\n\n- [Adding a new model family](docs\u002Fadd_model_family.md)\n\n## Usage\n\n### Quick start\n\n```python\nimport torch\nfrom PIL import Image\nfrom transformers.utils.import_utils import is_flash_attn_2_available\n\nfrom colpali_engine.models import ColQwen2, ColQwen2Processor\n\nmodel_name = \"vidore\u002Fcolqwen2-v1.0\"\n\nmodel = ColQwen2.from_pretrained(\n    model_name,\n    torch_dtype=torch.bfloat16,\n    device_map=\"cuda:0\",  # or \"mps\" if on Apple Silicon\n    attn_implementation=\"flash_attention_2\" if is_flash_attn_2_available() else None,\n).eval()\n\nprocessor = ColQwen2Processor.from_pretrained(model_name)\n\n# Your inputs\nimages = [\n    Image.new(\"RGB\", (128, 128), color=\"white\"),\n    Image.new(\"RGB\", (64, 32), color=\"black\"),\n]\nqueries = [\n    \"What is the organizational structure for our R&D department?\",\n    \"Can you provide a breakdown of last year’s financial performance?\",\n]\n\n# Process the inputs\nbatch_images = processor.process_images(images).to(model.device)\nbatch_queries = processor.process_queries(queries).to(model.device)\n\n# Forward pass\nwith torch.no_grad():\n    image_embeddings = model(**batch_images)\n    query_embeddings = model(**batch_queries)\n\nscores = processor.score_multi_vector(query_embeddings, image_embeddings)\n```\n\nWe now support `fast-plaid` experimentally to speed up matching on larger corpora:\n\n```python\n# !pip install --no-deps fast-plaid fastkmeans\n\nfrom torch.utils.data import DataLoader\nfrom tqdm import tqdm\n\n# Process the inputs by batches of 4\ndataloader = DataLoader(\n    dataset=images,\n    batch_size=4,\n    shuffle=False,\n    collate_fn=lambda x: processor.process_images(x),\n)\n\nds = []\nfor batch_doc in tqdm(dataloader):\n    with torch.no_grad():\n        batch_doc = {k: v.to(model.device) for 
k, v in batch_doc.items()}\n        embeddings_doc = model(**batch_doc)\n    ds.extend(list(torch.unbind(embeddings_doc.to(\"cpu\"))))\n\nplaid_index = processor.create_plaid_index(ds)\n\nscores = processor.get_topk_plaid(query_embeddings, plaid_index, k=10)\n```\n\n### Benchmarking\n\nTo benchmark ColPali on the [ViDoRe leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fvidore\u002Fvidore-leaderboard), use the [`vidore-benchmark`](https:\u002F\u002Fgithub.com\u002Filluin-tech\u002Fvidore-benchmark) package.\n\n### Interpretability with similarity maps\n\nBy superimposing the late interaction similarity maps on top of the original image, we can visualize the most salient image patches with respect to each term of the query, yielding interpretable insights into model focus zones.\n\nTo use the `interpretability` module, you need to install the `colpali-engine[interpretability]` package:\n\n```bash\npip install colpali-engine[interpretability]\n```\n\nThen, after generating your embeddings with ColPali, use the following code to plot the similarity maps for each query token:\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🔽 Click to expand code snippet\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n```python\nimport torch\nfrom PIL import Image\n\nfrom colpali_engine.interpretability import (\n    get_similarity_maps_from_embeddings,\n    plot_all_similarity_maps,\n)\nfrom colpali_engine.models import ColPali, ColPaliProcessor\nfrom colpali_engine.utils.torch_utils import get_torch_device\n\nmodel_name = \"vidore\u002Fcolpali-v1.3\"\ndevice = get_torch_device(\"auto\")\n\n# Load the model\nmodel = ColPali.from_pretrained(\n    model_name,\n    torch_dtype=torch.bfloat16,\n    device_map=device,\n).eval()\n\n# Load the processor\nprocessor = ColPaliProcessor.from_pretrained(model_name)\n\n# Load the image and query\nimage = Image.open(\"shift_kazakhstan.jpg\")\nquery = \"Quelle partie de la production pétrolière du Kazakhstan provient de champs en mer ?\"\n\n# 
Preprocess inputs\nbatch_images = processor.process_images([image]).to(device)\nbatch_queries = processor.process_queries([query]).to(device)\n\n# Forward passes\nwith torch.no_grad():\n    image_embeddings = model.forward(**batch_images)\n    query_embeddings = model.forward(**batch_queries)\n\n# Get the number of image patches\nn_patches = processor.get_n_patches(image_size=image.size, patch_size=model.patch_size)\n\n# Get the tensor mask to filter out the embeddings that are not related to the image\nimage_mask = processor.get_image_mask(batch_images)\n\n# Generate the similarity maps\nbatched_similarity_maps = get_similarity_maps_from_embeddings(\n    image_embeddings=image_embeddings,\n    query_embeddings=query_embeddings,\n    n_patches=n_patches,\n    image_mask=image_mask,\n)\n\n# Get the similarity map for our (only) input image\nsimilarity_maps = batched_similarity_maps[0]  # (query_length, n_patches_x, n_patches_y)\n\n# Tokenize the query\nquery_tokens = processor.tokenizer.tokenize(query)\n\n# Plot and save the similarity maps for each query token\nplots = plot_all_similarity_maps(\n    image=image,\n    query_tokens=query_tokens,\n    similarity_maps=similarity_maps,\n)\nfor idx, (fig, ax) in enumerate(plots):\n    fig.savefig(f\"similarity_map_{idx}.png\")\n```\n\n\u003C\u002Fdetails>\n\nFor a more detailed example, you can refer to the interpretability notebooks from the [ColPali Cookbooks 👨🏻‍🍳](https:\u002F\u002Fgithub.com\u002Ftonywu71\u002Fcolpali-cookbooks) repository.\n\n### Token pooling\n\n[Token pooling](https:\u002F\u002Fdoi.org\u002F10.48550\u002FarXiv.2409.14683) is a CRUD-compliant method (document addition\u002Fdeletion-friendly) that reduces the sequence length of multi-vector embeddings. For ColPali, many image patches share redundant information, e.g. white background patches. By pooling these patches together, we can reduce the number of embeddings while retaining most of the page's signal. 
Retrieval performance with hierarchical mean token pooling on image embeddings can be found in the [ColPali paper](https:\u002F\u002Fdoi.org\u002F10.48550\u002FarXiv.2407.01449). In our experiments, we found that a pool factor of 3 offered the optimal trade-off: the total number of vectors is reduced by $66.7\\%$ while $97.8\\%$ of the original performance is maintained.\n\nTo use token pooling, you can use the `HierarchicalTokenPooler` class from the `colpali-engine` package:\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🔽 Click to expand code snippet\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n```python\nimport torch\n\nfrom colpali_engine.compression.token_pooling import HierarchicalTokenPooler\n\n# Dummy multivector embeddings\nlist_embeddings = [\n    torch.rand(10, 768),\n    torch.rand(20, 768),\n]\n\n# Define the pooler with the desired level of compression\npooler = HierarchicalTokenPooler()\n\n# Pool the embeddings\noutputs = pooler.pool_embeddings(list_embeddings, pool_factor=2)\n```\n\nIf your inputs are padded 3D tensor embeddings instead of lists of 2D tensors, use `padding=True` and specify the padding used by your tokenizer to make sure the `HierarchicalTokenPooler` correctly removes the padding values before pooling:\n\n```python\nimport torch\nfrom PIL import Image\nfrom transformers.utils.import_utils import is_flash_attn_2_available\n\nfrom colpali_engine.compression.token_pooling import HierarchicalTokenPooler\nfrom colpali_engine.models import ColQwen2, ColQwen2Processor\n\nmodel_name = \"vidore\u002Fcolqwen2-v1.0\"\nmodel = ColQwen2.from_pretrained(\n    model_name,\n    torch_dtype=torch.bfloat16,\n    device_map=\"cuda:0\",  # or \"mps\" if on Apple Silicon\n    attn_implementation=\"flash_attention_2\" if is_flash_attn_2_available() else None,\n).eval()\nprocessor = ColQwen2Processor.from_pretrained(model_name)\n\ntoken_pooler = HierarchicalTokenPooler()\n\n# Your page images\nimages = [\n    Image.new(\"RGB\", (128, 128), 
color=\"white\"),\n    Image.new(\"RGB\", (32, 32), color=\"black\"),\n]\n\n# Process the inputs\nbatch_images = processor.process_images(images).to(model.device)\n\n# Forward pass\nwith torch.no_grad():\n    image_embeddings = model(**batch_images)\n\n# Apply token pooling (reduces the sequence length of the multi-vector embeddings)\nimage_embeddings = token_pooler.pool_embeddings(\n    image_embeddings,\n    pool_factor=2,\n    padding=True,\n    padding_side=processor.tokenizer.padding_side,\n)\n```\n\n\u003C\u002Fdetails>\n\n### Training\n\nTo keep the package lightweight, only the essential dependencies are installed by default. To use the ColPali training script, install the `train` extra with the following command:\n\n```bash\npip install \"colpali-engine[train]\"\n```\n\nAll the model configs used can be found in `scripts\u002Fconfigs\u002F` and rely on the [configue](https:\u002F\u002Fgithub.com\u002Filluin-tech\u002Fconfigue) package for straightforward configuration. 
They should be used with the `train_colbert.py` script.\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🔽 Example 1: Local training\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n\n```bash\naccelerate launch --multi-gpu scripts\u002Fconfigs\u002Fqwen2\u002Ftrain_colqwen25_model.py\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🔽 Example 2: Training on a SLURM cluster\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n```bash\nsbatch --nodes=1 --cpus-per-task=16 --mem-per-cpu=32GB --time=20:00:00 --gres=gpu:1  -p gpua100 --job-name=colidefics --output=colidefics.out --error=colidefics.err --wrap=\"accelerate launch scripts\u002Ftrain\u002Ftrain_colbert.py scripts\u002Fconfigs\u002Fpali\u002Ftrain_colpali_docmatix_hardneg_model.yaml\"\n\nsbatch --nodes=1  --time=5:00:00 -A cad15443 --gres=gpu:8  --constraint=MI250 --job-name=colpali --wrap=\"accelerate launch --multi-gpu scripts\u002Fconfigs\u002Fqwen2\u002Ftrain_colqwen25_model.py\"\n```\n\n\u003C\u002Fdetails>\n\n## Contributing\n\nWe welcome contributions to ColPali! 🤗\n\nTo contribute to ColPali, first install the development dependencies for proper testing\u002Flinting:\n\n```bash\npip install \"colpali-engine[dev]\"\n```\n\nTo run all the tests, you will have to install all optional dependencies (or you'll get an error in test discovery):\n\n```bash\npip install \"colpali-engine[all]\"\n```\n\nWhen your PR is ready, ping one of the repository maintainers. We will do our best to review it as soon as possible!\n\n## Community Projects\n\nSeveral community projects and resources have been developed around ColPali to facilitate its usage. 
Feel free to reach out if you want to add your project to this list!\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🔽 Libraries 📚\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Library Name  | Description                                                                                                                                                                                                                                          |\n|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |\n| Byaldi        | [`Byaldi`](https:\u002F\u002Fgithub.com\u002FAnswerDotAI\u002Fbyaldi) is [RAGatouille](https:\u002F\u002Fgithub.com\u002FAnswerDotAI\u002FRAGatouille)'s equivalent for ColPali, leveraging the `colpali-engine` package to facilitate indexing and storing embeddings.                      |\n| PyVespa       | [`PyVespa`](https:\u002F\u002Fpyvespa.readthedocs.io\u002Fen\u002Flatest\u002Fexamples\u002Fcolpali-document-retrieval-vision-language-models-cloud.html) allows interaction with [Vespa](https:\u002F\u002Fvespa.ai\u002F), a production-grade vector database, with detailed ColPali support.   |\n| Qdrant | Tutorial about using ColQwen2 with the [Qdrant](https:\u002F\u002Fqdrant.tech\u002Fdocumentation\u002Fadvanced-tutorials\u002Fpdf-retrieval-at-scale\u002F) vector database. |\n| Elastic Search     | Tutorial about using ColPali with the [Elastic Search](https:\u002F\u002Fwww.elastic.co\u002Fsearch-labs\u002Fblog\u002Felastiacsearch-colpali-document-search) vector database. |\n| Weaviate | Tutorial about using multi-vector embeddings with the [Weaviate](https:\u002F\u002Fweaviate.io\u002Fdevelopers\u002Fweaviate\u002Ftutorials\u002Fmulti-vector-embeddings) vector database. 
|\n| Candle        | [Candle](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fcandle\u002Ftree\u002Fmain\u002Fcandle-examples\u002Fexamples\u002Fcolpali) enables ColPali inference with an efficient ML framework for Rust.                                                                                        |\n| EmbedAnything | [`EmbedAnything`](https:\u002F\u002Fgithub.com\u002FStarlightSearch\u002FEmbedAnything) allows end-to-end ColPali inference with both Candle and ONNX backends.                                                                                                          |\n| DocAI         | [DocAI](https:\u002F\u002Fgithub.com\u002FPragmaticMachineLearning\u002Fdocai) uses ColPali with GPT-4o and Langchain to extract structured information from documents.                                                                                                  |\n| VARAG         | [VARAG](https:\u002F\u002Fgithub.com\u002Fadithya-s-k\u002FVARAG) uses ColPali in a vision-only and a hybrid RAG pipeline.                                                                                                                                               |\n| ColBERT Live! | [`ColBERT Live!`](https:\u002F\u002Fgithub.com\u002Fjbellis\u002Fcolbert-live\u002F) enables ColPali usage with vector databases supporting large datasets, compression, and non-vector predicates.                                                                           |\n| ColiVara      | [`ColiVara`](https:\u002F\u002Fgithub.com\u002Ftjmlabs\u002FColiVara\u002F) is a retrieval API that allows you to store, search, and retrieve documents based on their visual embedding. It is a web-first implementation of the ColPali paper using ColQwen2 as the LLM. |\n| BentoML       | Deploy ColPali easily with BentoML using [this example repository](https:\u002F\u002Fgithub.com\u002Fbentoml\u002FBentoColPali). BentoML features adaptive batching and zero-copy I\u002FO to minimize overhead.                    
                                          |\n| NoOCR       | NoOCR is an end-to-end, [open source](https:\u002F\u002Fgithub.com\u002Fkyryl-opens-ml\u002Fno-ocr) solution for complex PDFs, powered by ColPali embeddings. |\n| Astra Multi-vector     | [`Astra-multivector`](https:\u002F\u002Fgithub.com\u002Fbrian-ogrady\u002Fastradb-multivector) provides enterprise-grade integration with AstraDB for late-interaction models like ColPali, ColQwen2, and ColBERT. It implements efficient token pooling and embedding caching strategies to dramatically reduce latency and index size while maintaining retrieval quality. The library leverages Cassandra's distributed architecture for high-throughput vector search at scale. |\n| Mixpeek       | [Mixpeek](https:\u002F\u002Fdocs.mixpeek.com\u002Fprocessing\u002Ffeature-extractors) is a production platform for multimodal late-interaction retrieval. It supports models like ColBERT, ColPali, and ColQwen2 with built-in indexing, versioning, A\u002FB testing, and explainability across image, text, video, and PDF pipelines. 
|\n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🔽 Notebooks 📙\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Notebook Title                                               | Author & Link                                                |\n| ------------------------------------------------------------ | ------------------------------------------------------------ |\n| ColPali Cookbooks                                            | [Tony's Cookbooks (ILLUIN)](https:\u002F\u002Fgithub.com\u002Ftonywu71\u002Fcolpali-cookbooks) 🙋🏻 |\n| Vision RAG Tutorial                                          | [Manu's Vision Rag Tutorial (ILLUIN)](https:\u002F\u002Fgithub.com\u002FManuelFay\u002FTutorials\u002Fblob\u002Fmain\u002FTuesday_Practical_2_Vision_RAG.ipynb) 🙋🏻 |\n| ColPali (Byaldi) + Qwen2-VL for RAG                          | [Merve's Notebook (HuggingFace 🤗)](https:\u002F\u002Fgithub.com\u002Fmerveenoyan\u002Fsmol-vision\u002Fblob\u002Fmain\u002FColPali_%2B_Qwen2_VL.ipynb) |\n| Indexing ColPali with Qdrant                                 | [Daniel's Notebook (HuggingFace 🤗)](https:\u002F\u002Fdanielvanstrien.xyz\u002Fposts\u002Fpost-with-code\u002Fcolpali-qdrant\u002F2024-10-02_using_colpali_with_qdrant.html) |\n| Weaviate Tutorial                                            | [Connor's ColPali POC (Weaviate)](https:\u002F\u002Fgithub.com\u002Fweaviate\u002Frecipes\u002Fblob\u002Fmain\u002Fweaviate-features\u002Fnamed-vectors\u002FNamedVectors-ColPali-POC.ipynb) |\n| Use ColPali for Multi-Modal Retrieval with Milvus            | [Milvus Documentation](https:\u002F\u002Fmilvus.io\u002Fdocs\u002Fuse_ColPali_with_milvus.md) |\n| Data Generation                                              | [Daniel's Notebook (HuggingFace 🤗)](https:\u002F\u002Fdanielvanstrien.xyz\u002Fposts\u002Fpost-with-code\u002Fcolpali\u002F2024-09-23-generate_colpali_dataset.html) |\n| Finance Report Analysis with ColPali and Gemini              | [Jaykumaran 
(LearnOpenCV)](https:\u002F\u002Fgithub.com\u002Fspmallick\u002Flearnopencv\u002Ftree\u002Fmaster\u002FMultimodal-RAG-with-ColPali-Gemini) |\n| Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs) | [Sergio Paniego](https:\u002F\u002Fhuggingface.co\u002Flearn\u002Fcookbook\u002Fmultimodal_rag_using_document_retrieval_and_vlms) |\n| Document Similarity Search with ColPali                      | [Frank Sommers](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ffsommers\u002Fdocumentai\u002Fblob\u002Fmain\u002FDocument_Similarity_with_ColPali_0_2_2_version.ipynb) |\n| End-to-end ColPali inference with EmbedAnything              | [Akshay Ballal (EmbedAnything)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1-Eiaw8wMm8I1n69N1uKOHkmpw3yV22w8?usp=sharing) |\n| ColiVara: A ColPali Retrieval API                            | [A simple RAG Example](https:\u002F\u002Fgithub.com\u002Ftjmlabs\u002FColiVara-docs\u002Fblob\u002Fmain\u002Fcookbook\u002FRAG.ipynb) |\n| Multimodal RAG with Document Retrieval (ColPali), Vision Language Model (ColQwen2) and Amazon Nova | [Suman's Notebook (AWS)](https:\u002F\u002Fgithub.com\u002Fdebnsuma\u002Ffcc-ai-engineering-aws\u002Fblob\u002Fmain\u002F05-multimodal-rag-with-colpali\u002F01-multimodal-retrival-with-colpali-retreve-gen.ipynb) |\n| Multi-vector RAG: Using Weaviate to search a collection of PDF documents | [Weaviate's Notebook](https:\u002F\u002Fgithub.com\u002Fweaviate\u002Frecipes\u002Fblob\u002Fmain\u002Fweaviate-features\u002Fmulti-vector\u002Fmulti-vector-colipali-rag.ipynb) |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🔽 Other resources\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- 📝 = blog post\n- 📋 = PDF \u002F slides\n- 📹 = video\n\n| Title                                                                                    | Author & Link                                                                                
                                                                 |\n|------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| State of AI report 2024                                                                  | [Nathan's report](https:\u002F\u002Fwww.stateof.ai\u002F) 📋                                                                                                                 |\n| Technology Radar Volume 31 (October 2024)                                                | [thoughtworks's report](https:\u002F\u002Fwww.thoughtworks.com\u002Fradar) 📋                                                                                                |\n| LlamaIndex Webinar: ColPali - Efficient Document Retrieval with Vision Language Models   | [LlamaIndex's Youtube video](https:\u002F\u002Fyoutu.be\u002FnzcBvba7mzI?si=WL9MsyiAFJMyEolz) 📹                                                                             |\n| PDF Retrieval with Vision Language Models                                                | [Jo's blog post #1 (Vespa)](https:\u002F\u002Fblog.vespa.ai\u002Fretrieval-with-vision-language-models-colpali\u002F) 📝                                                          |\n| Scaling ColPali to billions of PDFs with Vespa                                           | [Jo's blog post #2 (Vespa)](https:\u002F\u002Fblog.vespa.ai\u002Fscaling-colpali-to-billions\u002F) 📝                                                                            |\n| Neural Search Talks: ColPali (with Manuel Faysse)                                        | [Zeta Alpha's Podcast](https:\u002F\u002Fopen.spotify.com\u002Fepisode\u002F2s6ljhd6VQTL2mIU9cFzCb) 📹                                                                            |\n| Multimodal Document RAG with Llama 3.2 Vision and 
ColQwen2                               | [Zain's blog post (Together AI)](https:\u002F\u002Fwww.together.ai\u002Fblog\u002Fmultimodal-document-rag-with-llama-3-2-vision-and-colqwen2) 📝                                  |\n| ColPali: Document Retrieval with Vision Language Models                                  | [Antaripa Saha](https:\u002F\u002Fantaripasaha.notion.site\u002FColPali-Efficient-Document-Retrieval-with-Vision-Language-Models-10f5314a5639803d94d0d7ac191bb5b1) 📝        |\n| Minimalist diagrams explaining ColPali                                                   | [Leonie's ColPali diagrams on X](https:\u002F\u002Ftwitter.com\u002Fhelloiamleonie\u002Fstatus\u002F1839321865195851859)📝                                                            |\n| Multimodal RAG with ColPali and Gemini : Financial Report Analysis Application           | [Jaykumaran's blog post (LearnOpenCV)](https:\u002F\u002Flearnopencv.com\u002Fmultimodal-rag-with-colpali\u002F) 📝                                                               |\n| Implement Multimodal RAG with ColPali and Vision Language Model Groq(Llava) and Qwen2-VL | [Plaban's blog post](https:\u002F\u002Fmedium.com\u002Fthe-ai-forum\u002Fimplement-multimodal-rag-with-colpali-and-vision-language-model-groq-llava-and-qwen2-vl-5c113b8c08fd) 📝 |\n| multimodal AI. open-source. in a nutshell.                                               
| [Merve's Youtube video](https:\u002F\u002Fyoutu.be\u002FIoGaGfU1CIg?si=yEhxMqJYxvMzGyUm) 📹                                                                                  |\n| Remove Complexity from Your RAG Applications                                             | [Kyryl's blog post (KOML)](https:\u002F\u002Fkyrylai.com\u002F2024\u002F09\u002F09\u002Fremove-complexity-from-your-rag-applications\u002F) 📝                                                   |\n| Late interaction & efficient Multi-modal retrievers need more than a vector index        | [Ayush Chaurasia (LanceDB)](https:\u002F\u002Fblog.lancedb.com\u002Flate-interaction-efficient-multi-modal-retrievers-need-more-than-just-a-vector-index\u002F) 📝                |\n| Optimizing Document Retrieval with ColPali and Qdrant's Binary Quantization              | [Sabrina Aquino (Qdrant)](https:\u002F\u002Fyoutu.be\u002F_A90A-grwIc?si=MS5RV17D6sgirCRm) 📹                                                                                |\n| Hands-On Multimodal Retrieval and Interpretability (ColQwen + Vespa)                     | [Antaripa Saha](https:\u002F\u002Fwww.analyticsvidhya.com\u002Fblog\u002F2024\u002F10\u002Fmultimodal-retrieval-with-colqwen-vespa\u002F) 📝                                                     |\n\n\u003C\u002Fdetails>\n\n## Paper result reproduction\n\nTo reproduce the results from the paper, check out the `v0.1.1` tag or install the corresponding `colpali-engine` package release using:\n\n```bash\npip install colpali-engine==0.1.1\n```\n\n## Citation\n\n**ColPali: Efficient Document Retrieval with Vision Language Models**  \n\nAuthors: **Manuel Faysse**\\*, **Hugues Sibille**\\*, **Tony Wu**\\*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (\\* denotes equal contribution)\n\n```latex\n@misc{faysse2024colpaliefficientdocumentretrieval,\n      title={ColPali: Efficient Document Retrieval with Vision Language Models}, \n      author={Manuel Faysse and Hugues 
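Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},\n      year={2024},\n      eprint={2407.01449},\n      archivePrefix={arXiv},\n      primaryClass={cs.IR},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01449}, \n}\n\n@misc{macé2025vidorebenchmarkv2raising,\n      title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval}, \n      author={Quentin Macé and António Loison and Manuel Faysse},\n      year={2025},\n      eprint={2505.17166},\n      archivePrefix={arXiv},\n      primaryClass={cs.IR},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17166}, \n}\n```\n","ColPali is a project that uses vision language models for document retrieval. Its core function is to produce multi-vector representations of documents by linearly projecting the ViT output patches of PaliGemma-3B, and to train the model to maximize the similarity between these document embeddings and query embeddings, enabling efficient document retrieval. Built on the ColBERT architecture, it handles both the textual and visual content of documents (such as layout and charts) without complex layout recognition or OCR pipelines. It suits scenarios that require fast, accurate retrieval from large collections of visual documents, such as searching legal documents, research papers, or historical archives.",2,"2026-06-11 03:42:01","high_star"]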