[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72377":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72377,"NeMo-Retriever","NVIDIA\u002FNeMo-Retriever","NVIDIA","NeMo Retriever Library is a scalable, performance-oriented document content and metadata extraction microservice. NeMo Retriever Library uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images that you can use in downstream generative applications.","https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Flatest\u002Fextraction\u002Foverview\u002F",null,"Python",2936,324,28,119,0,3,5,13,9,69.34,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:04","\u003C!--\nSPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.\nAll rights reserved.\nSPDX-License-Identifier: Apache-2.0\n-->\n\n**Important: The default branch is main, which tracks active development and may be ahead of the latest supported release.**\n\nFor the latest stable release use the [release\u002F26.03 branch](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Retriever\u002Ftree\u002F26.03).\n\nSee the corresponding [NeMo Retriever Library documentation](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Flatest\u002Fextraction\u002Foverview\u002F).\n\n# NeMo Retriever Library\n\nNeMo Retriever Library is a scalable, 
performance-oriented framework for document content and metadata extraction. It supports both NVIDIA NIM microservices and a wide range of models to find, contextualize, and extract text, tables, charts, and infographics for use in downstream generative and retrieval-augmented applications.\n\n> [!Note]\n> NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.\n\nNeMo Retriever Library parallelizes the splitting of documents into pages, where artifacts (such as text, tables, charts, and infographics) are classified, extracted, and further contextualized through optical character recognition (OCR) into a well-defined JSON schema. From there, NeMo Retriever Library manages the computation of embeddings for the extracted content and stores them in [LanceDB](https:\u002F\u002Flancedb.com\u002F).\n\nThe following diagram shows the NeMo Retriever Library pipeline.\n\n![Pipeline Overview](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Fextraction\u002Fimages\u002Foverview-extraction.png)\n\nFor production-level performance and scalability, we recommend that you deploy the pipeline and supporting NIMs by using Kubernetes ([helm charts](nemo_retriever\u002Fhelm)). For more information, refer to [prerequisites](https:\u002F\u002Fdocs.nvidia.com\u002Fnv-ingest\u002Fuser-guide\u002Fgetting-started\u002Fprerequisites).\n\n*Note*:\nAlong with the recent repo name change, we're phasing out the nv-ingest APIs and simplifying the dependencies. 
You can follow this work and see the forward-looking API via the [nemo_retriever](nemo_retriever) library subfolder.\n\n\n## Typical Use\n\nFor small-scale workloads, such as workloads of fewer than 100 PDFs, you can use our in-development library setup, which works with HuggingFace models on local GPUs or with NIMs hosted on build.nvidia.com.\n\nAfter [following the quickstart installation steps](nemo_retriever), you can start ingesting content with the following snippet:\n```python\nfrom nemo_retriever import create_ingestor\nfrom nemo_retriever.io import to_markdown, to_markdown_by_page\nfrom pathlib import Path\n\ndocuments = [str(Path(\"..\u002Fdata\u002Fmultimodal_test.pdf\"))]\ningestor = create_ingestor(run_mode=\"batch\")\n\n# ingestion tasks are chainable and defined lazily\ningestor = (\n  ingestor.files(documents)\n  .extract(\n    # below are the default values, but content types can be controlled\n    extract_text=True,\n    extract_charts=True,\n    extract_tables=True,\n    extract_infographics=True\n  )\n  .embed()\n  .vdb_upload()\n)\n\n# ingestor.ingest() actually executes the pipeline\n# batch run_mode returns a ray.data.Dataset; inprocess returns a pandas DataFrame\ndataset = ingestor.ingest()\nchunks = dataset.take_all()  # Ray Dataset API (batch mode)\n```\n\nYou can see the extracted text that represents the content of the ingested test document.\n\n```python\n# page 1 raw text:\n>>> chunks[0][\"text\"]\n'TestingDocument\\r\\nA sample document with headings and placeholder text\\r\\nIntroduction\\r\\nThis is a placeholder document that can be used for any purpose...'\n\n# markdown formatted table from the first page\n>>> chunks[1][\"text\"]\n'| Table | 1 |\\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\\n| locations. 
|\\n| Animal | Activity | Place |\\n| Giraffe | Driving | a | car | At | the | beach |\\n| Lion | Putting | on | sunscreen | At | the | park |\\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\\n| Dog | Chasing | a | squirrel | In | the | front | yard |\\n| Chart | 1 |'\n\n# a chart from the first page\n>>> chunks[2][\"text\"]\n'Chart 1\\nThis chart shows some gadgets, and some very fictitious costs.\\nGadgets and their cost\\n$160.00\\n$140.00\\n$120.00\\n$100.00\\nDollars\\n$80.00\\n$60.00\\n$40.00\\n$20.00\\n$-\\nPowerdrill\\nBluetooth speaker\\nMinifridge\\nPremium desk fan\\nHammer\\nCost'\n\n# markdown formatting for full pages or documents:\n# document results are keyed by source filename\n>>> to_markdown_by_page(chunks).keys()\ndict_keys(['multimodal_test.pdf'])\n\n# results per document are keyed by page number\n>>> to_markdown_by_page(chunks)[\"multimodal_test.pdf\"].keys()\ndict_keys([1, 2, 3])\n\n>>> to_markdown_by_page(chunks)[\"multimodal_test.pdf\"][1]\n'TestingDocument\\r\\nA sample document with headings and placeholder text\\r\\nIntroduction\\r\\nThis is a placeholder document that can be used for any purpose. It contains some \\r\\nheadings and some placeholder text to fill the space. The text is not important and contains \\r\\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \\r\\nthat we can use to confirm Ingest is working as expected.\\r\\nTable 1\\r\\nThis table describes some animals, and some activities they might be doing in specific \\r\\nlocations.\\r\\nAnimal Activity Place\\r\\nGira@e Driving a car At the beach\\r\\nLion Putting on sunscreen At the park\\r\\nCat Jumping onto a laptop In a home o@ice\\r\\nDog Chasing a squirrel In the front yard\\r\\nChart 1\\r\\nThis chart shows some gadgets, and some very fictitious costs.\\n\\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\\n| locations. 
|\\n| Animal | Activity | Place |\\n| Giraffe | Driving | a | car | At | the | beach |\\n| Lion | Putting | on | sunscreen | At | the | park |\\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\\n| Dog | Chasing | a | squirrel | In | the | front | yard |\\n| Chart | 1 |\\n\\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\\n\\n### Table 1\\n\\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\\n| locations. |\\n| Animal | Activity | Place |\\n| Giraffe | Driving | a | car | At | the | beach |\\n| Lion | Putting | on | sunscreen | At | the | park |\\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\\n| Dog | Chasing | a | squirrel | In | the | front | yard |\\n| Chart | 1 |\\n\\n### Chart 1\\n\\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\\n\\n### Table 2\\n\\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\\n| locations. |\\n| Animal | Activity | Place |\\n| Giraffe | Driving | a | car | At | the | beach |\\n| Lion | Putting | on | sunscreen | At | the | park |\\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\\n| Dog | Chasing | a | squirrel | In | the | front | yard |\\n| Chart | 1 |\\n\\n### Chart 2\\n\\nChart 1 This chart shows some gadgets, and some very fictitious costs. 
Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\\n\\n### Table 3\\n\\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\\n| locations. |\\n| Animal | Activity | Place |\\n| Giraffe | Driving | a | car | At | the | beach |\\n| Lion | Putting | on | sunscreen | At | the | park |\\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\\n| Dog | Chasing | a | squirrel | In | the | front | yard |\\n| Chart | 1 |\\n\\n### Chart 3\\n\\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost'\n\n# full document markdown also keyed by source filename\n>>> to_markdown(chunks).keys()\ndict_keys(['multimodal_test.pdf'])\n```\n\n### Query Ingested Content\n\nTo query for relevant snippets of the ingested content, and use them with an LLM to generate answers, use the following code.\n\n```python\nfrom nemo_retriever.retriever import Retriever\nfrom openai import OpenAI\nimport os\n\nretriever = Retriever()\n\nquery = \"Given their activities, which animal is responsible for the typos in my documents?\"\n\n# you can also submit a list with retriever.queries[...]\nhits = retriever.query(query)\n\nclient = OpenAI(\n  base_url = \"https:\u002F\u002Fintegrate.api.nvidia.com\u002Fv1\",\n  api_key = os.environ.get(\"NVIDIA_API_KEY\")\n)\n\nhit_texts = [hit[\"text\"] for hit in hits]\nprompt = f\"\"\"\nGiven the following retrieved documents, answer the question: {query}\n\nDocuments:\n{hit_texts}\n\"\"\"\n\ncompletion = client.chat.completions.create(\n  model=\"nvidia\u002Fnemotron-3-super-120b-a12b\",\n  messages=[{\"role\":\"user\",\"content\":prompt}],\n  stream=False\n)\n\nanswer = 
completion.choices[0].message.content\nprint(answer)\n```\n\nAnswer:\n```shell\nCat is the animal whose activity (jumping onto a laptop) matches the location of the typos, so the cat is responsible for the typos in the documents.\n```\n\n> [!TIP]\n> Beyond inspecting the results, you can load them into retrieval pipelines such as [llama-index](examples\u002Fllama_index_multimodal_rag.ipynb) or [langchain](examples\u002Flangchain_multimodal_rag.ipynb).\n>\n> Also check out our [demo using a retrieval pipeline on build.nvidia.com](https:\u002F\u002Fbuild.nvidia.com\u002Fnvidia\u002Fmultimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted with NVIDIA Ingest.\n\n## Documentation Resources\n\n- **[Official Documentation](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Fextraction\u002F)** - Complete user guides, API references, and deployment instructions\n- **[Getting Started Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Fextraction\u002Foverview\u002F)** - Overview and prerequisites for production deployments\n- **[Benchmarking Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Fextraction\u002Fbenchmarking\u002F)** - Performance testing and recall evaluation framework\n- **[MIG Deployment](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Fextraction\u002Fmig-benchmarking\u002F)** - Multi-Instance GPU configurations for Kubernetes\n- **[API Documentation](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fretriever\u002Fextraction\u002Fapi\u002F)** - Python client and API reference\n\n## Notices\n\n### Third Party License Notice:\n\nIf configured to do so, this project will download and install additional third-party open source software projects.\nReview the license terms of these open source projects before use:\n\nhttps:\u002F\u002Fpypi.org\u002Fproject\u002Fpdfservices-sdk\u002F\n\n- **`INSTALL_ADOBE_SDK`**:\n  - **Description**: If set to 
`true`, the Adobe SDK will be installed in the container at launch time. This is\n    required if you want to use the Adobe extraction service for PDF decomposition. Please review the\n    [license agreement](https:\u002F\u002Fgithub.com\u002Fadobe\u002Fpdfservices-python-sdk?tab=License-1-ov-file) for the\n    pdfservices-sdk before enabling this option.\n- **Built With Llama**:\n  - **Description**: The NV-Ingest container comes with the `meta-llama\u002FLlama-3.2-1B` tokenizer pre-downloaded so \n    that the split task can use it for token-based splitting without making a network request. The [Llama 3.2 Community License Agreement](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-1B\u002Fblob\u002Fmain\u002FLICENSE.txt) governs your use of these Llama materials.\n    \n    If you're building the container yourself and want to pre-download this model, you'll first need to set \n    `DOWNLOAD_LLAMA_TOKENIZER` to `True`. Because this is a gated model, you'll also need to \n    [request access](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-1B) and set `HF_ACCESS_TOKEN` to your HuggingFace \n    access token in order to use it.\n\nBefore contributing to this project, please review our [Contributor Guide](contributing.md).\n\n## Security Considerations\n\n- NeMo Retriever Extraction doesn't generate any code that may require sandboxing.\n- NeMo Retriever Extraction is shared as a reference and is provided \"as is\". The security in the production environment is the responsibility of the end users deploying it. 
When deploying in a production environment, please have security experts review any potential risks and threats; define the trust boundaries; implement logging and monitoring capabilities; secure the communication channels; integrate AuthN & AuthZ with appropriate access controls; keep the deployment up to date; and ensure that the containers and source code are secure and free of known vulnerabilities.\n- A frontend that handles AuthN & AuthZ should be in place, because without it a deployment exposed directly to, for example, the internet could provide ungated access to customer models, resulting in cost to the customer, resource exhaustion, or denial of service.\n- NeMo Retriever Extraction doesn't require any privileged access to the system.\n- The end users are responsible for ensuring the availability of their deployment.\n- The end users are responsible for building the container images and keeping them up to date.\n- The end users are responsible for ensuring that OSS packages used by the developer blueprint are current.\n- The logs from the nginx proxy, backend, and demo app are printed to standard output. They can include input prompts and output completions for development purposes. The end users are advised to handle logging securely and avoid information leakage for production use cases.\n","NeMo Retriever Library is a scalable, performance-oriented microservice framework for document content and metadata extraction. It uses NVIDIA NIM microservices and a variety of models to find, contextualize, and extract text, tables, charts, images, and other content for use in downstream generative applications. The library supports splitting documents into pages and classifying and extracting their elements through optical character recognition, with results output in JSON format. In addition, NeMo Retriever can compute embeddings for the extracted content and store them in LanceDB. For scenarios that require high performance and scalability, deploying the services with Kubernetes is recommended. This project is particularly well suited to large-scale document processing tasks or building retrieval-augmented applications.",2,"2026-06-11 03:41:35","high_star"]