[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72618":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":44,"readmeContent":45,"aiSummary":46,"trendingCount":15,"starSnapshotCount":15,"syncStatus":47,"lastSyncTime":48,"discoverSource":49},72618,"docext","NanoNets\u002Fdocext","NanoNets","An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https:\u002F\u002Fidp-leaderboard.org\u002F)","https:\u002F\u002Fnanonets.com\u002Fdocument-parsing-and-extraction",null,"Python",2021,144,20,0,4,18,28.48,"Apache License 2.0",false,"main",true,[24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43],"document","document-analysis","document-data-extraction","document-information-extraction","extraction","llm-ocr","llms","machine-learning","nlp","ocr","ocr-benchmark","ocr-onpremise","onprem","onprem-ocr","onprem-vision","onpremise","rag","table-extraction","unstructured-data","vlms","2026-06-12 02:03:05","\u003Ch1 align=\"center\">docext\u003C\u002Fh1>\n\n\n\u003Cp align=\"center\">\u003Cem>An on-premises document information extraction and benchmarking toolkit.\u003C\u002Fem>\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fdocext\">\n    \u003Cimg src=\"https:\u002F\u002Fstatic.pepy.tech\u002Fbadge\u002Fdocext\" alt=\"PyPI Downloads\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0\">\n    \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg\" alt=\"License\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1r1asxGeezfWnJvw8jimfFAB2sGjk1HdM?usp=sharing\">\n    \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fdocext\u002F\">\n    \u003Cimg alt=\"PyPI - Version\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fdocext\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C!-- ![Demo Docext](https:\u002F\u002Fraw.githubusercontent.com\u002FNanoNets\u002Fdocext\u002Fmain\u002Fassets\u002Fpdf2markdown.jpg) -->\n![Demo Docext](assets\u002Fpdf2markdown.png)\n\n\n## New Model Release: Nanonets-OCR-s\n\n**We're excited to announce the release of Nanonets-OCR-s, a compact 3B parameter model specifically trained for efficient image to markdown conversion with semantic understanding for images, signatures, watermarks, etc.!**\n\n  📢 [Read the full announcement](https:\u002F\u002Fnanonets.com\u002Fresearch\u002Fnanonets-ocr-s) | 🤗 [Hugging Face model](https:\u002F\u002Fhuggingface.co\u002Fnanonets\u002FNanonets-OCR-s)\n\n## Overview\n\ndocext is a comprehensive on-premises document intelligence toolkit powered by vision-language models (VLMs). It provides three core capabilities:\n\n**📄 PDF & Image to Markdown Conversion**: Transform documents into structured markdown with intelligent content recognition, including LaTeX equations, signatures, watermarks, tables, and semantic tagging.\n\n**🔍 Document Information Extraction**: OCR-free extraction of structured information (fields, tables, etc.) 
from documents such as invoices, passports, and other document types, with confidence scoring.\n\n**📊 Intelligent Document Processing Leaderboard**: A comprehensive benchmarking platform that tracks and evaluates vision-language model performance across OCR, Key Information Extraction (KIE), document classification, table extraction, and other intelligent document processing tasks.\n\n\n## Features\n### PDF and Image to Markdown\nConvert both PDFs and images to markdown with content recognition and semantic tagging.\n- **LaTeX Equation Recognition**: Convert both inline and block LaTeX equations in images to markdown.\n- **Intelligent Image Description**: Generate a detailed description for all images in the document within `\u003Cimg>\u003C\u002Fimg>` tags.\n- **Signature Detection**: Detect and mark signatures in the document. Signature text is extracted within `\u003Csignature>\u003C\u002Fsignature>` tags.\n- **Watermark Detection**: Detect and mark watermarks in the document. Watermark text is extracted within `\u003Cwatermark>\u003C\u002Fwatermark>` tags.\n- **Page Number Detection**: Detect and mark page numbers in the document. 
Page numbers are extracted within `\u003Cpage_number>\u003C\u002Fpage_number>` tags.\n- **Checkboxes and Radio Buttons**: Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒).\n- **Table Detection**: Convert complex tables into html tables.\n\n🔍 For in-depth information, see the [release blog](https:\u002F\u002Fnanonets.com\u002Fresearch\u002Fnanonets-ocr-s\u002F).\n\nFor setup instructions and additional details, check out the full feature guide for the [pdf to markdown](https:\u002F\u002Fgithub.com\u002FNanoNets\u002Fdocext\u002Fblob\u002Fmain\u002FPDF2MD_README.md).\n\n### Intelligent Document Processing Leaderboard\nThis benchmark evaluates performance across seven key document intelligence challenges:\n\n- **Key Information Extraction (KIE)**: Extract structured fields from unstructured document text.\n- **Visual Question Answering (VQA)**: Assess understanding of document content via question-answering.\n- **Optical Character Recognition (OCR)**: Measure accuracy in recognizing printed and handwritten text.\n- **Document Classification**: Evaluate how accurately models categorize various document types.\n- **Long Document Processing**: Test models' reasoning over lengthy, context-rich documents.\n- **Table Extraction**: Benchmark structured data extraction from complex tabular formats.\n- **Confidence Score Calibration**: Evaluate the reliability and confidence of model predictions.\n\n🔍 For in-depth information, see the [release blog](https:\u002F\u002Fidp-leaderboard.org\u002Fdetails\u002F).\n\n📊 **Live leaderboard:** [https:\u002F\u002Fidp-leaderboard.org](https:\u002F\u002Fidp-leaderboard.org)\n\nFor setup instructions and additional details, check out the full feature guide for the [Intelligent Document Processing Leaderboard](https:\u002F\u002Fgithub.com\u002FNanoNets\u002Fdocext\u002Ftree\u002Fmain\u002Fdocext\u002Fbenchmark).\n\n### Docext\n- **Flexible extraction**: Define custom fields or use pre-built templates\n- 
**Table extraction**: Extract structured tabular data from documents\n- **Confidence scoring**: Get confidence levels for extracted information\n- **On-premises deployment**: Run entirely on your own infrastructure (Linux, MacOS)\n- **Multi-page support**: Process documents with multiple pages\n- **REST API**: Programmatic access for integration with your applications\n- **Pre-built templates**: Ready-to-use templates for common document types:\n  - Invoices\n  - Passports\n  - Add\u002Fdelete new fields\u002Fcolumns for other templates.\n\nFor more details (Installation, Usage, and so on), please check out the [feature guide](https:\u002F\u002Fgithub.com\u002FNanoNets\u002Fdocext\u002Fblob\u002Fmain\u002FEXT_README.md).\n\n## Change Log\n\n### Latest Updates\n- **12-06-2025** - Added pdf and image to markdown support.\n- **06-06-2025** - Added `gemini-2.5-pro-preview-06-05` evaluation metrics to the leaderboard.\n- **04-06-2025** - Added support for PDF and multiple documents in `docext` extraction.\n\n\u003Cdetails>\n\u003Csummary>Older Changes\u003C\u002Fsummary>\n\n- **23-05-2025** – Added `gemini-2.5-pro-preview-03-25`, `claude-sonnet-4` evaluation metrics to the leaderboard.\n- **17-05-2025** – Added `InternVL3-38B-Instruct`, `qwen2.5-vl-32b-instruct` evaluation metrics to the leaderboard.\n- **16-05-2025** – Added `gemma-3-27b-it` evaluation metrics to the leaderboard.\n- **12-05-2025** – Added `Claude 3.7 sonnet`, `mistral-medium-3` evaluation metrics to the leaderboard.\n\u003C\u002Fdetails>\n\n## About\n\ndocext is developed by [Nanonets](https:\u002F\u002Fnanonets.com\u002Fdocument-parsing-and-extraction), a leader in document AI and intelligent document processing solutions. Nanonets is committed to advancing the field of document understanding through open-source contributions and innovative AI technologies. 
If you are looking for information extraction solutions for your business, please visit [our website](https:\u002F\u002Fnanonets.com\u002Fdocument-parsing-and-extraction) to learn more.\n\n## Contributing\n\nWe welcome contributions! Please see [contribution.md](https:\u002F\u002Fgithub.com\u002FNanoNets\u002Fdocext\u002Fblob\u002Fmain\u002Fcontribution.md) for guidelines.\nIf you have a feature request or need support for a new model, feel free to open an issue; we'd love to discuss it further!\n\n## Troubleshooting\n\nIf you encounter any issues while using `docext`, please refer to our [Troubleshooting guide](https:\u002F\u002Fgithub.com\u002FNanoNets\u002Fdocext\u002Fblob\u002Fmain\u002FTroubleshooting.md) for common problems and solutions.\n\n\n## License\n\nThis project is licensed under the Apache License 2.0 - see the LICENSE file for details.\n","docext is an on-premises toolkit for document information extraction and benchmarking that does not rely on OCR. It uses vision-language models (VLMs) to convert PDFs and images to Markdown, with intelligent recognition and tagging of elements such as LaTeX equations, signatures, and watermarks, and it supports OCR-free extraction of structured information such as tables and fields, with confidence scoring. The project also maintains a comprehensive evaluation platform that tracks and evaluates model performance across document processing tasks. It suits scenarios that require efficient conversion and analysis of unstructured data, such as enterprise document management and organizing academic papers.",2,"2026-06-11 03:42:49","high_star"]