[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71962":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},71962,"PDF-Extract-Kit","opendatalab\u002FPDF-Extract-Kit","opendatalab","A Comprehensive Toolkit for High-Quality PDF Content Extraction","https:\u002F\u002Fpdf-extract-kit.readthedocs.io\u002Fzh-cn\u002Flatest\u002Findex.html",null,"Python",9714,732,61,96,0,14,31,60,42,39.6,"GNU Affero General Public License v3.0",false,"main",true,[],"2026-06-12 02:02:56","\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Freadme\u002Fpdf-extract-kit_logo.png\" width=\"220px\" style=\"vertical-align:middle;\">\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n\nEnglish | [简体中文](.\u002FREADME_zh-CN.md)\n\n[PDF-Extract-Kit-1.0 Tutorial](https:\u002F\u002Fpdf-extract-kit.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fpretrained_model.html)\n\n[[Models (🤗Hugging Face)]](https:\u002F\u002Fhuggingface.co\u002Fopendatalab\u002FPDF-Extract-Kit-1.0) | [[Models(\u003Cimg src=\".\u002Fassets\u002Freadme\u002Fmodelscope_logo.png\" width=\"20px\">ModelScope)]](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenDataLab\u002FPDF-Extract-Kit-1.0) \n \n🔥🔥🔥 [MinerU: Efficient Document Content Extraction Tool Based on PDF-Extract-Kit](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU)\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    👋 join us 
on \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FTdedn9GTXq\" target=\"_blank\">Discord\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fr.vansin.top\u002F?r=MinerU\" target=\"_blank\">WeChat\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n## Overview\n\n`PDF-Extract-Kit` is a powerful open-source toolkit designed to efficiently extract high-quality content from complex and diverse PDF documents. Here are its main features and advantages:\n\n- **Integration of Leading Document Parsing Models**: Incorporates state-of-the-art models for layout detection, formula detection, formula recognition, OCR, and other core document parsing tasks.\n- **High-Quality Parsing Across Diverse Documents**: Fine-tuned with diverse document annotation data to deliver high-quality results across various complex document types.\n- **Modular Design**: The flexible modular design allows users to easily combine and construct various applications by modifying configuration files and minimal code, making application building as straightforward as stacking blocks.\n- **Comprehensive Evaluation Benchmarks**: Provides diverse and comprehensive PDF evaluation benchmarks, enabling users to choose the most suitable model based on evaluation results.\n\n**Experience PDF-Extract-Kit now and unlock the limitless potential of PDF documents!**\n\n> **Note:** PDF-Extract-Kit is designed for high-quality document processing and functions as a model toolbox.    \n> If you are interested in extracting high-quality document content (e.g., converting PDFs to Markdown), please use [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU), which combines the high-quality predictions from PDF-Extract-Kit with specialized engineering optimizations for more convenient and efficient content extraction.    
\n> If you're a developer looking to create engaging applications such as document translation, document Q&A, or document assistants, you'll find it very convenient to build your own projects using PDF-Extract-Kit. In particular, we will periodically update the PDF-Extract-Kit\u002Fproject directory with interesting applications, so stay tuned!\n\n**We welcome researchers and engineers from the community to contribute outstanding models and innovative applications by submitting PRs to become contributors to the PDF-Extract-Kit project.**\n\n## Model Overview\n\n| **Task Type**     | **Description**                                                                 | **Models**                    |\n|-------------------|---------------------------------------------------------------------------------|-------------------------------|\n| **Layout Detection** | Locate different elements in a document: including images, tables, text, titles, formulas | `DocLayout-YOLO_ft`, `YOLO-v10_ft`, `LayoutLMv3_ft` | \n| **Formula Detection** | Locate formulas in documents: including inline and block formulas            | `YOLOv8_ft`                   |  \n| **Formula Recognition** | Recognize formula images into LaTeX source code                             | `UniMERNet`                   |  \n| **OCR**           | Extract text content from images (including location and recognition)            | `PaddleOCR`                   | \n| **Table Recognition** | Recognize table images into corresponding source code (LaTeX\u002FHTML\u002FMarkdown)   | `PaddleOCR+TableMaster`, `StructEqTable` |  \n| **Reading Order** | Sort and concatenate discrete text paragraphs                                    | Coming Soon!                  
| \n\n## News and Updates\n- `2024.10.22` 🎉🎉🎉 We are excited to announce that the table recognition model [StructTable-InternVL2-1B](https:\u002F\u002Fhuggingface.co\u002FU4R\u002FStructTable-InternVL2-1B), which supports LaTeX, HTML, and Markdown output formats, has been officially integrated into `PDF-Extract-Kit 1.0`. Please refer to the [table recognition algorithm documentation](https:\u002F\u002Fpdf-extract-kit.readthedocs.io\u002Fen\u002Flatest\u002Falgorithm\u002Ftable_recognition.html) for usage instructions!\n- `2024.10.17` 🎉🎉🎉 We are excited to announce that the more accurate and faster layout detection model, [DocLayout-YOLO](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO), has been officially integrated into `PDF-Extract-Kit 1.0`. Please refer to the [layout detection algorithm documentation](https:\u002F\u002Fpdf-extract-kit.readthedocs.io\u002Fen\u002Flatest\u002Falgorithm\u002Flayout_detection.html) for usage instructions!\n- `2024.10.10` 🎉🎉🎉 `PDF-Extract-Kit 1.0` is officially released, rebuilt with a modular design for more convenient and flexible model usage! Please switch to the [release\u002F0.1.1](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FPDF-Extract-Kit\u002Ftree\u002Frelease\u002F0.1.1) branch for the old version.\n- `2024.08.01` 🎉🎉🎉 Added the [StructEqTable](demo\u002FTabRec\u002FStructEqTable\u002FREADME_TABLE.md) module for table content extraction. Give it a try!\n- `2024.07.01` 🎉🎉🎉 We released `PDF-Extract-Kit`, a comprehensive toolkit for high-quality PDF content extraction, including `Layout Detection`, `Formula Detection`, `Formula Recognition`, and `OCR`.\n\n## Performance Demonstration\n\nMany current open-source SOTA models are trained and evaluated on academic datasets, achieving high-quality results only on single document types. 
To enable models to achieve stable and robust high-quality results on diverse documents, we constructed diverse fine-tuning datasets and fine-tuned some SOTA models to obtain practical parsing models. Below are some visual results of the models.\n\n### Layout Detection\n\nWe trained robust `Layout Detection` models using diverse PDF document annotations. Our fine-tuned models achieve accurate extraction results on diverse PDF documents such as papers, textbooks, research reports, and financial reports, and demonstrate high robustness to challenges like blurring and watermarks. The visualization example below shows the inference results of the fine-tuned LayoutLMv3 model.\n \n![](assets\u002Freadme\u002Flayout_example.png)\n\n### Formula Detection\n\nSimilarly, we collected and annotated documents containing formulas in both English and Chinese, and fine-tuned advanced formula detection models. The visualization result below shows the inference results of the fine-tuned YOLO formula detection model:\n\n![](assets\u002Freadme\u002Fmfd_example.png)\n\n### Formula Recognition\n\n[UniMERNet](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FUniMERNet) is an algorithm designed for diverse formula recognition in real-world scenarios. By constructing large-scale training data and carefully designed results, it achieves excellent recognition performance for complex long formulas, handwritten formulas, and noisy screenshot formulas.\n\n### Table Recognition\n\n[StructEqTable](https:\u002F\u002Fgithub.com\u002FUniModal4Reasoning\u002FStructEqTable-Deploy) is a high-efficiency toolkit that converts table images into LaTeX\u002FHTML\u002FMarkdown. 
The latest version, powered by the InternVL2-1B foundation model, improves Chinese recognition accuracy and expands multi-format output options.\n\n#### For more visual and inference results of the models, please refer to the [PDF-Extract-Kit tutorial documentation](xxx).\n\n## Evaluation Metrics\n\nComing Soon!\n\n## Usage Guide\n\n### Environment Setup\n\n```bash\nconda create -n pdf-extract-kit-1.0 python=3.10\nconda activate pdf-extract-kit-1.0\npip install -r requirements.txt\n```\n> **Note:** If your device does not support GPU, please install the CPU version dependencies using `requirements-cpu.txt` instead of `requirements.txt`.\n\n> **Note:** DocLayout-YOLO currently supports installation only from PyPI. If an error occurs during DocLayout-YOLO installation, please install it with `pip3 install doclayout-yolo==0.0.2 --extra-index-url=https:\u002F\u002Fpypi.org\u002Fsimple`.\n\n### Model Download\n\nPlease refer to the [Model Weights Download Tutorial](https:\u002F\u002Fpdf-extract-kit.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fpretrained_model.html) to download the required model weights. Note: You can choose to download all the weights or select specific ones. For detailed instructions, please refer to the tutorial.\n\n### Running Demos\n\n#### Layout Detection Model\n\n```bash \npython scripts\u002Flayout_detection.py --config=configs\u002Flayout_detection.yaml\n```\nLayout detection models support **DocLayout-YOLO** (default model), YOLO-v10, and LayoutLMv3. For YOLO-v10 and LayoutLMv3, please refer to [Layout Detection Algorithm](https:\u002F\u002Fpdf-extract-kit.readthedocs.io\u002Fen\u002Flatest\u002Falgorithm\u002Flayout_detection.html). 
You can view the layout detection results in the `outputs\u002Flayout_detection` folder.\n\n#### Formula Detection Model\n\n```bash \npython scripts\u002Fformula_detection.py --config=configs\u002Fformula_detection.yaml\n```\nYou can view the formula detection results in the `outputs\u002Fformula_detection` folder.\n\n#### OCR Model\n\n```bash \npython scripts\u002Focr.py --config=configs\u002Focr.yaml\n```\nYou can view the OCR results in the `outputs\u002Focr` folder.\n\n#### Formula Recognition Model\n\n```bash \npython scripts\u002Fformula_recognition.py --config=configs\u002Fformula_recognition.yaml\n```\nYou can view the formula recognition results in the `outputs\u002Fformula_recognition` folder.\n\n#### Table Recognition Model\n\n```bash \npython scripts\u002Ftable_parsing.py --config configs\u002Ftable_parsing.yaml\n```\nYou can view the table recognition results in the `outputs\u002Ftable_parsing` folder.\n\n> **Note:** For more details on using the model, please refer to the [PDF-Extract-Kit-1.0 Tutorial](https:\u002F\u002Fpdf-extract-kit.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fpretrained_model.html).\n\n> This project focuses on using models for `high-quality` content extraction from `diverse` documents and does not involve reconstructing extracted content into new documents, such as PDF to Markdown. 
For such needs, please refer to our other GitHub project: [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU).\n\n## To-Do List\n\n- [x] **Table Parsing**: Develop functionality to convert table images into corresponding LaTeX\u002FMarkdown format source code.\n- [ ] **Chemical Equation Detection**: Implement automatic detection of chemical equations.\n- [ ] **Chemical Equation\u002FDiagram Recognition**: Develop models to recognize and parse chemical equations and diagrams.\n- [ ] **Reading Order Sorting Model**: Build a model to determine the correct reading order of text in documents.\n\n**PDF-Extract-Kit** aims to provide high-quality PDF content extraction capabilities. We encourage the community to propose specific and valuable needs and welcome everyone to participate in continuously improving the PDF-Extract-Kit tool to advance research and industry development.\n\n## License\n\nThis project is open-sourced under the [AGPL-3.0](LICENSE) license.\n\nSince this project uses YOLO code and PyMuPDF for file processing, these components require compliance with the AGPL-3.0 license. 
Therefore, to ensure adherence to the licensing requirements of these dependencies, this repository as a whole adopts the AGPL-3.0 license.\n\n## Acknowledgement\n\n   - [LayoutLMv3](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Flayoutlmv3): Layout detection model\n   - [UniMERNet](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FUniMERNet): Formula recognition model\n   - [StructEqTable](https:\u002F\u002Fgithub.com\u002FUniModal4Reasoning\u002FStructEqTable-Deploy): Table recognition model\n   - [YOLO](https:\u002F\u002Fgithub.com\u002Fultralytics\u002Fultralytics): Formula detection model\n   - [PaddleOCR](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleOCR): OCR model\n   - [DocLayout-YOLO](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO): Layout detection model\n\n## Citation\nIf you find our models \u002F code \u002F papers useful in your research, please consider giving ⭐ and citations 📝, thx :)  \n```bibtex\n@article{wang2024mineru,\n  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},\n  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},\n  journal={arXiv preprint arXiv:2409.18839},\n  year={2024}\n}\n\n@misc{zhao2024doclayoutyoloenhancingdocumentlayout,\n      title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception}, \n      author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},\n      year={2024},\n      eprint={2410.12628},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628}, \n}\n\n@misc{wang2024unimernet,\n      title={UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition}, \n      author={Bin Wang and Zhuangcheng Gu and Chao Xu and Bo Zhang and Botian Shi 
and Conghui He},\n      year={2024},\n      eprint={2404.15254},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n\n@article{he2024opendatalab,\n  title={Opendatalab: Empowering general artificial intelligence with open datasets},\n  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},\n  journal={arXiv preprint arXiv:2407.13773},\n  year={2024}\n}\n```\n\n## Star History\n\n\u003Ca>\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=opendatalab\u002FPDF-Extract-Kit&type=Date&theme=dark\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=opendatalab\u002FPDF-Extract-Kit&type=Date\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=opendatalab\u002FPDF-Extract-Kit&type=Date\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n## Related Links\n- [UniMERNet (Real-World Formula Recognition Algorithm)](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FUniMERNet)\n- [LabelU (Lightweight Multimodal Annotation Tool)](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FlabelU)\n- [LabelLLM (Open Source LLM Dialogue Annotation Platform)](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FLabelLLM)\n- [MinerU (One-Stop High-Quality Data Extraction Tool)](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU)\n","PDF-Extract-Kit is an open-source toolkit for efficiently extracting content from complex and diverse PDF documents. It integrates leading document parsing models covering core tasks such as layout detection, formula detection and recognition, and OCR, ensuring high-quality parsing across different document types. The toolkit has a modular design: users can build a variety of applications by modifying configuration files and a small amount of code, which greatly simplifies development. In addition, PDF-Extract-Kit provides comprehensive evaluation benchmarks that help users choose the most suitable model for their needs. The project suits scenarios that require high-quality document processing, such as document content extraction (converting PDF to Markdown), document translation, document Q&A systems, or document assistant applications.",2,"2026-06-11 03:39:41","high_star"]