[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72338":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":22,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},72338,"open-parse","Filimoa\u002Fopen-parse","Filimoa","Improved file parsing for LLM’s","https:\u002F\u002Ffilimoa.github.io\u002Fopen-parse\u002F",null,"Python",3162,141,20,22,0,8,28.46,"MIT License",false,"main",true,[24,25,26,27],"document-parser","document-structure","layout-parsing","table-detection","2026-06-12 02:03:01","\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fsergey-filimonov.nyc3.digitaloceanspaces.com\u002Fopen-parse\u002Fopen-parse-with-text-tp-logo.webp\" width=\"350\" \u002F>\n\u003C\u002Fp>\n\u003Cbr\u002F>\n\n**Easily chunk complex documents the same way a human would.**  \n\nChunking documents is a challenging task that underpins any RAG system.  High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.  
\n\nOpen Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>How is this different from other layout parsers?\u003C\u002Fb>\u003C\u002Fsummary>\n\n  #### ✂️ Text Splitting\n  Text splitting converts a file to raw text and [slices it up](https:\u002F\u002Fdocs.llamaindex.ai\u002Fen\u002Fstable\u002Fapi_reference\u002Fnode_parsers\u002Ftoken_text_splitter\u002F).\n  \n  - You lose the ability to easily overlay the chunk on the original PDF.\n  - You ignore the underlying semantic structure of the file - headings, sections, and bullets represent valuable information.\n  - No support for tables, images or markdown.\n  \n  #### 🤖 ML Layout Parsers\n  There are some fantastic libraries like [layout-parser](https:\u002F\u002Fgithub.com\u002FLayout-Parser\u002Flayout-parser). \n  - While they can identify various elements like text blocks, images, and tables, they are not built to group related content effectively.\n  - They strictly focus on layout parsing - you will need to add another model to extract markdown from the images, parse tables, group nodes, etc.\n  - We've found performance to be sub-optimal on many documents while also being computationally heavy.\n\n  #### 💼 Commercial Solutions\n\n  - Typically priced at ≈ $10 \u002F 1k pages. 
See [here](https:\u002F\u002Fcloud.google.com\u002Fdocument-ai), [here](https:\u002F\u002Faws.amazon.com\u002Ftextract\u002F) and [here](https:\u002F\u002Fwww.reducto.ai\u002F).\n  - Requires sharing your data with a vendor\n\n\u003C\u002Fdetails>\n\n## Highlights\n\n- **🔍 Visually-Driven:** Open-Parse visually analyzes documents for superior LLM input, going beyond naive text splitting.\n- **✍️ Markdown Support:** Basic markdown support for parsing headings, bold and italics.\n- **📊 High-Precision Table Support:** Extract tables into clean Markdown formats with accuracy that surpasses traditional tools.\n    \u003Cdetails>\n  \u003Csummary>\u003Ci>Examples\u003C\u002Fi>\u003C\u002Fsummary>\n  The following examples were parsed with unitable.\n    \u003Cbr\u002F>\n    \u003Cp align=\"center\">\n        \u003Cbr\u002F>\n        \u003Cimg src=\"https:\u002F\u002Fsergey-filimonov.nyc3.digitaloceanspaces.com\u002Fopen-parse\u002Funitable-parsing-sample.webp\" width=\"650\"\u002F>\n    \u003C\u002Fp>\n         \u003Cbr\u002F>\n    \u003C\u002Fdetails>\n\n- **🛠️ Extensible:** Easily implement your own post-processing steps.\n- **💡Intuitive:** Great editor support. Completion everywhere. Less time debugging.\n- **🎯 Easy:** Designed to be easy to use and learn. 
Less time reading docs.\n\n\u003Cbr\u002F>\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fsergey-filimonov.nyc3.digitaloceanspaces.com\u002Fopen-parse\u002Fmarked-up-doc-2.webp\" width=\"250\" \u002F>\n\u003C\u002Fp>\n\n## Example\n\n#### Basic Example\n\n```python\nimport openparse\n\nbasic_doc_path = \".\u002Fsample-docs\u002Fmobile-home-manual.pdf\"\nparser = openparse.DocumentParser()\nparsed_basic_doc = parser.parse(basic_doc_path)\n\nfor node in parsed_basic_doc.nodes:\n    print(node)\n```\n\n**📓 Try the sample notebook** \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1Z5B5gsnmhFKEFL-5yYIcoox7-jQao8Ep?usp=sharing\" class=\"external-link\" target=\"_blank\">here\u003C\u002Fa>\n\n#### Semantic Processing Example\n\nChunking documents is fundamentally about grouping similar semantic nodes together. By embedding the text of each node, we can cluster nodes based on their similarity.\n\n```python\nfrom openparse import processing, DocumentParser\n\nsemantic_pipeline = processing.SemanticIngestionPipeline(\n    openai_api_key=OPEN_AI_KEY,  # your OpenAI API key, defined elsewhere\n    model=\"text-embedding-3-large\",\n    min_tokens=64,\n    max_tokens=1024,\n)\nparser = DocumentParser(\n    processing_pipeline=semantic_pipeline,\n)\nparsed_content = parser.parse(basic_doc_path)\n```\n\n**📓 Sample notebook** \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FFilimoa\u002Fopen-parse\u002Fblob\u002Fmain\u002Fsrc\u002Fcookbooks\u002Fsemantic_processing.ipynb\" class=\"external-link\" target=\"_blank\">here\u003C\u002Fa>\n\n#### Serializing Results\nOpen Parse uses pydantic under the hood, so you can serialize results with:\n\n```python\nparsed_content.dict()\n\n# or serialize to a JSON string\nparsed_content.json()\n```\n\n## Requirements\n\nPython 3.8+\n\n**Dealing with PDFs:**\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fpdfminer\u002Fpdfminer.six\" class=\"external-link\" target=\"_blank\">pdfminer.six\u003C\u002Fa> Fully open 
source.\n\n**Extracting Tables:**\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fpymupdf\u002FPyMuPDF\" class=\"external-link\" target=\"_blank\">PyMuPDF\u003C\u002Fa> has some table detection functionality. Please see their \u003Ca href=\"https:\u002F\u002Fmupdf.com\u002Flicensing\u002Findex.html#commercial\" class=\"external-link\" target=\"_blank\">license\u003C\u002Fa>.\n- \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Ftable-transformer-detection\" class=\"external-link\" target=\"_blank\">Table Transformer\u003C\u002Fa> is a deep learning approach.\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fpoloclub\u002Funitable\" class=\"external-link\" target=\"_blank\">unitable\u003C\u002Fa> is another transformers based approach with **state-of-the-art** performance.\n\n## Installation\n\n#### 1. Core Library\n\n```console\npip install openparse\n```\n\n**Enabling OCR Support**:\n\nPyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, so installation of Tesseract-OCR is still required.\n\nThe language support folder location must be communicated either via storing it in the environment variable \"TESSDATA_PREFIX\", or as a parameter in the applicable functions.\n\nSo for a working OCR functionality, make sure to complete this checklist:\n\n1. Install Tesseract.\n\n2. Locate Tesseract’s language support folder. Typically you will find it here:\n\n   - Windows: `C:\u002FProgram Files\u002FTesseract-OCR\u002Ftessdata`\n\n   - Unix systems: `\u002Fusr\u002Fshare\u002Ftesseract-ocr\u002F5\u002Ftessdata`\n\n   - macOS (installed via Homebrew):\n     - Standard installation: `\u002Fopt\u002Fhomebrew\u002Fshare\u002Ftessdata`\n     - Version-specific installation: `\u002Fopt\u002Fhomebrew\u002FCellar\u002Ftesseract\u002F\u003Cversion>\u002Fshare\u002Ftessdata\u002F`\n\n3. 
Set the environment variable TESSDATA_PREFIX.\n\n   - Windows: `setx TESSDATA_PREFIX \"C:\u002FProgram Files\u002FTesseract-OCR\u002Ftessdata\"`\n\n   - Unix systems: `declare -x TESSDATA_PREFIX=\u002Fusr\u002Fshare\u002Ftesseract-ocr\u002F5\u002Ftessdata`\n\n   - macOS (installed via Homebrew): `export TESSDATA_PREFIX=$(brew --prefix tesseract)\u002Fshare\u002Ftessdata`\n\n**Note:** _On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!_\n\n#### 2. ML Table Detection (Optional)\n\nThis repository provides an optional feature to parse content from tables using a variety of deep learning models.\n\n```console\npip install \"openparse[ml]\"\n```\n\nThen download the model weights with:\n\n```console\nopenparse-download\n```\n\nYou can then run the parsing as follows:\n\n```python\nimport openparse\n\nparser = openparse.DocumentParser(\n    table_args={\n        \"parsing_algorithm\": \"unitable\",\n        \"min_table_confidence\": 0.8,\n    },\n)\nparsed_nodes = parser.parse(pdf_path)  # pdf_path: path to your PDF file\n```\n\nNote that we currently use [table-transformers](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftable-transformer) for all table detection, and we find its performance to be subpar. This negatively affects the downstream results of unitable. If you're aware of a better model, please open an issue - the unitable team mentioned they might add this soon too.\n\n## Cookbooks\n\nhttps:\u002F\u002Fgithub.com\u002FFilimoa\u002Fopen-parse\u002Ftree\u002Fmain\u002Fsrc\u002Fcookbooks\n\n## Documentation\n\nhttps:\u002F\u002Ffilimoa.github.io\u002Fopen-parse\u002F\n","Open Parse is a tool for improving file parsing for large language models. By visually analyzing document layouts, it provides high-quality text chunking, basic Markdown support, and high-precision table extraction, going beyond traditional plain-text splitting. The library is highly extensible, letting users easily add their own processing steps, and its design emphasizes ease of use and intuitiveness, reducing the time developers spend debugging. It suits scenarios that require efficient, accurate parsing of complex documents, such as building applications based on retrieval-augmented generation (RAG) systems. The project is written in Python and released under the MIT license.",2,"2026-06-11 03:41:25","high_star"]