[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72267":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":27,"discoverSource":28},72267,"gptpdf","CosmosShadow\u002Fgptpdf","CosmosShadow","Using GPT to parse PDF",null,"Python",3555,264,12,14,0,2,7,6,29.27,"MIT License",false,"main",[],"2026-06-12 02:03:00","# gptpdf\n\n\u003Cp align=\"center\">\n\u003Ca href=\"README_CN.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F文档-中文版-blue.svg\" alt=\"CN doc\">\u003C\u002Fa>\n\u003Ca href=\"README.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocument-English-blue.svg\" alt=\"EN doc\">\u003C\u002Fa>\n\u003C\u002Fp>\n\nUse a vision large language model (VLLM, such as GPT-4o) to parse PDFs into Markdown.\n\nOur approach is very simple (only 293 lines of code), yet it can almost perfectly parse typography, math formulas, tables, pictures, charts, etc.\n\nAverage cost per page: $0.013\n\nThis package uses the [GeneralAgent](https:\u002F\u002Fgithub.com\u002FCosmosShadow\u002FGeneralAgent) library to interact with the OpenAI API.\n\n[gptpdf-ui](https:\u002F\u002Fgithub.com\u002Fdaodao97\u002Fgptpdf-ui) is a visual tool built on gptpdf.\n\n## Process steps\n\n1. Use the PyMuPDF library to parse the PDF, find all non-text areas, and mark them, for example:\n\n   ![Non-text areas marked on a PDF page](docs\u002Fdemo.jpg)\n\n2. Use a large vision model (such as GPT-4o) to parse the pages and generate a Markdown file.\n\n## DEMO\n\n1. 
[examples\u002Fattention_is_all_you_need\u002Foutput.md](examples\u002Fattention_is_all_you_need\u002Foutput.md) for PDF [examples\u002Fattention_is_all_you_need.pdf](examples\u002Fattention_is_all_you_need.pdf).\n\n2. [examples\u002Frh\u002Foutput.md](examples\u002Frh\u002Foutput.md) for PDF [examples\u002Frh.pdf](examples\u002Frh.pdf).\n\n## Installation\n\n```bash\npip install gptpdf\n```\n\n## Usage\n\n### Local Usage\n\n```python\nfrom gptpdf import parse_pdf\n\npdf_path = 'example.pdf'  # placeholder: path to the PDF you want to parse\napi_key = 'Your OpenAI API Key'\ncontent, image_paths = parse_pdf(pdf_path, api_key=api_key)\nprint(content)\n```\n\nSee [test\u002Ftest.py](test\u002Ftest.py) for more.\n\n### Google Colab\n\nSee [examples\u002Fgptpdf_Quick_Tour.ipynb](examples\u002Fgptpdf_Quick_Tour.ipynb).\n\n## API\n\n### parse_pdf\n\n**Function**:\n```python\ndef parse_pdf(\n        pdf_path: str,\n        output_dir: str = '.\u002F',\n        prompt: Optional[Dict] = None,\n        api_key: Optional[str] = None,\n        base_url: Optional[str] = None,\n        model: str = 'gpt-4o',\n        verbose: bool = False,\n        gpt_worker: int = 1,\n        **args\n) -> Tuple[str, List[str]]:\n```\n\nParses a PDF file into a Markdown file and returns the Markdown content along with all image paths.\n\n**Parameters**:\n\n- **pdf_path**: *str*  \n  Path to the PDF file.\n\n- **output_dir**: *str*, default: '.\u002F'  \n  Output directory where all images and the Markdown file are stored.\n\n- **api_key**: *Optional[str]*, optional  \n  OpenAI API key. If not provided, the `OPENAI_API_KEY` environment variable is used.\n\n- **base_url**: *Optional[str]*, optional  \n  OpenAI base URL. If not provided, the `OPENAI_BASE_URL` environment variable is used. Change it to call other large-model services that expose an OpenAI-compatible API, such as `GLM-4V`.\n\n- **model**: *str*, default: 'gpt-4o'  \n  A multimodal large model accessed through the OpenAI API format. 
Other models can also be used, for example:\n  - [qwen-vl-max](https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fdashscope\u002Fdeveloper-reference\u002Fcompatibility-of-openai-with-dashscope)\n  - [GLM-4V](https:\u002F\u002Fopen.bigmodel.cn\u002Fdev\u002Fapi#glm-4v)\n  - [Yi-Vision](https:\u002F\u002Fplatform.lingyiwanwu.com\u002Fdocs)\n  - Azure OpenAI (tested): set `base_url` to `https:\u002F\u002Fxxxx.openai.azure.com\u002F`, pass the Azure API key as `api_key`, and set `model` to a name of the form `azure_xxxx`, where `xxxx` is the deployed model name.\n\n- **verbose**: *bool*, default: False  \n  Verbose mode. When enabled, the content parsed by the large model is printed to the command line.\n\n- **gpt_worker**: *int*, default: 1  \n  Number of GPT parsing worker threads. If your machine can handle more concurrency, increase this value to speed up parsing.\n\n- **prompt**: *dict*, optional  \n  If the default prompts in this repository do not work well with the model you are using, you can supply custom prompts. The default prompts are divided into three parts:\n  - `prompt`: Guides the model on how to process and convert the text content in the images.\n  - `rect_prompt`: Handles cases where specific areas (such as tables or images) are marked in the image.\n  - `role_prompt`: Defines the model's role so that it understands it is performing a PDF document parsing task.\n\n  You can pass a dictionary of custom prompts to replace any of these defaults. 
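Here is an example:\n\n```python\nfrom gptpdf import parse_pdf\n\npdf_path = 'example.pdf'  # placeholder: path to the PDF you want to parse\n\nprompt = {\n    \"prompt\": \"Custom prompt text\",\n    \"rect_prompt\": \"Custom rect prompt\",\n    \"role_prompt\": \"Custom role prompt\"\n}\n\ncontent, image_paths = parse_pdf(\n    pdf_path=pdf_path,\n    output_dir='.\u002Foutput',\n    model=\"gpt-4o\",\n    prompt=prompt,\n    verbose=False,\n)\n```\n\n**`**args`**: Other LLM parameters, such as `temperature`, `top_p`, `max_tokens`, `presence_penalty`, `frequency_penalty`, etc.\n\n## Join Us 👏🏻\n\nScan the QR code below with WeChat to join our group chat or contribute.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\".\u002Fdocs\u002Fwechat.jpg\" alt=\"wechat\" width=\"400\" \u002F>\n\u003C\u002Fp>","gptpdf is a tool that uses GPT models to parse PDF documents into Markdown. It uses a VLLM (such as GPT-4o) to recognize and convert text, math formulas, tables, images, charts, and other content in a document, and the codebase is compact at only 293 lines. The tool uses the PyMuPDF library to identify non-text areas in a PDF and then a large vision model for further parsing, at an average processing cost of only $0.013 per page. gptpdf suits scenarios where PDFs with complex formatting need to be converted into Markdown files that are easy to edit and read, and it is especially useful for academic papers, technical reports, and other documents rich in formatting elements. The project also provides gptpdf-ui, a visual tool built on this library that lets users work with it more intuitively.","2026-06-11 03:41:07","high_star"]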