[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-70510":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},70510,"GLM-OCR","zai-org\u002FGLM-OCR","zai-org","GLM-OCR: Accurate ×  Fast × Comprehensive",null,"https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-OCR","Python",6935,638,32,29,0,25,82,442,75,39.42,false,"main",[25,26,27],"glm","image2text","ocr","2026-06-12 02:02:34","## GLM-OCR\n\n[中文阅读](README_zh.md)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=resources\u002Flogo.svg width=\"40%\"\u002F>\n\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n    👋 Join our \u003Ca href=\"resources\u002FWECHAT.md\" target=\"_blank\">WeChat\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FQR7SARHRxK\" target=\"_blank\">Discord\u003C\u002Fa> community\n    \u003Cbr>\n    📖 Check out the GLM-OCR \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10910\" target=\"_blank\">technical report\u003C\u002Fa>\n    \u003Cbr>\n    📍 Use GLM-OCR's \u003Ca href=\"https:\u002F\u002Fdocs.z.ai\u002Fguides\u002Fvlm\u002Fglm-ocr\" target=\"_blank\">API\u003C\u002Fa>\n\u003C\u002Fp>\n\n### Model Introduction\n\nGLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. 
It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.\n\n**Key Features**\n\n- **State-of-the-Art Performance**: Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.\n\n- **Optimized for Real-World Scenarios**: Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts.\n\n- **Efficient Inference**: With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments.\n\n- **Easy to Use**: Fully open-sourced and equipped with a comprehensive [SDK](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-OCR) and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.\n\n### News & Updates\n\n- **[2026.3.12]** GLM-OCR SDK now supports agent-friendly Skill mode — just `pip install glmocr` + set API key, ready to use via CLI or Python with no GPU or YAML config needed. See: [GLM-OCR Skill](skills\u002Fglmocr\u002FSKILL.md)\n- **[2026.3.12]** GLM-OCR Technical Report is now available. 
See: [GLM-OCR Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10910)\n- **[2026.2.12]** Fine-tuning tutorial based on LLaMA-Factory is now available. See: [GLM-OCR Fine-tuning Guide](examples\u002Ffinetune\u002FREADME.md)\n\n### Download Model\n\n| Model   | Download Links                                                                                                              | Precision |\n| ------- | --------------------------------------------------------------------------------------------------------------------------- | --------- |\n| GLM-OCR | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-OCR)\u003Cbr> [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-OCR) | BF16      |\n\n## GLM-OCR SDK\n\nWe provide an SDK for using GLM-OCR more efficiently and conveniently.\n\n### Install SDK\n\nChoose the lightest installation that matches your scenario:\n\n```bash\n# Cloud \u002F MaaS + local images \u002F PDFs (fastest install)\npip install glmocr\n\n# Self-hosted pipeline (layout detection)\npip install \"glmocr[selfhosted]\"\n\n# Flask service support\npip install \"glmocr[server]\"\n```\n\nInstall from source for development:\n\n```bash\n# Install from source\ngit clone https:\u002F\u002Fgithub.com\u002Fzai-org\u002Fglm-ocr.git\ncd glm-ocr\nuv venv --python 3.12 --seed && source .venv\u002Fbin\u002Factivate\nuv pip install -e .\n```\n\n### Model Deployment\n\nGLM-OCR can be used in several ways; choose the option that matches your environment:\n\n#### Option 1: Zhipu MaaS API (Recommended for Quick Start)\n\nUse the hosted cloud API – no GPU needed. The cloud service runs the complete GLM-OCR pipeline internally, so the SDK simply forwards your request and returns the result.\n\n1. Get an API key from https:\u002F\u002Fopen.bigmodel.cn\n2. Configure `config.yaml`:\n\n```yaml\npipeline:\n  maas:\n    enabled: true # Enable MaaS mode\n    api_key: your-api-key # Required\n```\n\nThat's it! 
When `maas.enabled=true`, the SDK acts as a thin wrapper that:\n\n- Forwards your documents to the Zhipu cloud API\n- Returns the results directly (Markdown + JSON layout details)\n- Requires no local processing and no GPU\n\nInput note (MaaS): the upstream API accepts `file` as a URL or a `data:\u003Cmime>;base64,...` data URI.\nIf you have raw base64 without the `data:` prefix, wrap it as a data URI (recommended). The SDK will\nauto-wrap local file paths \u002F bytes \u002F raw base64 into a data URI when calling MaaS.\n\nAPI documentation: https:\u002F\u002Fdocs.bigmodel.cn\u002Fcn\u002Fguide\u002Fmodels\u002Fvlm\u002Fglm-ocr\n\n#### Option 2: Self-host with vLLM \u002F SGLang\n\nDeploy the GLM-OCR model locally for full control. The SDK provides the complete pipeline: layout detection, parallel region OCR, and result formatting.\n\nInstall the self-hosted extra first:\n\n```bash\npip install \"glmocr[selfhosted]\"\n```\n\n##### Using vLLM\n\nInstall vLLM:\n\n```bash\ndocker pull vllm\u002Fvllm-openai:v0.19.0-ubuntu2404\n```\n\nOr install with pip:\n\n```bash\npip install -U \"vllm>=0.19.0\"\n```\n\nLaunch the service:\n\n```bash\npip install \"transformers>=5.3.0\"\n\nvllm serve zai-org\u002FGLM-OCR --port 8080 --speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 3}' --served-model-name glm-ocr\n```\n\n> Note: add `--max-model-len` and `--gpu-memory-utilization` as appropriate for your machine to handle large images and PDFs.\n\n##### Using SGLang\n\nInstall SGLang:\n\n```bash\ndocker pull lmsysorg\u002Fsglang:v0.5.10\n```\n\nOr install with pip:\n\n```bash\npip install \"sglang>=0.5.10\"\n```\n\nLaunch the service:\n\n```bash\nSGLANG_ENABLE_SPEC_V2=1 sglang serve --model-path zai-org\u002FGLM-OCR --port 8080 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --served-model-name glm-ocr\n```\n\n> Note: add `--context-len` and `--mem-fraction-static` as appropriate for your machine to 
handle large images and PDFs.\n\n\n#### Option 3: Ollama\u002FMLX\n\nFor specialized deployment scenarios, see the detailed guides:\n\n- **[Apple Silicon with mlx-vlm](examples\u002Fmlx-deploy\u002FREADME.md)** - Optimized for Apple Silicon Macs\n- **[Ollama Deployment](examples\u002Follama-deploy\u002FREADME.md)** - Simple local deployment with Ollama\n\n#### Option 4: SDK Server + Client (GPU-less Client)\n\nDeploy the SDK Server on a GPU machine, then use any machine as a client; no GPU is needed on the client side. The client connects via the MaaS-compatible protocol, pointing `api_url` at your self-hosted server.\n\n```yaml\n# Client config.yaml\npipeline:\n  maas:\n    enabled: true\n    api_url: http:\u002F\u002F\u003CSERVER_IP>:5002\u002Fglmocr\u002Fparse\n    api_key: any-string    # self-hosted server does not validate keys\n    verify_ssl: false\n```\n\nSee the full guide: **[Self-hosted SDK Server + Client](examples\u002Fself-host\u002FREADME.md)**\n\n#### Update Configuration\n\nAfter launching the service, configure `config.yaml`:\n\n```yaml\npipeline:\n  maas:\n    enabled: false # Disable MaaS mode (default)\n  ocr_api:\n    api_host: localhost # or your vLLM\u002FSGLang server address\n    api_port: 8080\n```\n\n### SDK Usage Guide\n\n#### CLI\n\n```bash\n# Parse a single image\nglmocr parse examples\u002Fsource\u002Fcode.png\n\n# Parse a directory\nglmocr parse examples\u002Fsource\u002F\n\n# Set output directory\nglmocr parse examples\u002Fsource\u002Fcode.png --output .\u002Fresults\u002F\n\n# Use a custom config\nglmocr parse examples\u002Fsource\u002Fcode.png --config my_config.yaml\n\n# Enable debug logging with profiling\nglmocr parse examples\u002Fsource\u002Fcode.png --log-level DEBUG\n\n# Run layout detection on CPU (keep GPU free for OCR model)\nglmocr parse examples\u002Fsource\u002Fcode.png --layout-device cpu\n\n# Run layout detection on a specific GPU\nglmocr parse examples\u002Fsource\u002Fcode.png --layout-device cuda:1\n\n# Override 
any config value via --set (dotted path, repeatable)\nglmocr parse examples\u002Fsource\u002Fcode.png --set pipeline.ocr_api.api_port 8080\nglmocr parse examples\u002Fsource\u002F --set pipeline.layout.use_polygon true --set logging.level DEBUG\n```\n\n#### Python API\n\n```python\nfrom glmocr import GlmOcr, parse\n\n# Simple function\nresult = parse(\"image.png\")\nresult = parse([\"img1.png\", \"img2.jpg\"])\nresult = parse(\"https:\u002F\u002Fexample.com\u002Fimage.png\")\nresult.save(output_dir=\".\u002Fresults\")\n\n# Note: a list is treated as pages of a single document.\n\n# Class-based API\nwith GlmOcr() as parser:\n    result = parser.parse(\"image.png\")\n    print(result.json_result)\n    result.save()\n\n# Place layout model on CPU (useful when GPU is reserved for OCR)\nwith GlmOcr(layout_device=\"cpu\") as parser:\n    result = parser.parse(\"image.png\")\n\n# Place layout model on a specific GPU\nwith GlmOcr(layout_device=\"cuda:1\") as parser:\n    result = parser.parse(\"image.png\")\n```\n\n#### Flask Service\n\nInstall the optional server dependency first:\n\n```bash\npip install \"glmocr[server]\"\n```\n\n```bash\n# Start service\npython -m glmocr.server\n\n# With debug logging\npython -m glmocr.server --log-level DEBUG\n\n# Call API\ncurl -X POST http:\u002F\u002Flocalhost:5002\u002Fglmocr\u002Fparse \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"images\": [\".\u002Fexample\u002Fsource\u002Fcode.png\"]}'\n```\n\nSemantics:\n\n- `images` can be a string or a list.\n- A list is treated as pages of a single document.\n- For multiple independent documents, call the endpoint multiple times (one document per request).\n\n\n### Modular Architecture\n\nGLM-OCR uses composable modules for easy customization:\n\n| Component             | Description                            |\n| --------------------- | -------------------------------------- |\n| `PageLoader`          | Preprocessing and image encoding       |\n| `OCRClient`           | 
Calls the GLM-OCR model service        |\n| `PPDocLayoutDetector` | PP-DocLayout layout detection          |\n| `ResultFormatter`     | Post-processing, outputs JSON\u002FMarkdown |\n\nYou can extend the behavior by creating custom pipelines:\n\n```python\nfrom glmocr.dataloader import PageLoader\nfrom glmocr.ocr_client import OCRClient\nfrom glmocr.postprocess import ResultFormatter\n\n\nclass MyPipeline:\n  def __init__(self, config):\n    self.page_loader = PageLoader(config)\n    self.ocr_client = OCRClient(config)\n    self.formatter = ResultFormatter(config)\n\n  def process(self, request_data):\n    # Implement your own processing logic\n    pass\n```\n\n## Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F?repos=zai-org%2FGLM-OCR&type=date&legend=top-left\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fimage?repos=zai-org\u002FGLM-OCR&type=date&theme=dark&legend=top-left\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fimage?repos=zai-org\u002FGLM-OCR&type=date&legend=top-left\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fimage?repos=zai-org\u002FGLM-OCR&type=date&legend=top-left\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n## Acknowledgement\n\nThis project is inspired by the excellent work of the following projects and communities:\n\n- [PP-DocLayout-V3](https:\u002F\u002Fhuggingface.co\u002FPaddlePaddle\u002FPP-DocLayoutV3)\n- [PaddleOCR](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleOCR)\n- [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU)\n\n## License\n\nThe Code of this repo is under Apache License 2.0.\n\nThe GLM-OCR model is released under the MIT License.\n\nThe complete OCR pipeline integrates [PP-DocLayoutV3](https:\u002F\u002Fhuggingface.co\u002FPaddlePaddle\u002FPP-DocLayoutV3) 
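for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project.\n\n## Citation\n\nIf you find GLM-OCR useful in your research, please cite our technical report:\n\n```bibtex\n@misc{duan2026glmocrtechnicalreport,\n      title={GLM-OCR Technical Report},\n      author={Shuaiqi Duan and Yadong Xue and Weihan Wang and Zhe Su and Huan Liu and Sheng Yang and Guobing Gan and Guo Wang and Zihan Wang and Shengdong Yan and Dexin Jin and Yuxuan Zhang and Guohong Wen and Yanfeng Wang and Yutao Zhang and Xiaohan Zhang and Wenyi Hong and Yukuo Cen and Da Yin and Bin Chen and Wenmeng Yu and Xiaotao Gu and Jie Tang},\n      year={2026},\n      eprint={2603.10910},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10910},\n}\n```\n","GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture. It introduces Multi-Token Prediction (MTP) loss and full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The project integrates the CogViT visual encoder pre-trained on large-scale image-text data, a lightweight cross-modal connector, and efficient token downsampling, paired with a GLM-0.5B language decoder, and combines them with a two-stage pipeline of PP-DocLayout-V3 layout analysis and parallel recognition to deliver robust, high-quality OCR performance across diverse document layouts. The model achieves leading results on benchmarks such as OmniDocBench V1.5 and is particularly suited to challenging layouts in real-world business scenarios, including complex tables, code-heavy documents, and seals. In addition, with only 0.9B parameters, GLM-OCR supports multiple deployment options, significantly reducing inference latency and compute cost, which makes it well suited to high-concurrency services and edge deployments. The project is fully open source and provides a comprehensive SDK and toolchain for easy integration into existing production pipelines.",2,"2026-06-11 03:32:33","trending"]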