[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71952":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},71952,"chandra","datalab-to\u002Fchandra","datalab-to","OCR model that handles complex tables, forms, handwriting with full layout.","https:\u002F\u002Fwww.datalab.to",null,"Python",11170,1158,78,41,0,48,147,620,144,119.19,"Apache License 2.0",false,"master",true,[27,28],"ai","ocr","2026-06-12 04:01:02","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdatalab-logo.png\" alt=\"Datalab Logo\" width=\"150\"\u002F>\n\u003C\u002Fp>\n\u003Ch1 align=\"center\">Datalab\u003C\u002Fh1>\n\u003Cp align=\"center\">\n  \u003Cstrong>State of the Art models for Document Intelligence\u003C\u002Fstrong>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg\" alt=\"Code License\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.datalab.to\u002Fpricing\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20License-OpenRAIL--M-blue.svg\" alt=\"Model License\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FKuZwXNGnfH\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join%20us-5865F2?logo=discord&logoColor=white\" 
alt=\"Discord\">\u003C\u002Fa>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.datalab.to\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHomepage-datalab.to-blue\" alt=\"Homepage\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdocumentation.datalab.to\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocs-Read%20the%20docs-blue\" alt=\"Docs\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.datalab.to\u002Fplayground\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPlayground-Try%20it-orange\" alt=\"Public Playground\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Chr\u002F>\n\n# Chandra OCR 2\n\nChandra OCR 2 is a state of the art OCR model that converts images and PDFs into structured HTML\u002FMarkdown\u002FJSON while preserving layout information.\n\n## Try Chandra on Datalab\n\nOur managed platform runs an improved Chandra with higher accuracy than the open weights, zero data retention by default, SOC 2 Type 2, and custom BAAs.\n\nIf you have high volume workloads, we offer a batch processing service that has processed 200M+ pages per week — we manage the infrastructure so your workloads finish on time.\n\nGet started with **$5 in free credits** — [sign up](https:\u002F\u002Fwww.datalab.to\u002F?utm_source=gh-chandra) — takes under 30 seconds — or try Chandra in our [public playground](https:\u002F\u002Fwww.datalab.to\u002Fplayground?utm_source=gh-chandra).\n\nCommercial self-hosting requires a license — see [Commercial usage](#commercial-usage). 
For on-prem licensing, [contact us](https:\u002F\u002Fwww.datalab.to\u002Fcontact?utm_source=gh-chandra-onprem).\n\n## News\n\n- 3\u002F2026 - Chandra 2 is here with significant improvements to math, tables, layout, and multilingual OCR\n- 10\u002F2025 - Chandra 1 launched\n\n## Features\n\n- Tops external olmocr benchmark and significant improvement in internal multilingual benchmarks\n- Convert documents to markdown, html, or json with detailed layout information\n- Support for 90+ languages ([benchmark below](#multilingual-benchmark-table))\n- Excellent handwriting support\n- Reconstructs forms accurately, including checkboxes\n- Strong performance with tables, math, and complex layouts\n- Extracts images and diagrams, and adds captions and structured data\n- Two inference modes: local (HuggingFace) and remote (vLLM server)\n\n\u003Cimg src=\"assets\u002Fexamples\u002Fmath\u002Fhandwritten_math.png\" width=\"600px\"\u002F>\n\n## Quickstart\n\nThe easiest way to start is with the CLI tools:\n\n```shell\npip install chandra-ocr\n\n# With vLLM (recommended, lightweight install)\nchandra_vllm\nchandra input.pdf .\u002Foutput\n\n# With HuggingFace (requires torch)\npip install chandra-ocr[hf]\nchandra input.pdf .\u002Foutput --method hf\n\n# Interactive streamlit app\npip install chandra-ocr[app]\nchandra_app\n```\n\n## Benchmarks\n\nMultilingual performance was a focus for us with Chandra 2.  There isn't a good public multilingual OCR benchmark, so we made our own.  This tests tables, math, ordering, layout, and text accuracy.\n\n\u003Cimg src=\"assets\u002Fbenchmarks\u002Fmultilingual.png\" width=\"600px\"\u002F>\n\nSee full scores [below](#multilingual-benchmark-table). 
We also have a [full 90-language benchmark](FULL_BENCHMARKS.md).\n\nWe also benchmarked Chandra 2 with the widely accepted olmocr benchmark:\n\n\u003Cimg src=\"assets\u002Fbenchmarks\u002Fbench.png\" width=\"600px\"\u002F>\n\nSee full scores [below](#benchmark-table).\n\n## Examples\n\n| Type | Name                     | Link                                                                                                        |\n|------|--------------------------|-------------------------------------------------------------------------------------------------------------|\n| Math | CS229 Textbook           | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fmath\u002Fcs229.png)                    |\n| Math | Handwritten Math         | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fmath\u002Fhandwritten_math.png)         |\n| Math | Chinese Math             | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fmath\u002Fchinese_math.png)             |\n| Tables | Statistical Distribution | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Ftables\u002Fcomplex_tables.png)         |\n| Tables | Financial Table          | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Ftables\u002Ffinancial_table.png)        |\n| Forms | Registration Form        | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fforms\u002Fhandwritten_form.png)        |\n| Forms | Lease Form               | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fforms\u002Flease_filled.png)            |\n| Handwriting | Cursive Writing   
       | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fhandwriting\u002Fcursive_writing.png)   |\n| Handwriting | Handwritten Notes        | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fhandwriting\u002Fhandwritten_notes.png) |\n| Languages | Arabic                   | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Flanguages\u002Farabic.png)              |\n| Languages | Japanese                 | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Flanguages\u002Fjapanese.png)            |\n| Languages | Hindi                    | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Flanguages\u002Fhindi.png)               |\n| Languages | Russian                  | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Flanguages\u002Frussian.png)             |\n| Other | Charts                   | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fother\u002Fcharts.png)                  |\n| Other | Chemistry                | [View](https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra\u002Fblob\u002Fmaster\u002Fassets\u002Fexamples\u002Fother\u002Fchemistry.png)               |\n\n## Installation\n\n### Package\n\n```bash\n# Base install (for vLLM backend)\npip install chandra-ocr\n\n# With HuggingFace backend (includes torch, transformers)\npip install chandra-ocr[hf]\n\n# With all extras\npip install chandra-ocr[all]\n```\n\nIf you're using the HuggingFace method, we also recommend installing [flash attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) for better 
performance.\n\n### From Source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdatalab-to\u002Fchandra.git\ncd chandra\nuv sync\nsource .venv\u002Fbin\u002Factivate\n```\n\n## Usage\n\n### CLI\n\nProcess single files or entire directories:\n\n```bash\n# Single file, with vllm server (see below for how to launch vllm)\nchandra input.pdf .\u002Foutput --method vllm\n\n# Process all files in a directory with local model\nchandra .\u002Fdocuments .\u002Foutput --method hf\n```\n\n**CLI Options:**\n- `--method [hf|vllm]`: Inference method (default: vllm)\n- `--page-range TEXT`: Page range for PDFs (e.g., \"1-5,7,9-12\")\n- `--max-output-tokens INTEGER`: Max tokens per page\n- `--max-workers INTEGER`: Parallel workers for vLLM\n- `--include-images\u002F--no-images`: Extract and save images (default: include)\n- `--include-headers-footers\u002F--no-headers-footers`: Include page headers\u002Ffooters (default: exclude)\n- `--batch-size INTEGER`: Pages per batch (default: 28 for vllm, 1 for hf)\n\n**Output Structure:**\n\nEach processed file creates a subdirectory with:\n- `\u003Cfilename>.md` - Markdown output\n- `\u003Cfilename>.html` - HTML output\n- `\u003Cfilename>_metadata.json` - Metadata (page info, token count, etc.)\n- Extracted images are saved directly in the output directory\n\n### Streamlit Web App\n\nLaunch the interactive demo for single-page processing:\n\n```bash\nchandra_app\n```\n\n### vLLM Server (Optional)\n\nFor production deployments or batch processing, use the vLLM server:\n\n```bash\nchandra_vllm\n```\n\nThis launches a Docker container with optimized inference settings. 
Configure via environment variables:\n\n- `VLLM_API_BASE`: Server URL (default: `http:\u002F\u002Flocalhost:8000\u002Fv1`)\n- `VLLM_MODEL_NAME`: Model name for the server (default: `chandra`)\n- `VLLM_GPUS`: GPU device IDs (default: `0`)\n\nYou can also start your own vllm server with the `datalab-to\u002Fchandra-ocr-2` model.\n\n### Configuration\n\nSettings can be configured via environment variables or a `local.env` file:\n\n```bash\n# Model settings\nMODEL_CHECKPOINT=datalab-to\u002Fchandra-ocr-2\nMAX_OUTPUT_TOKENS=12384\n\n# vLLM settings\nVLLM_API_BASE=http:\u002F\u002Flocalhost:8000\u002Fv1\nVLLM_MODEL_NAME=chandra\nVLLM_GPUS=0\n```\n\n# Commercial usage\n\nThis code is Apache 2.0, and our model weights use a modified OpenRAIL-M license (free for research, personal use, and startups under $2M funding\u002Frevenue, cannot be used competitively with our API). To remove the OpenRAIL license requirements, or for broader commercial licensing, visit our pricing page [here](https:\u002F\u002Fwww.datalab.to\u002Fpricing?utm_source=gh-chandra).\n\n# Benchmark table\n\n| **Model**                 |  ArXiv   | Old Scans Math |  Tables  | Old Scans | Headers and Footers | Multi column | Long tiny text | Base |    Overall     | Source |\n|:--------------------------|:--------:|:--------------:|:--------:|:---------:|:-------------------:|:------------:|:--------------:|:----:|:--------------:|:------:|\n| Datalab API               | **90.4** | **90.2** | **90.7** | **54.6** |        91.6         |     83.7     |    **92.3**    | **99.9** | **86.7 ± 0.8** | Own benchmarks |\n| Chandra 2                 |   90.2   |   89.3   |   89.9   |   49.8   |        92.5         |     83.5     |      92.1      | 99.6 |   85.9 ± 0.8   | Own benchmarks |\n| dots.ocr 1.5              |   85.9   |   85.5   | **90.7** |   48.2   |        94.0         |   **85.3**   |      81.6      | 99.7 |   83.9         | dots.ocr repo |\n| Chandra 1                 |   82.2   |   80.3   |   88.0   |   
50.4   |        90.8         |     81.2     |    **92.3**    | **99.9** |   83.1 ± 0.9   | Own benchmarks |\n| olmOCR 2                  |   83.0   |   82.3   |   84.9   |   47.7   |      **96.1**       |     83.7     |      81.9      | 99.6 |   82.4         | olmocr repo |\n| dots.ocr                  |   82.1   |   64.2   |   88.3   |   40.9   |        94.1         |     82.4     |      81.2      | 99.5 |   79.1 ± 1.0   | dots.ocr repo |\n| olmOCR v0.3.0             |   78.6   |   79.9   |   72.9   |   43.9   |        95.1         |     77.3     |      81.2      | 98.9 |   78.5 ± 1.1   | olmocr repo |\n| Datalab Marker v1.10.0    |   83.8   |   69.7   |   74.8   |   32.3   |        86.6         |     79.4     |      85.7      | 99.6 |   76.5 ± 1.0   | Own benchmarks |\n| Deepseek OCR              |   75.2   |   72.3   |   79.7   |   33.3   |      **96.1**       |     66.7     |      80.1      | 99.7 |   75.4 ± 1.0   | Own benchmarks |\n| Mistral OCR API           |   77.2   |   67.5   |   60.6   |   29.3   |        93.6         |     71.3     |      77.1      | 99.4 |   72.0 ± 1.1   | olmocr repo |\n| GPT-4o (Anchored)         |   53.5   |   74.5   |   70.0   |   40.7   |        93.8         |     69.3     |      60.6      | 96.8 |   69.9 ± 1.1   | olmocr repo |\n| Qwen 3 VL 8B              |   70.2   |   75.1   |   45.6   |   37.5   |        89.1         |     62.1     |      43.0      | 94.3 |   64.6 ± 1.1   | Own benchmarks |\n| Gemini Flash 2 (Anchored) |   54.5   |   56.1   |   72.1   |   34.2   |        64.7         |     61.5     |      71.5      | 95.6 |   63.8 ± 1.2   | olmocr repo |\n\n\n# Multilingual benchmark table\n\nThe table below covers the 43 most common languages, benchmarked across multiple models. 
For a comprehensive evaluation across 90 languages (Chandra 2 vs Gemini 2.5 Flash only), see the [full 90-language benchmark](#full-90-language-benchmark-table).\n\n| Language | Datalab API | Chandra 2 | Chandra 1 | Gemini 2.5 Flash | GPT-5 Mini |\n|---|:---:|:---:|:---:|:---:|:---:|\n| ar | 67.6% | 68.4% | 34.0% | 84.4% | 55.6% |\n| bn | 85.1% | 72.8% | 45.6% | 55.3% | 23.3% |\n| ca | 88.7% | 85.1% | 84.2% | 88.0% | 78.5% |\n| cs | 88.2% | 85.3% | 84.7% | 79.1% | 78.8% |\n| da | 90.1% | 91.1% | 88.4% | 86.0% | 87.7% |\n| de | 93.8% | 94.8% | 83.0% | 88.3% | 93.8% |\n| el | 89.9% | 85.6% | 85.5% | 83.5% | 82.4% |\n| es | 91.8% | 89.3% | 88.7% | 86.8% | 97.1% |\n| fa | 82.2% | 75.1% | 69.6% | 61.8% | 56.4% |\n| fi | 85.7% | 83.4% | 78.4% | 86.0% | 84.7% |\n| fr | 93.3% | 93.7% | 89.6% | 86.1% | 91.1% |\n| gu | 73.8% | 70.8% | 44.6% | 47.6% | 11.5% |\n| he | 76.4% | 70.4% | 38.9% | 50.9% | 22.3% |\n| hi | 80.5% | 78.4% | 70.2% | 82.7% | 41.0% |\n| hr | 93.4% | 90.1% | 85.9% | 88.2% | 81.3% |\n| hu | 88.1% | 82.1% | 82.5% | 84.5% | 84.8% |\n| id | 91.3% | 91.6% | 86.7% | 88.3% | 89.7% |\n| it | 94.4% | 94.1% | 89.1% | 85.7% | 91.6% |\n| ja | 87.3% | 86.9% | 85.4% | 80.0% | 76.1% |\n| jv | 87.5% | 73.2% | 85.1% | 80.4% | 69.6% |\n| kn | 70.0% | 63.2% | 20.6% | 24.5% | 10.1% |\n| ko | 89.1% | 81.5% | 82.3% | 84.8% | 78.4% |\n| la | 78.0% | 73.8% | 55.9% | 70.5% | 54.6% |\n| ml | 72.4% | 64.3% | 18.1% | 23.8% | 11.9% |\n| mr | 80.8% | 75.0% | 57.0% | 69.7% | 20.9% |\n| nl | 90.0% | 88.6% | 85.3% | 87.5% | 83.8% |\n| no | 89.2% | 90.3% | 85.5% | 87.8% | 87.4% |\n| pl | 93.8% | 91.5% | 83.9% | 89.7% | 90.4% |\n| pt | 97.0% | 95.2% | 84.3% | 89.4% | 90.8% |\n| ro | 86.2% | 84.5% | 82.1% | 76.1% | 77.3% |\n| ru | 88.8% | 85.5% | 88.7% | 82.8% | 72.2% |\n| sa | 57.5% | 51.1% | 33.6% | 44.6% | 12.5% |\n| sr | 95.3% | 90.3% | 82.3% | 89.7% | 83.0% |\n| sv | 91.9% | 92.8% | 82.1% | 91.1% | 92.1% |\n| ta | 82.9% | 77.7% | 50.8% | 53.9% | 8.1% |\n| te | 69.4% | 58.6% | 19.5% | 
33.3% | 9.9% |\n| th | 71.6% | 62.6% | 47.0% | 66.7% | 53.8% |\n| tr | 88.9% | 84.1% | 68.1% | 84.1% | 78.2% |\n| uk | 93.1% | 91.0% | 88.5% | 87.9% | 81.9% |\n| ur | 54.1% | 43.2% | 28.1% | 57.6% | 16.9% |\n| vi | 85.0% | 80.4% | 81.6% | 89.5% | 83.6% |\n| zh | 87.8% | 88.7% | 88.3% | 70.0% | 70.4% |\n| **Average** | **80.4%** | **77.8%** | **69.4%** | **67.6%** | **60.5%** |\n\n# Full 90-language benchmark table\n\nWe also have a more comprehensive evaluation covering 90 languages, comparing Chandra 2 against Gemini 2.5 Flash. The average scores are lower than the 43-language table above because this includes many lower-resource languages. Chandra 2 averages **72.7%** vs Gemini 2.5 Flash at **60.8%**.\n\nSee the [full 90-language results](FULL_BENCHMARKS.md).\n\n## Throughput\n\nBenchmarked with vLLM on a single NVIDIA H100 80GB GPU using a diverse mix of documents (math, tables, scans, multi-column layouts) from the olmOCR benchmark set.  This set is significantly slower to process than typical real-world documents; we estimate 2 pages\u002Fs in real-world usage.\n\n| Configuration | Pages\u002Fsec | Avg Latency | P95 Latency | Failure Rate |\n|---|:---:|:---:|:---:|:---:|\n| vLLM, 96 concurrent sequences | 1.44 | 60s | 156s | 0% |\n\n# Credits\n\nThank you to the following open source projects:\n\n- [Hugging Face Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n- [olmocr](https:\u002F\u002Fgithub.com\u002Fallenai\u002Folmocr)\n- [Qwen 3.5](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3)","Chandra OCR 2 is an advanced optical character recognition model that processes images and PDF files and converts them into structured HTML\u002FMarkdown\u002FJSON with full layout information. Its core features include support for more than 90 languages, strong handwriting recognition, accurate reconstruction of tables and mathematical formulas, and high-precision extraction of complex document layouts. The model is developed in Python and runs in two inference modes: local (HuggingFace) or remote (vLLM server). It suits scenarios that require efficient, accurate extraction of information from scanned documents, such as enterprise document management and organizing academic research materials.",2,"2026-06-11 03:39:38","high_star"]