[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74787":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":34,"readmeContent":35,"aiSummary":36,"trendingCount":16,"starSnapshotCount":16,"syncStatus":37,"lastSyncTime":38,"discoverSource":39},74787,"liteparse","run-llama\u002Fliteparse","run-llama","A fast, helpful, and open-source document parser","https:\u002F\u002Fdevelopers.llamaindex.ai\u002Fliteparse\u002F",null,"Rust",9877,629,34,14,0,64,762,4804,404,114.4,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33],"document-ocr","document-processing","ocr","ocr-recognition","pdf","pdf-parser","text-extraction","2026-06-12 04:01:15","# LiteParse\n\n[![CI](https:\u002F\u002Fgithub.com\u002Frun-llama\u002Fliteparse\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Frun-llama\u002Fliteparse\u002Factions\u002Fworkflows\u002Fci.yml)\n|\n[![npm version](https:\u002F\u002Fimg.shields.io\u002Fnpm\u002Fv\u002F@llamaindex\u002Fliteparse.svg)](https:\u002F\u002Fwww.npmjs.com\u002Fpackage\u002F@llamaindex\u002Fliteparse)\n|\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n|\n[Docs](https:\u002F\u002Fdevelopers.llamaindex.ai\u002Fliteparse\u002F)\n\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F07ba6a82-6bb1-4dea-b0ef-cad7df7d1622\" alt=\"out\" width=\"600\">\n\nLiteParse is a standalone OSS PDF 
parsing tool focused exclusively on **fast and light** parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine. \n\n**Hitting the limits of local parsing?**\nFor complex documents (dense tables, multi-column layouts, charts, handwritten text, or \nscanned PDFs), you'll get significantly better results with [LlamaParse](https:\u002F\u002Fdevelopers.llamaindex.ai\u002Fpython\u002Fcloud\u002Fllamaparse\u002F?utm_source=github&utm_medium=liteparse), \nour cloud-based document parser built for production document pipelines. LlamaParse handles the \nhard stuff so your models see clean, structured data and markdown.\n\n>  👉 [Sign up for LlamaParse free](https:\u002F\u002Fcloud.llamaindex.ai?utm_source=github&utm_medium=liteparse)\n\n## Overview\n\n- **Fast Text Parsing**: Spatial text parsing using PDF.js\n- **Flexible OCR System**:\n  - **Built-in**: Tesseract.js (zero setup, works out of the box!)\n  - **HTTP Servers**: Plug in any OCR server (EasyOCR, PaddleOCR, custom)\n  - **Standard API**: Simple, well-defined OCR API specification\n- **Screenshot Generation**: Generate high-quality page screenshots for LLM agents\n- **Multiple Output Formats**: JSON and Text\n- **Bounding Boxes**: Precise text positioning information\n- **Standalone Binary**: No cloud dependencies, runs entirely locally\n- **Multi-platform**: Linux, macOS (Intel\u002FARM), Windows\n\n## Installation\n\n### CLI Tool\n\n#### Option 1: Global Install (Recommended)\n\nInstall globally via npm to use the `lit` command anywhere:\n\n```bash\nnpm i -g @llamaindex\u002Fliteparse\n```\n\nThen use it:\n\n```bash\nlit parse document.pdf\nlit screenshot document.pdf\n```\n\nFor macOS and Linux users, `liteparse` can also be installed via `brew`:\n\n```bash\nbrew tap run-llama\u002Fliteparse\nbrew install llamaindex-liteparse\n```\n\n#### Option 2: Install from Source\n\nYou can clone the repo and 
install the CLI globally from source:\n\n```\ngit clone https:\u002F\u002Fgithub.com\u002Frun-llama\u002Fliteparse.git\ncd liteparse\nnpm run build\nnpm pack\nnpm install -g .\u002Fllamaindex-liteparse-*.tgz\n```\n\n### Agent Skill\n\nYou can use `liteparse` as an agent skill, downloading it with the `skills` CLI tool:\n\n```bash\nnpx skills add run-llama\u002Fllamaparse-agent-skills --skill liteparse\n```\n\nOr copy the [`SKILL.md`](https:\u002F\u002Fgithub.com\u002Frun-llama\u002Fllamaparse-agent-skills\u002Fblob\u002Fmain\u002Fskills\u002Fliteparse\u002FSKILL.md) file into your own skills setup.\n\n## Usage\n\n### Parse Files\n\n```bash\n# Basic parsing\nlit parse document.pdf\n\n# Parse with specific format\nlit parse document.pdf --format json -o output.json\n\n# Parse specific pages\nlit parse document.pdf --target-pages \"1-5,10,15-20\"\n\n# Parse without OCR\nlit parse document.pdf --no-ocr\n\n# Parse a remote PDF\ncurl -sL https:\u002F\u002Fexample.com\u002Freport.pdf | lit parse -\n```\n\n### Batch Parsing\n\nYou can also parse an entire directory of documents:\n\n```bash\nlit batch-parse .\u002Finput-directory .\u002Foutput-directory\n```\n\n### Generate Screenshots\n\nScreenshots are essential for LLM agents to extract visual information that text alone cannot capture.\n\n```bash\n# Screenshot all pages\nlit screenshot document.pdf -o .\u002Fscreenshots\n\n# Screenshot specific pages\nlit screenshot document.pdf --target-pages \"1,3,5\" -o .\u002Fscreenshots\n\n# Custom DPI\nlit screenshot document.pdf --dpi 300 -o .\u002Fscreenshots\n\n# Screenshot page range\nlit screenshot document.pdf --target-pages \"1-10\" -o .\u002Fscreenshots\n```\n\n### Library Usage\n\nInstall as a dependency in your project:\n\n```bash\nnpm install @llamaindex\u002Fliteparse\n# or\npnpm add @llamaindex\u002Fliteparse\n```\n\n```typescript\nimport { LiteParse } from '@llamaindex\u002Fliteparse';\n\nconst parser = new LiteParse({ ocrEnabled: true });\nconst result = await 
parser.parse('document.pdf');\nconsole.log(result.text);\n```\n\n#### Buffer \u002F Uint8Array Input\n\nYou can pass raw bytes directly instead of a file path, which is useful for remote files:\n\n```typescript\nimport { LiteParse } from '@llamaindex\u002Fliteparse';\nimport { readFile } from 'fs\u002Fpromises';\n\nconst parser = new LiteParse();\n\n\u002F\u002F From a file read\nconst pdfBytes = await readFile('document.pdf');\nconst result = await parser.parse(pdfBytes);\n\n\u002F\u002F From an HTTP response\nconst response = await fetch('https:\u002F\u002Fexample.com\u002Fdocument.pdf');\nconst buffer = Buffer.from(await response.arrayBuffer());\nconst result2 = await parser.parse(buffer);\n```\n\nNon-PDF buffers (images, Office documents) are written to a temp directory for format conversion. Screenshots also work with buffer input:\n\n```typescript\nconst screenshots = await parser.screenshot(pdfBytes, [1, 2, 3]);\n```\n\n### Browser Usage\n\nLiteParse's core parsing engine (PDF.js text extraction, grid projection, OCR via Tesseract.js) can run in the browser. 
Since the library has Node-only dependencies (sharp, fs, child_process), you'll need a bundler like Vite to swap those out with browser stubs.\n\n#### Vite Configuration\n\nThe key is a Vite plugin that redirects Node-only source files to browser-safe replacements, plus `resolve.alias` entries that stub out Node built-in modules:\n\n```typescript\n\u002F\u002F vite.config.ts\nimport { defineConfig, type Plugin } from \"vite\";\nimport { resolve, dirname } from \"node:path\";\n\n\u002F\u002F Node-only files → browser stubs (you write these)\nconst FILE_REDIRECTS = [\n  { match: \u002F\\\u002Fengines\\\u002Fpdf\\\u002Fpdfium-renderer(\\.js|\\.ts)?$\u002F, target: \"stubs\u002Fpdfium-renderer.ts\" },\n  { match: \u002F\\\u002Fengines\\\u002Fpdf\\\u002FpdfjsImporter(\\.js|\\.ts)?$\u002F,   target: \"stubs\u002FpdfjsImporter.ts\" },\n  { match: \u002F\\\u002Fengines\\\u002Focr\\\u002Fhttp-simple(\\.js|\\.ts)?$\u002F,     target: \"stubs\u002Fhttp-simple.ts\" },\n  { match: \u002F\\\u002Fconversion\\\u002FconvertToPdf(\\.js|\\.ts)?$\u002F,      target: \"stubs\u002FconvertToPdf.ts\" },\n  { match: \u002F\\\u002Fprocessing\\\u002FgridDebugLogger(\\.js|\\.ts)?$\u002F,   target: \"stubs\u002FgridDebugLogger.ts\" },\n  { match: \u002F\\\u002Fprocessing\\\u002FgridVisualizer(\\.js|\\.ts)?$\u002F,    target: \"stubs\u002FgridVisualizer.ts\" },\n];\n\nfunction liteparseNodeRedirects(): Plugin {\n  return {\n    name: \"liteparse-node-redirects\",\n    enforce: \"pre\",\n    async resolveId(source, importer) {\n      if (!importer) return null;\n      const abs = source.startsWith(\".\") ? 
resolve(dirname(importer), source) : source;\n      for (const { match, target } of FILE_REDIRECTS) {\n        if (match.test(abs) || match.test(source)) return resolve(target);\n      }\n      return null;\n    },\n  };\n}\n\nexport default defineConfig({\n  plugins: [liteparseNodeRedirects()],\n  optimizeDeps: { include: [\"tesseract.js\"] },\n  resolve: {\n    alias: [\n      { find: \"node:fs\u002Fpromises\", replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"node:fs\",          replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"node:url\",         replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"node:path\",        replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"node:os\",          replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"node:child_process\", replacement: \"stubs\u002Fempty.ts\" },\n      { find: \u002F^fs$\u002F,             replacement: \"stubs\u002Fempty.ts\" },\n      { find: \u002F^path$\u002F,           replacement: \"stubs\u002Fempty.ts\" },\n      { find: \u002F^os$\u002F,             replacement: \"stubs\u002Fempty.ts\" },\n      { find: \u002F^child_process$\u002F,  replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"form-data\",        replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"axios\",            replacement: \"stubs\u002Fempty.ts\" },\n      { find: \"file-type\",        replacement: \"stubs\u002Ffile-type.ts\" },\n    ],\n  },\n});\n```\n\nSee [`scripts\u002Fbrowser-compat\u002F`](scripts\u002Fbrowser-compat\u002F) for a complete working example with all the stub files.\n\n#### What works in the browser\n\n- PDF parsing from `Uint8Array` input (use `file.arrayBuffer()` to get bytes from a `\u003Cinput type=\"file\">`)\n- OCR via Tesseract.js (runs in Web Workers, fetches language data from CDN on first use)\n- Text and JSON output formats\n\n#### What doesn't work\n\n- File path input (pass `Uint8Array` instead)\n- DOCX\u002FXLSX\u002FPPTX\u002Fimage conversion (requires 
LibreOffice\u002FImageMagick)\n- HTTP OCR server backend\n- Screenshots (these use PDFium + sharp, which are native Node addons)\n\n### CLI Options\n\n#### Parse Command\n\n```\n$ lit parse --help\nUsage: lit parse [options] \u003Cfile>\n\nParse a document file (PDF, DOCX, XLSX, PPTX, images, etc.)\n\nOptions:\n  -o, --output \u003Cfile>     Output file path\n  --format \u003Cformat>       Output format: json|text (default: \"text\")\n  --ocr-server-url \u003Curl>  HTTP OCR server URL (uses Tesseract if not provided)\n  --no-ocr                Disable OCR\n  --ocr-language \u003Clang>   OCR language(s) (default: \"en\")\n  --num-workers \u003Cn>       Number of pages to OCR in parallel (default: CPU cores - 1)\n  --max-pages \u003Cn>         Max pages to parse (default: \"10000\")\n  --target-pages \u003Cpages>  Target pages (e.g., \"1-5,10,15-20\")\n  --dpi \u003Cdpi>             DPI for rendering (default: \"150\")\n  --no-precise-bbox       Disable precise bounding boxes\n  --preserve-small-text   Preserve very small text\n  --password \u003Cpassword>   Password for encrypted\u002Fprotected documents\n  --config \u003Cfile>         Config file (JSON)\n  -q, --quiet             Suppress progress output\n  -h, --help              display help for command\n```\n\n#### Batch Parse Command\n\n```\n$ lit batch-parse --help\nUsage: lit batch-parse [options] \u003Cinput-dir> \u003Coutput-dir>\n\nParse multiple documents in batch mode (reuses PDF engine for efficiency)\n\nOptions:\n  --format \u003Cformat>       Output format: json|text (default: \"text\")\n  --ocr-server-url \u003Curl>  HTTP OCR server URL (uses Tesseract if not provided)\n  --no-ocr                Disable OCR\n  --ocr-language \u003Clang>   OCR language(s) (default: \"en\")\n  --num-workers \u003Cn>       Number of pages to OCR in parallel (default: CPU cores - 1)\n  --max-pages \u003Cn>         Max pages to parse per file (default: \"10000\")\n  --dpi \u003Cdpi>             DPI for rendering (default: 
\"150\")\n  --no-precise-bbox       Disable precise bounding boxes\n  --recursive             Recursively search input directory\n  --extension \u003Cext>       Only process files with this extension (e.g., \".pdf\")\n  --password \u003Cpassword>   Password for encrypted\u002Fprotected documents (applied to all files)\n  --config \u003Cfile>         Config file (JSON)\n  -q, --quiet             Suppress progress output\n  -h, --help              display help for command\n```\n\n#### Screenshot Command\n\n```\n$ lit screenshot --help\nUsage: lit screenshot [options] \u003Cfile>\n\nGenerate screenshots of PDF pages\n\nOptions:\n  -o, --output-dir \u003Cdir>  Output directory for screenshots (default: \".\u002Fscreenshots\")\n  --target-pages \u003Cpages>  Page numbers to screenshot (e.g., \"1,3,5\" or \"1-5\")\n  --dpi \u003Cdpi>             DPI for rendering (default: \"150\")\n  --format \u003Cformat>       Image format: png|jpg (default: \"png\")\n  --password \u003Cpassword>   Password for encrypted\u002Fprotected documents\n  --config \u003Cfile>         Config file (JSON)\n  -q, --quiet             Suppress progress output\n  -h, --help              display help for command\n```\n\n## OCR Setup\n\n### Default: Tesseract.js\n\n```bash\n# Tesseract is enabled by default\nlit parse document.pdf\n\n# Specify language\nlit parse document.pdf --ocr-language fra\n\n# Disable OCR\nlit parse document.pdf --no-ocr\n```\n\nBy default, Tesseract.js downloads language data from the internet on first use. 
For offline or air-gapped environments, set the `TESSDATA_PREFIX` environment variable to a directory containing pre-downloaded `.traineddata` files:\n\n```bash\nexport TESSDATA_PREFIX=\u002Fpath\u002Fto\u002Ftessdata\nlit parse document.pdf --ocr-language eng\n```\n\nYou can also pass `tessdataPath` in the library config:\n\n```typescript\nconst parser = new LiteParse({ tessdataPath: '\u002Fpath\u002Fto\u002Ftessdata' });\n```\n\n### Optional: HTTP OCR Servers\n\nFor higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines:\n\n- [EasyOCR](ocr\u002Feasyocr\u002FREADME.md)\n- [PaddleOCR](ocr\u002Fpaddleocr\u002FREADME.md)\n\nYou can integrate any OCR service by implementing the simple LiteParse OCR API specification (see [`OCR_API_SPEC.md`](OCR_API_SPEC.md)).\n\nThe API requires:\n- POST `\u002Focr` endpoint\n- Accepts `file` and `language` parameters\n- Returns JSON: `{ results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }`\n\nSee the example servers in `ocr\u002Feasyocr\u002F` and `ocr\u002Fpaddleocr\u002F` as templates.\n\nFor the complete OCR API specification, see [`OCR_API_SPEC.md`](OCR_API_SPEC.md).\n\n## Multi-Format Input Support\n\nLiteParse supports **automatic conversion** of various document formats to PDF before parsing. 
This makes it unique compared to other PDF-only parsing tools!\n\n### Supported Input Formats\n\n#### Office Documents (via LibreOffice)\n- **Word**: `.doc`, `.docx`, `.docm`, `.odt`, `.rtf`\n- **PowerPoint**: `.ppt`, `.pptx`, `.pptm`, `.odp`\n- **Spreadsheets**: `.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv`\n\nJust install the dependency and LiteParse will automatically convert these formats to PDF for parsing:\n\n```bash\n# macOS\nbrew install --cask libreoffice\n\n# Ubuntu\u002FDebian\napt-get install libreoffice\n\n# Windows\nchoco install libreoffice-fresh # might require admin permissions\n```\n\n> _For Windows, you might need to add the directory containing the LibreOffice CLI executable (generally `C:\\Program Files\\LibreOffice\\program`) to the `PATH` environment variable and restart the machine._\n\n#### Images (via ImageMagick)\n- **Formats**: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg`\n\nJust install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):\n\n```bash\n# macOS\nbrew install imagemagick\n\n# Ubuntu\u002FDebian\napt-get install imagemagick\n\n# Windows\nchoco install imagemagick.app # might require admin permissions\n```\n\n## Environment Variables\n\n| Variable | Description |\n|----------|-------------|\n| `TESSDATA_PREFIX` | Path to a directory containing Tesseract `.traineddata` files. Used for offline\u002Fair-gapped environments where Tesseract.js cannot download language data from the internet. |\n| `LITEPARSE_TMPDIR` | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory (`os.tmpdir()`). Useful in containerized or read-only filesystem environments. |\n\n## Configuration\n\nYou can configure parsing options via CLI flags or a JSON config file. 
The config file allows you to set sensible defaults and override as needed.\n\n### Config File Example\n\nCreate a `liteparse.config.json` file:\n\n```json\n{\n  \"ocrLanguage\": \"en\",\n  \"ocrEnabled\": true,\n  \"maxPages\": 1000,\n  \"dpi\": 150,\n  \"outputFormat\": \"json\",\n  \"preciseBoundingBox\": true,\n  \"preserveVerySmallText\": false,\n  \"password\": \"optional_password\"\n}\n```\n\nFor HTTP OCR servers, just add `ocrServerUrl`:\n\n```json\n{\n  \"ocrServerUrl\": \"http:\u002F\u002Flocalhost:8828\u002Focr\",\n  \"ocrLanguage\": \"en\",\n  \"outputFormat\": \"json\"\n}\n```\n\nUse with:\n\n```bash\nlit parse document.pdf --config liteparse.config.json\n```\n\n## Development\n\nWe provide a fairly rich `AGENTS.md`\u002F`CLAUDE.md` that we recommend using to help with development and coding agents.\n\n```bash\n# Install dependencies\nnpm install\n\n# Build TypeScript (Linux\u002FmacOS)\nnpm run build\n\n# Build TypeScript (Windows)\nnpm run build:windows\n\n# Watch mode\nnpm run dev\n\n# Test parsing\nnpm test\n```\n\n## License\n\nApache 2.0\n\n## Credits\n\nBuilt on top of:\n\n- [PDF.js](https:\u002F\u002Fgithub.com\u002Fmozilla\u002Fpdf.js) - PDF parsing engine\n- [Tesseract.js](https:\u002F\u002Fgithub.com\u002Fnaptha\u002Ftesseract.js) - In-process OCR engine\n- [EasyOCR](https:\u002F\u002Fgithub.com\u002FJaidedAI\u002FEasyOCR) - HTTP OCR server (optional)\n- [PaddleOCR](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleOCR) - HTTP OCR server (optional)\n- [Sharp](https:\u002F\u002Fgithub.com\u002Flovell\u002Fsharp) - Image processing\n","LiteParse is a fast, lightweight, open-source document parsing tool focused on efficient parsing of PDF files. It uses PDF.js for spatial text parsing and provides a flexible OCR system, including built-in Tesseract.js and support for plugging in any HTTP OCR server, and it also generates high-quality page screenshots with precise bounding box information. LiteParse runs entirely locally with no cloud dependencies, supports multiple output formats (JSON and plain text), and works on Linux, macOS, and Windows. It is well suited to scenarios that require processing PDF documents locally and extracting their text, such as office automation and data mining.",2,"2026-06-11 03:50:50","high_star"]