[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72035":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},72035,"unstract","Zipstack\u002Funstract","Zipstack","LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows","https:\u002F\u002Funstract.com",null,"Python",6642,630,44,37,0,7,25,76,21,39.4,"GNU Affero General Public License v3.0",false,"main",true,[27,28,29,30,31,32],"api-deployments","data-extraction","document-processing","etl-pipelines","open-source-data-pipeline","unstructured-data-extraction","2026-06-12 02:02:57","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Funstract_u_logo.png\" style=\"height: 120px\">\n  \u003Ch1>Unstract\u003C\u002Fh1>\n  \u003Ch2>Turn Unstructured Documents into Structured Data\u003C\u002Fh2>\n  \u003Cp>\n    \u003Ca href=\"https:\u002F\u002Fdocs.unstract.com\">Documentation\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Funstract.com\u002Fpricing\u002F\">Enterprise\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Cp>\n    \u003Ca href=\"LICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FZipstack\u002Funstract\" alt=\"License\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Fquick_start\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftutorials-docs-brightgreen\" alt=\"Tutorials\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fstatus.unstract.com\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fuptime-status-brightgreen\" alt=\"Uptime Status\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhub.docker.com\u002Fu\u002Funstract\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fdocker\u002Fpulls\u002Funstract\u002Fbackend\" alt=\"Docker Pulls\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fdeepwiki.com\u002FZipstack\u002Funstract\">\u003Cimg src=\"https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg\" alt=\"Ask DeepWiki\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fcla-assistant.io\u002FZipstack\u002Funstract\">\u003Cimg src=\"https:\u002F\u002Fcla-assistant.io\u002Freadme\u002Fbadge\u002FZipstack\u002Funstract\" alt=\"CLA assistant\">\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Cp>\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpython\u002Frequired-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2FZipstack%2Funstract%2Frefs%2Fheads%2Fmain%2Fpyproject.toml\" alt=\"Python Version from PEP 621 TOML\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fraw.githubusercontent.com\u002Fastral-sh\u002Fuv\u002Fmain\u002Fassets\u002Fbadge\u002Fv0.json\" alt=\"uv\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fvite.dev\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVite-6.x-646CFF?logo=vite&logoColor=white\" alt=\"Vite\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fbun.sh\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBun-1.x-000000?logo=bun&logoColor=white\" alt=\"Bun\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fbiomejs.dev\u002F\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBiome-2.x-60A5FA?logo=biome&logoColor=white\" alt=\"Biome\">\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Cp>\n    \u003Ca href=\"https:\u002F\u002Fresults.pre-commit.ci\u002Flatest\u002Fgithub\u002FZipstack\u002Funstract\u002Fmain\">\u003Cimg src=\"https:\u002F\u002Fresults.pre-commit.ci\u002Fbadge\u002Fgithub\u002FZipstack\u002Funstract\u002Fmain.svg\" alt=\"pre-commit.ci status\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fsonarcloud.io\u002Fsummary\u002Fnew_code?id=Zipstack_unstract\">\u003Cimg src=\"https:\u002F\u002Fsonarcloud.io\u002Fapi\u002Fproject_badges\u002Fmeasure?project=Zipstack_unstract&metric=alert_status\" alt=\"Quality Gate Status\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fsonarcloud.io\u002Fsummary\u002Fnew_code?id=Zipstack_unstract\">\u003Cimg src=\"https:\u002F\u002Fsonarcloud.io\u002Fapi\u002Fproject_badges\u002Fmeasure?project=Zipstack_unstract&metric=code_smells\" alt=\"Code Smells\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fsonarcloud.io\u002Fsummary\u002Fnew_code?id=Zipstack_unstract\">\u003Cimg src=\"https:\u002F\u002Fsonarcloud.io\u002Fapi\u002Fproject_badges\u002Fmeasure?project=Zipstack_unstract&metric=duplicated_lines_density\" alt=\"Duplicated Lines (%)\">\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## What is Unstract?\n\nUnstract uses LLMs to extract structured JSON from documents — PDFs, images, scans, you name it. Define what you want to extract using natural language prompts, and deploy as an API or ETL pipeline.\n\nBuilt for teams in **finance**, **insurance**, **healthcare**, **KYC\u002Fcompliance**, and much more.\n\n## Current State vs. 
Unstract\n\n| Task | Without Unstract | With Unstract |\n|------|------------------|---------------|\n| Schema definition | Write regex, build templates per vendor | Write a prompt once, handles variations |\n| New document type | Days of development | Minutes in Prompt Studio |\n| LLM integration | Build your own pipeline | Plug in any provider (OpenAI, Anthropic, Bedrock, Ollama) |\n| Deployment | Custom infrastructure | `.\u002Frun-platform.sh` or managed cloud |\n| Output | Unstructured text blobs | Clean JSON, ready for your database |\n\n> ⭐ If Unstract helps you, star this repo!\n>\n> ![Star Unstract](docs\u002Fassets\u002Fgithub_star.gif)\n\n## ✨ Key Features\n\n**Prompt Studio** — Define document extraction schemas with natural language. [Docs →](https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Ffeatures\u002Fprompt_studio\u002Fprompt_studio_intro\u002F)\n\n![Prompt Studio](docs\u002Fassets\u002Fprompt_studio.gif)\n\n**API Deployment** — Send a document over REST API, get JSON back. [Docs →](https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Fapi_deployment\u002Funstract_api_deployment_intro\u002F)\n\n![API Deployment](docs\u002Fassets\u002Fapi_deployment.gif)\n\n**ETL Pipeline** — Pull documents from a folder, process them, load to your warehouse. [Docs →](https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Fetl_pipeline\u002Funstract_etl_pipeline_intro\u002F)\n\n**MCP Server** — Connect to AI agents (Claude, etc.) via Model Context Protocol. [Docs →](https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Fmcp\u002Funstract_platform_mcp_server\u002F)\n\n**n8n Node** — Drop into existing automation workflows. 
[Docs →](https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Fapi_deployment\u002Funstract_api_deployment_n8n_custom_node\u002F)\n\n## 🚀 Quickstart (~5 mins)\n\n### System Requirements & Prerequisites\n\n- Linux or macOS (Intel or M-series)\n- Docker & Docker Compose\n- 8 GB RAM minimum\n- Git\n\n### Run Locally\n\n```bash\n# Clone and start\ngit clone https:\u002F\u002Fgithub.com\u002FZipstack\u002Funstract.git\ncd unstract\n.\u002Frun-platform.sh\n```\n\nThat's it!\n\n- Visit [http:\u002F\u002Ffrontend.unstract.localhost](http:\u002F\u002Ffrontend.unstract.localhost) in your browser\n- Login with username: `unstract` password: `unstract`\n- Start extracting data!\n\n## 📦 Other Deployment Options\n\n### Docker Compose\n\n```bash\n# Pull and run entire Unstract platform with default env config.\n.\u002Frun-platform.sh\n\n# Pull and run docker containers with a specific version tag.\n.\u002Frun-platform.sh -v v0.1.0\n\n# Upgrade existing Unstract platform setup by pulling the latest available version.\n.\u002Frun-platform.sh -u\n\n# Upgrade existing Unstract platform setup by pulling a specific version.\n.\u002Frun-platform.sh -u -v v0.2.0\n\n# Build docker images locally as a specific version tag.\n.\u002Frun-platform.sh -b -v v0.1.0\n\n# Build docker images locally from working branch as `current` version tag.\n.\u002Frun-platform.sh -b -v current\n\n# Display the help information.\n.\u002Frun-platform.sh -h\n\n# Only do setup of environment files.\n.\u002Frun-platform.sh -e\n\n# Only do docker images pull with a specific version tag.\n.\u002Frun-platform.sh -p -v v0.1.0\n\n# Only do docker images pull by building locally with a specific version tag.\n.\u002Frun-platform.sh -p -b -v v0.1.0\n\n# Upgrade existing Unstract platform setup with docker images built locally from working branch as `current` version tag.\n.\u002Frun-platform.sh -u -b -v current\n\n# Pull and run docker containers in detached mode.\n.\u002Frun-platform.sh -d -v 
v0.1.0\n```\n\n## 🔐 Backup Encryption Key\n\n> [!WARNING]\n> This key encrypts adapter credentials — losing it makes existing adapters inaccessible!\n\nCopy the value of `ENCRYPTION_KEY` from `backend\u002F.env` or `platform-service\u002F.env` to a secure location.\n\n## 🏗️ Unstract Architecture\n\n```text\n┌────────────────────────────────────────────────────────────┐\n│                          Unstract                          │\n├─────────────┬─────────────┬─────────────┬──────────────────┤\n│  Frontend   │   Backend   │   Worker    │ Platform Service │\n│  (React)    │  (Django)   │  (Celery)   │   (FastAPI)      │\n├─────────────┴─────────────┴─────────────┴──────────────────┤\n│                      Cache (Redis)                         │\n├────────────────────────────────────────────────────────────┤\n│                  Message Queue (RabbitMQ)                  │\n├────────────────────────────────────────────────────────────┤\n│                   Database (PostgreSQL)                    │\n├────────────────────────────────────────────────────────────┤\n│  LLM Adapters    │  Vector DBs    │  Text Extractors       │\n│  (OpenAI, etc.)  │ (Qdrant, etc.) 
│  (LLMWhisperer)        │\n└────────────────────────────────────────────────────────────┘\n```\n\nAlso see [architecture](docs\u002FARCHITECTURE.md).\n\n## 📄 Document File Formats\n\n| Category | Formats |\n|----------|---------|\n| Documents | PDF, DOCX, DOC, ODT, TXT, CSV, JSON |\n| Spreadsheets | XLSX, XLS, ODS |\n| Presentations | PPTX, PPT, ODP |\n| Images | PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP |\n\n## 🔌 Connectors & Adapters\n\n### LLM Providers\n\n| Provider | Status | Provider | Status |\n|----------|--------|----------|--------|\n| OpenAI | ✅ | Azure OpenAI | ✅ |\n| Anthropic Claude | ✅ | Google Gemini | ✅ |\n| AWS Bedrock | ✅ | Mistral AI | ✅ |\n| Ollama (local) | ✅ | Anyscale | ✅ |\n\n### Vector Databases\n\n| Provider | Status | Provider | Status |\n|----------|--------|----------|--------|\n| Qdrant | ✅ | Pinecone | ✅ |\n| Weaviate | ✅ | PostgreSQL | ✅ |\n| Milvus | ✅ | | |\n\n### Text Extractors\n\n| Provider | Status |\n|----------|--------|\n| LLMWhisperer | ✅ |\n| Unstructured.io | ✅ |\n| LlamaIndex Parse | ✅ |\n\n### ETL Sources & Destinations\n\n**Sources:** AWS S3, MinIO, Google Cloud Storage, Azure Blob, Google Drive, Dropbox, SFTP\n\n**Destinations:** Snowflake, Amazon Redshift, Google BigQuery, PostgreSQL, MySQL, MariaDB, SQL Server, Oracle\n\n[Full Connector List](https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Fsetup_accounts\u002Fwhats_needed)\n\n## 🛠️ Development\n\n### Change Default Credentials\n\nFollow [these steps](backend\u002FREADME.md#authentication) to change the default username and password.\n\n### Local Development\n\n```bash\n# Install pre-commit hooks\n.\u002Fdev-env-cli.sh -p\n\n# Run pre-commit checks\n.\u002Fdev-env-cli.sh -r\n```\n\n[Local Development Guide](https:\u002F\u002Fdocs.unstract.com\u002Funstract\u002Funstract_platform\u002Fuser_guides\u002Frun_platform)\n\n## 🏢 Use Cases by Industry\n\n[Finance & Banking →](https:\u002F\u002Funstract.com\u002Ffinance-automation\u002F) | 
[Insurance →](https:\u002F\u002Funstract.com\u002Finsurance-automation\u002F) | [Healthcare →](https:\u002F\u002Funstract.com\u002Fhealthcare-automation\u002F) | [Income Tax →](https:\u002F\u002Funstract.com\u002Fai-income-tax-forms-data-extraction\u002F)\n\n## ☁️ Cloud & Enterprise\n\nFor teams that need managed infrastructure, advanced accuracy features, or compliance certifications.\n\n- ✅ **LLMChallenge** — dual-LLM verification\n- ✅ **SinglePass & Summarized Extraction** — reduce LLM token costs\n- ✅ **Human-in-the-Loop** — review interface with document highlighting\n- ✅ **SSO & Enterprise RBAC** — SAML\u002FOIDC integration with granular role-based access control\n- ✅ **SOC 2, HIPAA, ISO 27001, GDPR Compliant** — third-party audited security certifications\n- ✅ **Priority Support with SLA** — dedicated support team with response time guarantees\n\n\u003Ca href=\"https:\u002F\u002Funstract.com\u002Fschedule-a-demo\u002F\">\u003Cimg src=\"docs\u002Fassets\u002Fbook-demo-button-blue.svg\" alt=\"Book a Demo\">\u003C\u002Fa>\n\n## 📚 Cookbooks\n\n- [Unstract + PostgreSQL + DeepSeek](https:\u002F\u002Funstract.com\u002Fblog\u002Fopen-source-document-data-extraction-with-unstract-deepseek\u002F)\n- [Unstract + n8n](https:\u002F\u002Funstract.com\u002Fblog\u002Funstract-n8n\u002F)\n- [Unstract + Snowflake](https:\u002F\u002Funstract.com\u002Fblog\u002Fprocess-unstructured-data-with-unstract-snowflake\u002F)\n- [Unstract + BigQuery](https:\u002F\u002Funstract.com\u002Fblog\u002Fprocess-unstructured-data-with-unstract-bigquery\u002F)\n- [Unstract + Crew.AI](https:\u002F\u002Funstract.com\u002Fblog\u002Fagentic-document-extraction-processing-with-unstract-crew-ai\u002F)\n- [Unstract + PydanticAI](https:\u002F\u002Funstract.com\u002Fblog\u002Fbuilding-real-world-ai-agents-with-pydanticai-and-unstract\u002F)\n- [Unstract MCP Server](https:\u002F\u002Funstract.com\u002Fblog\u002Funstract-mcp-server\u002F)\n\n## 🤝 Contributing\n\nWe welcome contributions! 
The easiest way to start:\n\n1. Pick an issue tagged [`good first issue`](https:\u002F\u002Fgithub.com\u002FZipstack\u002Funstract\u002Flabels\u002Fgood%20first%20issue)\n2. Submit a PR\n\n[Report Bug →](https:\u002F\u002Fgithub.com\u002FZipstack\u002Funstract\u002Fissues\u002Fnew?template=bug_report.md) | [Request Feature →](https:\u002F\u002Fgithub.com\u002FZipstack\u002Funstract\u002Fissues\u002Fnew?template=feature_request.md)\n\n## 👋 Community\n\nJoin the LLM-powered document automation community:\n\n[![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBLOG-FF6B6B?style=flat)](https:\u002F\u002Funstract.com\u002Fblog\u002F) [![LinkedIn](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FFOLLOW%20US%20ON%20LINKEDIN-C8A2E8?style=flat)](https:\u002F\u002Fwww.linkedin.com\u002Fshowcase\u002Funstract\u002F) [![Slack](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSLACK-4CAF50?style=flat)](https:\u002F\u002Fjoin-slack.unstract.com) [![X](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FFOLLOW%20US%20ON%20X-FFD700?style=flat)](https:\u002F\u002Ftwitter.com\u002FGetUnstract)\n\n## 📊 A Note on Analytics\n\nUnstract integrates Posthog to track minimal usage analytics. 
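Disable by setting `REACT_APP_ENABLE_POSTHOG=false` in the frontend's `.env` file.\n\n## 📜 License\n\nUnstract is released under the [AGPL-3.0 License](LICENSE).\n\n---\n\n\u003Cdiv align=\"center\">\n  \u003Cp>Built with ❤️ by \u003Ca href=\"https:\u002F\u002Fzipstack.com\">Zipstack\u003C\u002Fa>\u003C\u002Fp>\n  \u003Cp>\n    \u003Ca href=\"https:\u002F\u002Funstract.com\">Website\u003C\u002Fa> ·\n    \u003Ca href=\"https:\u002F\u002Fdocs.unstract.com\">Documentation\u003C\u002Fa> ·\n    \u003Ca href=\"https:\u002F\u002Funstract.com\u002Fpricing\u002F\">Pricing\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n","Unstract is a tool that uses large language models to extract structured data from unstructured documents (such as PDFs and images). Its core feature is defining the information to extract through natural language prompts, with support for deployment as an API or integration into ETL workflows. Technically, it is developed in Python and uses advanced machine learning techniques to parse and transform document content. The project suits enterprises and individual developers who need to process large volumes of unstructured data and convert it into a format that is easy to analyze and use, and it is especially suited to the data extraction step of automated document processing workflows.",2,"2026-06-11 03:40:04","high_star"]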