[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75475":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":42,"readmeContent":43,"aiSummary":44,"trendingCount":16,"starSnapshotCount":16,"syncStatus":45,"lastSyncTime":46,"discoverSource":47},75475,"knowhere","Ontos-AI\u002Fknowhere","Ontos-AI","Knowhere extracts, parses, and outputs structured chunks ready for AI Agents and RAG.","https:\u002F\u002Fknowhereto.ai",null,"Python",1216,127,24,6,0,196,445,1160,588,19.32,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40,41],"agent","ai-agents","chromadb","claude","claude-code","cursor","elasticsearch","gemini","gpt","langchain","milvus","qdrant","rag","rag-pipeline","skills","2026-06-12 02:03:34","\u003Cimg width=\"1000\" height=\"233\" alt=\"20260506-102713\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F896e64d2-e50e-4158-b71c-bc69e11c7c65\" \u002F>\n\n\u003Ch1 align=\"center\">Prepare unstructured data for AI Agents\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F\">\n    \u003Cimg alt=\"Python Version\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-%3E%3D%203.11-3776AB.svg?style=for-the-badge&logo=python&logoColor=white&labelColor=000000\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere\u002Fstargazers\">\n    \u003Cimg alt=\"GitHub stars\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fontos-ai\u002Fknowhere?style=for-the-badge&logo=github&labelColor=000000\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere\u002Factions\">\n    \u003Cimg alt=\"Build Status\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002FOntos-AI\u002Fknowhere\u002Fpr-ci.yml?style=for-the-badge&labelColor=000000\">\n  \u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere\u002Fdiscussions\">\n    \u003Cimg alt=\"Join the community on GitHub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin%20the%20community-blueviolet.svg?style=for-the-badge&logo=GitHub&labelColor=000000&logoWidth=20\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fghcr.io\u002Fontos-ai\u002Fknowhere\">\n    \u003Cimg alt=\"Container Images\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCONTAINER%20IMAGES-2496ED.svg?style=for-the-badge&logo=docker&logoColor=white&labelColor=000000\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere\u002Fblob\u002Fmain\u002FLICENSE\">\n    \u003Cimg alt=\"License: Apache 2.0\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAPACHE%202.0-D97706.svg?style=for-the-badge&label=LICENSE&labelColor=000000\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  🔗 \u003Ca href=\"https:\u002F\u002Fknowhereto.ai\">Website\u003C\u002Fa> |\n  📄 \u003Ca href=\"https:\u002F\u002Fdocs.knowhereto.ai\u002F\">Docs\u003C\u002Fa> |\n  🏠 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-self-hosted\">Self-Host\u003C\u002Fa> |\n  🖥️ \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-dashboard\">Dashboard\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Overview\n\n**Knowhere is the memory layer between complex, dirty documents and AI agents.**\n\nIt ingests unstructured 
documents and produces persistent, navigable memory: parsing, hierarchy extraction, multi-modal structuring, and graph construction in a single pipeline. Every chunk retains full semantic context, making the output a natural fit for *Agentic RAG*, *vector-based RAG*, or any LLM workflow.\n\n> [!NOTE]\n> **Get started in seconds with Knowhere Cloud.**\n> Avoid the complexity of self-deployment. Use our managed API at [knowhereto.ai](https:\u002F\u002Fknowhereto.ai) and enjoy **$5 in free credits** upon registration.\n\n## 📢 News\n\n- **May 7, 2026**: 🚀 **Knowhere is now Open Source!** We have open-sourced our entire stack for document ingestion, parsing, and agentic RAG. You can now self-host the full platform using [knowhere-self-hosted](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-self-hosted). Check out our [Contribution Guide](CONTRIBUTING.md) to get involved!\n\n## How it Works\n\nKnowhere runs in two steps: build memory from documents, then let agents retrieve from it.\n\n### Step 1: Parse and Build Memory\n\n\u003Cp align=\"center\">\n  \u003Cimg alt=\"Step 1: Parse and Build Memory\" src=\"docs\u002Fassets\u002Fstep-1-parse-build-memory.png\" width=\"900\">\n\u003C\u002Fp>\n\n- **Parse**: Route PDFs, Office files, images, tables, Markdown, and text to specialized parsers.\n- **Structure**: Our proprietary Tree-like algorithm reconstructs the full document hierarchy instead of flattening it into a sequence, preventing semantic fragmentation across chunks.\n- **Build Memory**: Store chunks, navigation trees, summaries, and graph links as agent-ready context.\n\n### Step 2: Agentic Retrieval\n\n\u003Cp align=\"center\">\n  \u003Cimg alt=\"Step 2: Agentic Retrieval\" src=\"docs\u002Fassets\u002Fstep-2-agentic-retrieval.png\" width=\"900\">\n\u003C\u002Fp>\n\n- **Discover**: Fuse keyword, path, content, and semantic signals for broad first-pass coverage.\n- **Navigate**: Walk section trees and graph links to drill into the most relevant document 
regions.\n- **Cite Evidence**: Return traceable results with source document, section, chunk, and linked assets.\n\n## FAQ\n\n**Q: What is Knowhere's relationship with MinerU?**\n\nA: Knowhere uses MinerU as its default parser because it performs best in our tests. Any parser only gets you raw Markdown. Knowhere's value is what comes after: hierarchy reconstruction, multi-modal normalization, and cross-document graph construction. Any Markdown-outputting tool works.\n\n**Q: What LLM \u002F VLM dependencies does Knowhere have?**\n\nA: By default, DeepSeek (`deepseek-chat`) handles text and table summarization, and Qwen-VL (`qwen3.5-flash`) handles image OCR and descriptions. Knowhere is model-agnostic. Swap in OpenAI, DashScope, Zhipu, or Volcengine via environment variables.\n\n**Q: How is Agentic Retrieval different from traditional RAG?**\n\nA: Traditional RAG does a flat vector lookup and returns isolated snippets. Knowhere's agents navigate the document's section tree and cross-document graph, drilling into the most relevant regions the way a human reader would, returning traceable, well-contextualized evidence.\n\n**Q: Does it handle images and tables?**\n\nA: Yes. 
Knowhere extracts them, runs them through VLMs for summarization and feature extraction, and links them back to their source chunks so agents can retrieve and cite multi-modal assets at inference time.\n\n## Performance Benchmark\n\nAgents using Knowhere outperform those working from raw documents or MinerU-parsed output on real-world tasks: searching, modifying, and answering questions.\n\n\u003Cp align=\"center\">\n  \u003Cimg alt=\"Benchmark Performance: Agent + Knowhere vs Others\" src=\"docs\u002Fassets\u002Fbenchmark.png\" width=\"900\">\n\u003C\u002Fp>\n\n> **We're not developing the next MinerU — we're building document memory infrastructure that agents can effectively consume.**\n\n### Key Advantages\n\n- **Accuracy**: +36% first-try accuracy and +10% recall over raw documents.\n- **Reliability**: 79% accuracy with feedback, vs. a ~53% ceiling on raw docs.\n- **Efficiency**: Fewer loops, fewer tokens, less time. Agents navigate a structured graph instead of reading monolithic text.\n\n*(Internal evaluation across identical agentic RAG tasks. Baseline: MinerU output fed directly to agents.)*\n\n> [!NOTE]\n> **📊 Benchmarks are actively expanding.** More parsers and retrieval baselines coming soon.\n\n## Ecosystem\n\n| Repository | Description |\n|---|---|\n| [knowhere](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere) | **This repo.** Backend API and worker: document ingestion, parsing, graph construction, and retrieval. |\n| 🖥️ [knowhere-dashboard](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-dashboard) | The web UI. Connects to the API for the full product experience. |\n| 🐳 [knowhere-self-hosted](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-self-hosted) | Docker Compose stack for self-hosted deployments. Packages the API, worker, and dashboard together. |\n| 🐍 [knowhere-python-sdk](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-python-sdk) | Official Python SDK for the Knowhere Cloud API. 
|\n| 🦕 [knowhere-node-sdk](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-node-sdk) | Official Node.js SDK for the Knowhere Cloud API. |\n\n## Features\n\n- **Multi-modal Parsing**: High-fidelity extraction from PDF, Office, and images, preserving headings, tables, and hierarchical paths.\n- **Lightweight Memory Graph**: Context-aware organization that links documents and chunks for better relationship understanding.\n- **Agentic RAG**: A hybrid retrieval engine combining traditional search (RRF) with autonomous agent navigation.\n- **Evidence-based Citations**: Every result is backed by traceable source paths, ensuring reliability for AI Agent decision-making.\n\n## Supported Formats\n\n**✅ Supported**\n\n- [x] `.pdf` `.docx` `.pptx` `.xlsx` `.csv`\n- [x] `.jpg` `.png`\n- [x] `.md` `.txt` `.json`\n\n**⏳ Coming Soon**\n\n- [ ] `.epub` `.html` `.xml`\n- [ ] `.mp4` `.mp3`\n- [ ] `.skills.md`\n\nWant to see a new format supported? Adding a parser is a great first contribution. Check out [CONTRIBUTING.md](CONTRIBUTING.md) to get started.\n\n## Prerequisites\n\n- Python 3.11+\n- `uv`\n- Docker with `docker compose`\n\n## Quick Start\n\n1. Sync the workspace dependencies:\n\n```bash\nuv sync --all-packages\n```\n\n2. Copy the environment examples:\n\n```bash\ncp apps\u002Fapi\u002F.env.example apps\u002Fapi\u002F.env\ncp apps\u002Fworker\u002F.env.example apps\u002Fworker\u002F.env\n```\n\n3. Update the copied `.env` files with the values you need for local work:\n\n- database and Redis connection settings\n- S3-compatible storage credentials\n- at least one LLM provider key: `DS_KEY`, `ALI_API_KEYS`, `GPT_API_KEY`, or `GLM_API_KEY`\n- `MINERU_API_KEYS` if you need PDF parsing\n- a vision-capable model provider if you need image summaries, OCR, atlas classification, or image-aware retrieval\n- any optional billing or webhook providers you want to enable\n\nMost parser and retrieval tuning values have code defaults. 
Start with the\nrequired external services first, then override model names, provider URLs,\nbudgets, or concurrency limits only when your deployment needs different\nbehavior. See [docs\u002Fexternal-services.md](docs\u002Fexternal-services.md) for the\nfull dependency matrix.\n\n4. Start the local infrastructure stack:\n\n```bash\n.\u002Fdeploy\u002Flocal-dev\u002Fstart-dev.sh\n```\n\n5. Start the API and worker in separate terminals:\n\n```bash\ncd apps\u002Fapi && uv run main.py\ncd apps\u002Fworker && uv run worker.py\n```\n\nThe API runs migrations during startup.\n\nFor API-only development without the dashboard, create an API-only user\u002Fkey\nafter the API service starts:\n\n```bash\ncd apps\u002Fapi\nuv run scripts\u002Finit_user.py --email you@example.com\n```\n\nIf you plan to use the dashboard, register through the dashboard instead of\nusing `scripts\u002Finit_user.py`.\n\nThe API is now running at `http:\u002F\u002Flocalhost:5005`. If you want the full product experience with a UI, run the [knowhere-dashboard](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere-dashboard) alongside it; it connects to this API out of the box.\n\n## Quality Checks\n\nRun lint checks from the repository root:\n\n```bash\nmake lint\n```\n\nApply safe Ruff fixes:\n\n```bash\nmake lint-fix\n```\n\nRun type checks across the API, worker, and shared source code:\n\n```bash\nmake typecheck\n```\n\nRun both lint and type checks:\n\n```bash\nmake check\n```\n\n## Local Endpoints\n\n- API: `http:\u002F\u002Flocalhost:5005`\n- OpenAPI docs: `http:\u002F\u002Flocalhost:5005\u002Fdocs`\n- LocalStack: `http:\u002F\u002Flocalhost:4566`\n- PostgreSQL: `localhost:5432`\n- Redis: `localhost:6379`\n\n## Additional Guides\n\n- External dependency guide:\n  [docs\u002Fexternal-services.md](docs\u002Fexternal-services.md)\n\n## Citation\n\nIf you use Knowhere in your research, please cite it as:\n\n```bibtex\n@software{knowhere2026,\n  author       = {Ontos AI},\n  title        = 
{Knowhere: Prepare Unstructured Data for AI Agents},\n  year         = {2026},\n  publisher    = {GitHub},\n  url          = {https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere},\n  version      = {2026.04.30.1},\n  license      = {Apache-2.0}\n}\n```\n\n## Communication\n\n- [GitHub Discussions](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere\u002Fdiscussions) for questions, ideas, and general conversation.\n- [GitHub Issues](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere\u002Fissues) for bug reports and feature requests.\n\n## Contribution\n\nAny contributions to Knowhere are more than welcome!\n\nIf you are new to the project, check out the [good first issues](https:\u002F\u002Fgithub.com\u002FOntos-AI\u002Fknowhere\u002Fissues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22). They are well-defined, relatively simple, and a great way to get familiar with the codebase and the contribution workflow.\n\nFor general guidelines on branching, commit conventions, and the review process, take a look at [CONTRIBUTING.md](CONTRIBUTING.md).\n\nOther useful references:\n\n- [SECURITY.md](SECURITY.md): how to report vulnerabilities responsibly.\n- [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md): community behavior expectations.\n- [LICENSE](LICENSE) and [NOTICE](NOTICE): Apache 2.0.\n\n## 👋 We're Hiring!\n\nWe're building the knowledge layer for the Agent era. If that sounds like work you want to do, reach out. Decode the address below and drop us a line:\n\n```bash\necho 'dGVhbUBrbm93aGVyZXRvLmFp' | base64 --decode\n```\n","Knowhere is a tool that extracts, parses, and outputs structured chunks from unstructured documents, preparing information for AI agents and retrieval-augmented generation (RAG). Its core capabilities include document parsing, hierarchy extraction, multi-modal structuring, and graph construction, all in a single pipeline, so that every chunk retains its full semantic context. Technically, Knowhere supports a range of database and vector-store solutions such as ChromaDB, Elasticsearch, and Milvus, and integrates seamlessly into any LLM-based workflow. It suits scenarios that require converting complex or dirty data into a format machines can readily understand and process, such as knowledge management, automated document processing, and intelligent question answering.",2,"2026-06-11 03:52:54","CREATED_QUERY"]