[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72204":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":36,"readmeContent":37,"aiSummary":38,"trendingCount":16,"starSnapshotCount":16,"syncStatus":39,"lastSyncTime":40,"discoverSource":41},72204,"chonkie","chonkie-inc\u002Fchonkie","chonkie-inc","🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines","https:\u002F\u002Fdocs.chonkie.ai",null,"Python",4139,277,12,22,0,7,27,135,21,29.33,"MIT License",false,"main",[26,5,27,28,29,30,31,32,33,34,35],"ai","chunker","chunking-algorithm","llms","rag","retrieval-systems","semantic-chunker","similarity-search","splitting-algorithms","text-splitter","2026-06-12 02:03:00","\u003Cdiv align='center'>\n\n![Chonkie Logo](https:\u002F\u002Fgithub.com\u002Fchonkie-inc\u002Fchonkie\u002Fblob\u002Fmain\u002Fassets\u002Fchonkie_logo_br_transparent_bg.png?raw=true)\n\n# 🦛 Chonkie ✨\n\n[![PyPI version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fchonkie.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fchonkie\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fchonkie-inc\u002Fchonkie.svg)](https:\u002F\u002Fgithub.com\u002Fchonkie-inc\u002Fchonkie\u002Fblob\u002Fmain\u002FLICENSE)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-chonkie.ai-blue.svg)](https:\u002F\u002Fdocs.chonkie.ai)\n[![Package 
size](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsize-505KB-blue)](https:\u002F\u002Fgithub.com\u002Fchonkie-inc\u002Fchonkie\u002Fblob\u002Fmain\u002FREADME.md#installation)\n[![codecov](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fchonkie-inc\u002Fchonkie\u002Fgraph\u002Fbadge.svg?token=V4EWIJWREZ)](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fchonkie-inc\u002Fchonkie)\n[![Downloads](https:\u002F\u002Fstatic.pepy.tech\u002Fbadge\u002Fchonkie)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fchonkie)\n[![Discord](https:\u002F\u002Fdcbadge.limes.pink\u002Fapi\u002Fserver\u002Fhttps:\u002F\u002Fdiscord.gg\u002FvH3SkRqmUz?style=flat)](https:\u002F\u002Fdiscord.gg\u002FvH3SkRqmUz)\n[![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchonkie-inc\u002Fchonkie.svg)](https:\u002F\u002Fgithub.com\u002Fchonkie-inc\u002Fchonkie\u002Fstargazers)\n\n_The lightweight ingestion library for fast, efficient and robust RAG pipelines_\n\n[Installation](#📦-installation) •\n[Usage](#🚀-usage) •\n[Chunkers](#✂️-chunkers) •\n[Integrations](#🔌-integrations) •\n[Benchmarks](#📊-benchmarks)\n\n\u003C\u002Fdiv>\n\nTired of making your gazillionth chunker? Sick of the overhead of large libraries? Want to chunk your texts quickly and efficiently? Chonkie the mighty hippo is here to help!\n\n**🚀 Feature-rich**: All the CHONKs you'd ever need \u003C\u002Fbr>\n**🔄 End-to-end**: Fetch, CHONK, refine, embed and ship straight to your vector DB! \u003C\u002Fbr>\n**✨ Easy to use**: Install, Import, CHONK \u003C\u002Fbr>\n**⚡ Fast**: CHONK at the speed of light! zooooom \u003C\u002Fbr>\n**🪶 Light-weight**: No bloat, just CHONK \u003C\u002Fbr>\n**🔌 32+ [integrations](#integrations)**: Works with your favorite tools and vector DBs out of the box! 
\u003C\u002Fbr>\n**💬 ️Multilingual**: Out-of-the-box support for 56 languages \u003C\u002Fbr>\n**☁️ Cloud-Friendly**: CHONK locally or in the [Cloud](https:\u002F\u002Flabs.chonkie.ai) \u003C\u002Fbr>\n**🦛 Cute CHONK mascot**: psst it's a pygmy hippo btw \u003C\u002Fbr>\n**❤️ [Moto Moto](#acknowledgements)'s favorite python library** \u003C\u002Fbr>\n\n**Chonkie** is a chunking library that \"**just works**\" ✨\n\n## 📦 Installation\n\n### Basic Installation\n\nUsing pip:\n\n```bash\npip install chonkie\n```\n\nOr using [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) (faster):\n\n```bash\nuv pip install chonkie\n```\n\n### Full Installation\n\nChonkie follows the rule of minimum installs.\nHave a favorite chunker? Read our [docs](https:\u002F\u002Fdocs.chonkie.ai) to install only what you need.\nDon't want to think about it? Simply install `all` (Not recommended for production environments).\n\nUsing pip:\n\n```bash\npip install \"chonkie[all]\"\n```\n\nOr using [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F):\n\n```bash\nuv pip install \"chonkie[all]\"\n```\n\n## 🚀 Usage\n\n### Basic Usage\n\nHere's a basic example to get you started:\n\n```python\n# First import the chunker you want from Chonkie\nfrom chonkie import RecursiveChunker\n\n# Initialize the chunker\nchunker = RecursiveChunker()\n\n# Chunk some text\nchunks = chunker(\"Chonkie is the goodest boi! My favorite chunking hippo hehe.\")\n\n# Access chunks\nfor chunk in chunks:\n    print(f\"Chunk: {chunk.text}\")\n    print(f\"Tokens: {chunk.token_count}\")\n```\n\n### Pipeline Usage\n\nYou can also use the `chonkie.Pipeline` to chain components together and handle complex workflows. 
Read more about pipelines in the [docs](https:\u002F\u002Fdocs.chonkie.ai\u002Foss\u002Fpipelines)!\n\n```python\nfrom chonkie import Pipeline\n\n# Create a pipeline with multiple chunking and refinement steps\npipe = (\n    Pipeline()\n    .chunk_with(\"recursive\", tokenizer=\"gpt2\", chunk_size=2048, recipe=\"markdown\")\n    .chunk_with(\"semantic\", chunk_size=512)\n    .refine_with(\"overlap\", context_size=128)\n    .refine_with(\"embeddings\", embedding_model=\"sentence-transformers\u002Fall-MiniLM-L6-v2\")\n)\n\n# CHONK some Texts!\ndoc = pipe.run(texts=\"Chonkie is the goodest boi! My favorite chunking hippo hehe.\")\n\n# Access the processed chunks in the `doc` object\nfor chunk in doc.chunks:\n    print(chunk.text)\n\n# Run asynchronously for high-throughput applications\nimport asyncio\n\nasync def main():\n    doc = await pipe.arun(texts=\"Chonkie runs fast!\")\n    print(len(doc.chunks))\n\nasyncio.run(main())\n```\n\nCheck out more usage examples in the [docs](https:\u002F\u002Fdocs.chonkie.ai)!\n\n## 🌐 API Server\n\nRun Chonkie as a self-hosted REST API for easy integration into any application:\n\n```bash\n# Install with API dependencies (includes catsu for multi-provider embeddings)\npip install \"chonkie[api,semantic,code,catsu]\"\n\n# Start the server using the CLI\nchonkie serve\n\n# Or with custom options\nchonkie serve --port 3000 --reload --log-level debug\n\n# Or directly with uvicorn\nuvicorn chonkie.api.main:app --host 0.0.0.0 --port 8000\n```\n\nOr use Docker:\n\n```bash\ndocker compose up\n```\n\nThe API provides endpoints for all chunkers, refineries, and **pipelines** — reusable workflow configurations stored in a local SQLite database.\n\n```bash\n# Create a reusable pipeline\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fpipelines \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"name\": \"rag-chunker\",\n    \"steps\": [\n      {\"type\": \"chunk\", \"chunker\": \"semantic\", \"config\": 
{\"chunk_size\": 512}},\n      {\"type\": \"refine\", \"refinery\": \"embeddings\", \"config\": {\"embedding_model\": \"text-embedding-3-small\"}}\n    ]\n  }'\n\n# List your pipelines\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fpipelines\n```\n\nInteractive documentation is available at `\u002Fdocs` when the server is running.\n\n## ✂️ Chunkers\n\nChonkie provides several chunkers to help you split your text efficiently for RAG applications. Here's a quick overview of the available chunkers:\n\n| Name               | Alias       | Description                                                                                                                |\n| ------------------ | ----------- | -------------------------------------------------------------------------------------------------------------------------- |\n| `TokenChunker`     | `token`     | Splits text into fixed-size token chunks.                                                                                  |\n| `FastChunker`      | `fast`      | SIMD-accelerated byte-based chunking at 100+ GB\u002Fs. Included in the default install.                                        |\n| `SentenceChunker`  | `sentence`  | Splits text into chunks based on sentences.                                                                                |\n| `RecursiveChunker` | `recursive` | Splits text hierarchically using customizable rules to create semantically meaningful chunks.                              |\n| `SemanticChunker`  | `semantic`  | Splits text into chunks based on semantic similarity. Inspired by the work of [Greg Kamradt](https:\u002F\u002Fgithub.com\u002Fgkamradt). |\n| `LateChunker`      | `late`      | Embeds text and then splits it to have better chunk embeddings.                                                            |\n| `CodeChunker`      | `code`      | Splits code into structurally meaningful chunks.                                                                           
|\n| `NeuralChunker`    | `neural`    | Splits text using a neural model.                                                                                          |\n| `SlumberChunker`   | `slumber`   | Splits text using an LLM to find semantically meaningful chunks. Also known as _\"AgenticChunker\"_.                         |\n\nMore on these methods and the approaches taken inside the [docs](https:\u002F\u002Fdocs.chonkie.ai)\n\n## 🔌 Integrations\n\nChonkie boasts 32+ integrations across tokenizers, embedding providers, LLMs, refineries, porters, vector databases, and utilities, ensuring it fits seamlessly into your existing workflow.\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>👨‍🍳 Chefs & 📁 Fetchers! Text preprocessing and data loading!\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nChefs handle text preprocessing, while Fetchers load data from various sources.\n\n| Component | Class         | Description                           | Optional Install |\n| --------- | ------------- | ------------------------------------- | ---------------- |\n| `chef`    | `TextChef`    | Text preprocessing and cleaning.      | `default`        |\n| `fetcher` | `FileFetcher` | Load text from files and directories. | `default`        |\n\n\u003C\u002Fdetails>\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🏭 Refine your CHONKs with Context and Embeddings! Chonkie supports 2+ refineries!\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nRefineries help you post-process and enhance your chunks after initial chunking.\n\n| Refinery Name | Class                | Description                                   | Optional Install    |\n| ------------- | -------------------- | --------------------------------------------- | ------------------- |\n| `overlap`     | `OverlapRefinery`    | Merge overlapping chunks based on similarity. | `default`           |\n| `embeddings`  | `EmbeddingsRefinery` | Add embeddings to chunks using any provider.  
| `chonkie[semantic]` |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🐴 Exporting CHONKs! Chonkie supports 2+ Porters!\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nPorters help you save your chunks easily.\n\n| Porter Name | Class            | Description                            | Optional Install    |\n| ----------- | ---------------- | -------------------------------------- | ------------------- |\n| `json`      | `JSONPorter`     | Export chunks to a JSON file.          | `default`           |\n| `datasets`  | `DatasetsPorter` | Export chunks to HuggingFace datasets. | `chonkie[datasets]` |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🤝 Shake hands with your DB! Chonkie connects with 8+ vector stores!\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nHandshakes provide a unified interface to ingest chunks directly into your favorite vector databases.\n\n| Handshake Name | Class                  | Description                                  | Optional Install    |\n| -------------- | ---------------------- | -------------------------------------------- | ------------------- |\n| `chroma`       | `ChromaHandshake`      | Ingest chunks into ChromaDB.                 | `chonkie[chroma]`   |\n| `elastic`      | `ElasticHandshake`     | Ingest chunks into Elasticsearch.            | `chonkie[elastic]`  |\n| `mongodb`      | `MongoDBHandshake`     | Ingest chunks into MongoDB.                  | `chonkie[mongodb]`  |\n| `pgvector`     | `PgvectorHandshake`    | Ingest chunks into PostgreSQL with pgvector. | `chonkie[pgvector]` |\n| `pinecone`     | `PineconeHandshake`    | Ingest chunks into Pinecone.                 | `chonkie[pinecone]` |\n| `qdrant`       | `QdrantHandshake`      | Ingest chunks into Qdrant.                   | `chonkie[qdrant]`   |\n| `turbopuffer`  | `TurbopufferHandshake` | Ingest chunks into Turbopuffer.              
| `chonkie[tpuf]`     |\n| `weaviate`     | `WeaviateHandshake`    | Ingest chunks into Weaviate.                 | `chonkie[weaviate]` |\n\n\u003C\u002Fdetails>\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🪓 Slice 'n' Dice! Chonkie supports 5+ ways to tokenize! \u003C\u002Fstrong>\u003C\u002Fsummary>\n\nChoose from supported tokenizers or provide your own custom token counting function. Flexibility first!\n\n| Name           | Description                                                    | Optional Install      |\n| -------------- | -------------------------------------------------------------- | --------------------- |\n| `character`    | Basic character-level tokenizer. **Default tokenizer.**        | `default`             |\n| `word`         | Basic word-level tokenizer.                                    | `default`             |\n| `byte`         | Byte-level tokenizer operating on UTF-8 encoded bytes.         | `default`             |\n| `tokenizers`   | Load any tokenizer from the Hugging Face `tokenizers` library. | `chonkie[tokenizers]` |\n| `tiktoken`     | Use OpenAI's `tiktoken` library (e.g., for `gpt-4`).           | `chonkie[tiktoken]`   |\n| `transformers` | Load tokenizers via `AutoTokenizer` from HF `transformers`.    | `chonkie[neural]`     |\n\n`default` indicates that the feature is available with the default `pip install chonkie`.\n\nTo use a custom token counter, you can pass in any function that takes a string and returns an integer! Something like this:\n\n```python\ndef custom_token_counter(text: str) -> int:\n    return len(text)\n\nchunker = RecursiveChunker(tokenizer=custom_token_counter)\n```\n\nYou can use this to extend Chonkie to support any tokenization scheme you want!\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🧠 Embed like a boss! Chonkie links up with 9+ embedding pals!\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nSeamlessly works with various embedding model providers. 
Bring your favorite embeddings to the CHONK party! Use `AutoEmbeddings` to load models easily.\n\n| Provider \u002F Alias        | Class                           | Description                            | Optional Install        |\n| ----------------------- | ------------------------------- | -------------------------------------- | ----------------------- |\n| `model2vec`             | `Model2VecEmbeddings`           | Use `Model2Vec` models.                | `chonkie[model2vec]`    |\n| `sentence-transformers` | `SentenceTransformerEmbeddings` | Use any `sentence-transformers` model. | `chonkie[st]`           |\n| `openai`                | `OpenAIEmbeddings`              | Use OpenAI's embedding API.            | `chonkie[openai]`       |\n| `azure-openai`          | `AzureOpenAIEmbeddings`         | Use Azure OpenAI embedding service.    | `chonkie[azure-openai]` |\n| `cohere`                | `CohereEmbeddings`              | Use Cohere's embedding API.            | `chonkie[cohere]`       |\n| `gemini`                | `GeminiEmbeddings`              | Use Google's Gemini embedding API.     | `chonkie[gemini]`       |\n| `jina`                  | `JinaEmbeddings`                | Use Jina AI's embedding API.           | `chonkie[jina]`         |\n| `voyageai`              | `VoyageAIEmbeddings`            | Use Voyage AI's embedding API.         | `chonkie[voyageai]`     |\n| `litellm`               | `LiteLLMEmbeddings`             | Use LiteLLM for 100+ embedding models. | `chonkie[litellm]`      |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🧞‍♂️ Power Up with Genies! 
Chonkie supports 5+ LLM providers!\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nGenies provide interfaces to interact with Large Language Models (LLMs) for advanced chunking strategies or other tasks within the pipeline.\n\n| Genie Name     | Class              | Description                                | Optional Install        |\n| -------------- | ------------------ | ------------------------------------------ | ----------------------- |\n| `gemini`       | `GeminiGenie`      | Interact with Google Gemini APIs.          | `chonkie[gemini]`       |\n| `openai`       | `OpenAIGenie`      | Interact with OpenAI APIs.                 | `chonkie[openai]`       |\n| `azure-openai` | `AzureOpenAIGenie` | Interact with Azure OpenAI APIs.           | `chonkie[azure-openai]` |\n| `groq`         | `GroqGenie`        | Fast inference on Groq hardware.           | `chonkie[groq]`         |\n| `cerebras`     | `CerebrasGenie`    | Fastest inference on Cerebras hardware.    | `chonkie[cerebras]`     |\n\nYou can also use the `OpenAIGenie` to interact with any LLM provider that supports the OpenAI API format, by simply changing the `model`, `base_url`, and `api_key` parameters. For example, here's how to use the `OpenAIGenie` to interact with the `Llama-4-Maverick` model via OpenRouter:\n\n```python\nfrom chonkie import OpenAIGenie\n\ngenie = OpenAIGenie(model=\"meta-llama\u002Fllama-4-maverick\",\n                    base_url=\"https:\u002F\u002Fopenrouter.ai\u002Fapi\u002Fv1\",\n                    api_key=\"your_api_key\")\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🛠️ Utilities & Helpers! 
Chonkie includes handy tools!\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nAdditional utilities to enhance your chunking workflow.\n\n| Utility Name | Class        | Description                                    | Optional Install |\n| ------------ | ------------ | ---------------------------------------------- | ---------------- |\n| `hub`        | `Hubbie`     | Simple wrapper for HuggingFace Hub operations. | `chonkie[hub]`   |\n| `viz`        | `Visualizer` | Rich console visualizations for chunks.        | `chonkie[viz]`   |\n\n\u003C\u002Fdetails>\n\nWith Chonkie's wide range of integrations, you can easily plug it into your existing infrastructure and start CHONKING!\n\n## 🤖 AI Agent Skills & Plugins\n\nChonkie provides an official skill and plugin for AI coding agents, giving them deep knowledge of Chonkie's API, chunking strategies, and pipeline patterns — so they can help you build RAG pipelines faster.\n\n**Supported agents:** Claude Code, Cursor, Gemini CLI, and more.\n\n```bash\n# Via skills.sh (works with Claude Code, Cursor, Copilot, and 20+ agents)\nnpx skills add chonkie-inc\u002Fskills\n\n# Claude Code only\n\u002Fplugin marketplace add chonkie-inc\u002Fskills\n```\n\nOnce installed, your agent gains knowledge of all chunkers, the Pipeline API, tokenizer selection, embeddings refineries, vector DB handshakes, the REST API server, recipes, and async\u002Fbatch processing patterns.\n\nLearn more at [github.com\u002Fchonkie-inc\u002Fskills](https:\u002F\u002Fgithub.com\u002Fchonkie-inc\u002Fskills).\n\n## 📊 Benchmarks\n\n> \"I may be smol hippo, but I pack a big punch!\" 🦛\n\nChonkie is not just cute, it's also fast and efficient! 
Here's how it stacks up against the competition:\n\n**Size**📦\n\n- **Wheel Size:** 505KB (vs 1-12MB for alternatives)\n- **Installed Size:** 49MB (vs 80-171MB for alternatives)\n- **With Semantic:** Still 10x lighter than the closest competition!\n\n**Speed**⚡\n\n- **Token Chunking:** 33x faster than the slowest alternative\n- **Sentence Chunking:** Almost 2x faster than competitors\n- **Semantic Chunking:** Up to 2.5x faster than others\n\nCheck out our detailed [benchmarks](BENCHMARKS.md) to see how Chonkie races past the competition! 🏃‍♂️💨\n\n## 🤝 Contributing\n\nWant to help grow Chonkie? Check out [CONTRIBUTING.md](CONTRIBUTING.md) to get started! Whether you're fixing bugs, adding features, or improving docs, every contribution helps make Chonkie a better CHONK for everyone.\n\nRemember: No contribution is too small for this tiny hippo! 🦛\n\n## 🙏 Acknowledgements\n\nChonkie would like to CHONK its way through a special thanks to all the users and contributors who have helped make this library what it is today! 
Your feedback, issue reports, and improvements have helped make Chonkie the CHONKIEST it can be.\n\nAnd of course, special thanks to [Moto Moto](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=I0zZC4wtqDQ&t=5s) for endorsing Chonkie with his famous quote:\n\n> \"I like them big, I like them chonkie.\" ~ Moto Moto\n\n## 📝 Citation\n\nIf you use Chonkie in your research, please cite it as follows:\n\n```bibtex\n@software{chonkie2025,\n  author = {Minhas, Bhavnick AND Nigam, Shreyash},\n  title = {Chonkie: The lightweight ingestion library for fast, efficient and robust RAG pipelines},\n  year = {2025},\n  publisher = {GitHub},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fchonkie-inc\u002Fchonkie}},\n}\n```\n","Chonkie is a lightweight text-processing library designed for fast, efficient, and robust RAG (Retrieval-Augmented Generation) pipelines. It provides a range of text-chunking algorithms, supports 56 languages, and integrates with more than 32 tools and vector databases. Its core features include efficient text chunking, multilingual support, and compatibility with popular tools and cloud services. It is particularly suited to applications that process large volumes of text for semantic search or information retrieval, such as intelligent customer-service systems, document management platforms, or knowledge graphs.",2,"2026-06-11 03:40:51","high_star"]