[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72405":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":16,"starSnapshotCount":16,"syncStatus":41,"lastSyncTime":42,"discoverSource":43},72405,"datachain","datachain-ai\u002Fdatachain","datachain-ai","The Context Layer for unstructured data: typed, versioned datasets over S3, GCS, Azure","https:\u002F\u002Fdocs.datachain.ai",null,"Python",2781,145,16,53,0,4,29,37,12,84.69,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37],"ai-agents","claude-code","codex","data-context-layer","data-memory","data-processing","harness-engineering","knowledge-base","mlops","multimodal","pydantic","unstructured-data","2026-06-12 04:01:05","# ![DataChain](docs\u002Fassets\u002Fdatachain.svg) DataChain: The Context Layer for Unstructured Data\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fdatachain.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fdatachain\u002F)\n[![Python 
Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fdatachain)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fdatachain)\n[![Codecov](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdatachain-ai\u002Fdatachain\u002Fgraph\u002Fbadge.svg?token=byliXGGyGB)](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdatachain-ai\u002Fdatachain)\n[![Tests](https:\u002F\u002Fgithub.com\u002Fdatachain-ai\u002Fdatachain\u002Factions\u002Fworkflows\u002Ftests.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fdatachain-ai\u002Fdatachain\u002Factions\u002Fworkflows\u002Ftests.yml)\n[![DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002Fdatachain-ai\u002Fdatachain)\n\n**A Python library that turns files in S3, GCS, and Azure into versioned, typed datasets, queryable at warehouse speed.**\n\n- **Compute Engine**: parallel and distributed Python over files. Async I\u002FO, checkpoint recovery, incremental updates.\n- **Dataset DB**: Pydantic schemas, versioning, file pointers, automatic lineage. Sub-second filter, join, and similarity search over hundreds of millions of records.\n\nOptional, for agent workflows:\n\n- **Knowledge Base**: markdown summaries derived from the Dataset DB and enriched by LLM. Readable by humans and LLMs.\n- **Agent Harness**: skill and MCP server that plug all three into Claude Code, Cursor, and Codex, so they understand your data.\n\nBytes never leave your storage. Every run deposits a typed dataset the next pipeline (or agent) reads instead of recomputing.\n\n## 1. Install\n\n```bash\npip install datachain\n```\n\nTo add the agent skill (Knowledge Base + code generation):\n\n```bash\ndatachain skill install --target claude     # also: --target cursor, --target codex\n```\n\nWorks with S3, GCS, Azure, and local filesystems.\n\n## 2. 
Quickstart: agent-driven pipeline\n\nTask: find dogs in S3 similar to a reference image, filtered by breed, mask availability, and image dimensions.\n\nGrab a reference image and run Claude Code (or other agent):\n```bash\ndatachain cp --anon s3:\u002F\u002Fdc-readme\u002Ffiona.jpg .\n\nclaude\n```\n\nPrompt:\n```prompt\nFind dogs in s3:\u002F\u002Fdc-readme\u002Foxford-pets-micro\u002F similar to fiona.jpg:\n  - Pull breed metadata and mask files from annotations\u002F\n  - Exclude images without mask\n  - Exclude Cocker Spaniels\n  - Only include images wider than 400px\n```\n\nResult:\n```\n  ┌──────┬───────────────────────────────────┬────────────────────────────┬──────────┐\n  │ Rank │               Image               │           Breed            │ Distance │\n  ├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤\n  │    1 │ shiba_inu_52.jpg                  │ shiba_inu                  │    0.244 │\n  ├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤\n  │    2 │ shiba_inu_53.jpg                  │ shiba_inu                  │    0.323 │\n  ├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤\n  │    3 │ great_pyrenees_17.jpg             │ great_pyrenees             │    0.325 │\n  └──────┴───────────────────────────────────┴────────────────────────────┴──────────┘\n\n  Fiona's closest matches are shiba inus (both top spots), which makes sense given her\n  tan coloring and pointed ears.\n```\n\nThe agent decomposed the task into steps - embeddings, breed metadata, mask join, quality filter - and saved each as a named, versioned dataset. 
Next time you ask a related question, it starts from what's already built.\n\nThe datasets are registered in a Knowledge Base optimized for both agents and humans:\n\n```bash\ndc-knowledge\n├── buckets\n│   └── s3\n│       └── dc_readme.md\n├── datasets\n│   ├── oxford_micro_dog_breeds.md\n│   ├── oxford_micro_dog_embeddings.md\n│   └── similar_to_fiona.md\n└── index.md\n```\n\nBrowse it as markdown files, navigate with wikilinks, or open in [Obsidian](https:\u002F\u002Fobsidian.md\u002F):\n\n![Visualize data Knowledge Base](docs\u002Fassets\u002Freadme_obsidian.gif)\n\n\n## 3. Data Harness\n\nCode harnesses (Claude Code, Cursor, Codex) give agents repo context, dedicated tools, and memory across sessions. DataChain adds the same for data: typed datasets the agent reads, chain operations the agent calls (`read_storage`, `map`, `save`), a Dataset DB where its results persist.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Fharness.svg\" alt=\"DataChain as a data harness\" width=\"500\" \u002F>\n\u003C\u002Fp>\n\nA **dataset** is the unit of work - a named, versioned result of a pipeline step like `pets_embeddings@1.0.0`. Every `.save()` registers one.\n\nFor the data-flow architecture (Compute Engine, Dataset DB, Knowledge Base) and how the components connect, see [Architecture](https:\u002F\u002Fdocs.datachain.ai\u002Farchitecture\u002F).\n\n\n## 4. Core concepts\n\n### 4.1. Dataset\n\nA dataset is a versioned data reasoning step - what was computed, from what input, producing what schema. DataChain indexes your storage into one: no data copied, just typed metadata and file pointers. 
Re-runs only process new or changed files.\n\nCreate a dataset manually in `create_dataset.py`:\n```python\nfrom PIL import Image\nimport io\nfrom pydantic import BaseModel\nimport datachain as dc\n\nclass ImageInfo(BaseModel):\n    width: int\n    height: int\n\ndef get_info(file: dc.File) -> ImageInfo:\n    img = Image.open(io.BytesIO(file.read()))\n    return ImageInfo(width=img.width, height=img.height)\n\nds = (\n    dc.read_storage(\n        \"s3:\u002F\u002Fdc-readme\u002Foxford-pets-micro\u002Fimages\u002F**\u002F*.jpg\",\n        anon=True,\n        update=True,\n        delta=True,         # re-runs skip unchanged files\n    )\n    .settings(prefetch=64)\n    .map(info=get_info)\n    .save(\"pets_images\")\n)\nds.show(5)\n```\n\n`pets_images@1.0.0` is now the shared reference to this data - schema, version, lineage, and metadata.\n\nEvery `.save()` registers the dataset in the **Dataset DB**, DataChain's persistent store for schemas, versions, lineage, and processing state, kept locally in a SQLite database at `.datachain\u002Fdb`. Pipelines reference datasets by name, not by path. When the code or input data changes, the next run bumps the dataset version.\n\nThis is what makes a **dataset a management unit:** owned, versioned, and queryable by everyone on the team.\n\n### 4.2. Schemas and types\n\nDataChain uses Pydantic to define the shape of every column. 
The return type of your UDF becomes the dataset schema - each field a queryable column in the Dataset DB.\n\n`show()` in the previous script renders nested fields as dotted columns:\n\n```bash\n                                          file    file  info   info\n                                          path    size width height\n0  oxford-pets-micro\u002Fimages\u002FAbyssinian_141.jpg  111270   461    500\n1  oxford-pets-micro\u002Fimages\u002FAbyssinian_157.jpg  139948   500    375\n2  oxford-pets-micro\u002Fimages\u002FAbyssinian_175.jpg   31265   600    234\n3  oxford-pets-micro\u002Fimages\u002FAbyssinian_220.jpg   10687   300    225\n4    oxford-pets-micro\u002Fimages\u002FAbyssinian_3.jpg   61533   600    869\n\n[Limited by 5 rows]\n```\n\n`.print_schema()` renders its schema:\n```bash\nfile: File@v1\n  source: str\n  path: str\n  size: int\n  version: str\n  etag: str\n  is_latest: bool\n  last_modified: datetime\n  location: Union[dict, list[dict], NoneType]\ninfo: ImageInfo\n  width: int\n  height: int\n```\n\nModels can be arbitrarily nested - a `BBox` inside an `Annotation`, a `List[Citation]` inside an LLM response - and every leaf field stays queryable the same way. The schema lives in the Dataset DB and is enforced at dataset creation time.\n\nThe Dataset DB handles datasets of any size - 100 million files, hundreds of metadata rows - without loading anything into memory. **Pandas is limited by RAM; DataChain is not.** Export to pandas when you need it, on a filtered subset:\n\n```python\nimport datachain as dc\n\ndf = dc.read_dataset(\"pets_images\").filter(dc.C(\"info.width\") > 500).to_pandas()\nprint(df)\n```\n\n### 4.3. 
Fast queries\n\nFilters, aggregations, and joins run as vectorized operations directly against the Dataset DB - metadata never leaves your machine, no files downloaded.\n\n```python\nimport datachain as dc\n\ncnt = (\n    dc.read_dataset(\"pets_images\")\n    .filter(\n        (dc.C(\"info.width\") > 400) &\n        ~dc.C(\"file.path\").ilike(\"%cocker_spaniel%\")   # case-insensitive\n    )\n    .count()\n)\nprint(f\"Large images excluding Cocker Spaniels: {cnt}\")\n```\n\nThe query returns in milliseconds, even at 100M-file scale:\n```\nLarge images excluding Cocker Spaniels: 6\n```\n\n## 5. Resilient Pipelines\n\nWhen computation is expensive, bugs and new data are both inevitable. DataChain tracks processing state in the Dataset DB - so crashes and new data are handled automatically, without changing how you write pipelines.\n\n### 5.1. Data checkpoints\n\nSave to `embed.py`:\n```python\nimport open_clip, torch, io\nfrom PIL import Image\nimport datachain as dc\n\nmodel, _, preprocess = open_clip.create_model_and_transforms(\"ViT-B-32\", \"laion2b_s34b_b79k\")\nmodel.eval()\n\ncounter = 0\n\ndef encode(file: dc.File, model, preprocess) -> list[float]:\n    global counter\n    counter += 1\n    if counter > 236:                                    # ← bug: remove these two lines\n        raise Exception(\"some bug\")                      # ←\n    img = Image.open(io.BytesIO(file.read())).convert(\"RGB\")\n    with torch.no_grad():\n        return model.encode_image(preprocess(img).unsqueeze(0))[0].tolist()\n\n(\n    dc.read_dataset(\"pets_images\")\n    .settings(batch_size=100)\n    .setup(model=lambda: model, preprocess=lambda: preprocess)\n    .map(emb=encode)\n    .save(\"pets_embeddings\")\n)\n```\n\nRunning it fails because of the deliberate bug in the code:\n```\nException: some bug\n```\n\nRemove the two marked lines and re-run - DataChain resumes from image 201, the start of the uncommitted batch (the first two batches of 100 are already committed):\n\n```\n$ python embed.py\nUDF 'encode': Continuing from 
checkpoint\n```\n\n### 5.2. Similarity search\n\nThe vectors live in the Dataset DB alongside all the metadata, stored as `list[float]` fields in Pydantic schemas. Querying them is fast: no files are re-read, and vector search can be combined with non-vector filters such as `info.width`:\n\nPrepare data:\n```bash\ndatachain cp --anon s3:\u002F\u002Fdc-readme\u002Ffiona.jpg .\n```\n\n`similar.py`:\n```python\nimport open_clip, torch, io\nfrom PIL import Image\nimport datachain as dc\n\nmodel, _, preprocess = open_clip.create_model_and_transforms(\"ViT-B-32\", \"laion2b_s34b_b79k\")\nmodel.eval()\n\nref_emb = model.encode_image(\n    preprocess(Image.open(\"fiona.jpg\")).unsqueeze(0)\n)[0].tolist()\n\n(\n    dc.read_dataset(\"pets_embeddings\")\n    .filter(dc.C(\"info.width\") > 500)          # from pets_images - no re-read\n    .mutate(dist=dc.func.cosine_distance(dc.C(\"emb\"), ref_emb))\n    .order_by(\"dist\")\n    .limit(3)\n    .show()\n)\n```\n\nUnder a second - everything runs against the Dataset DB.\n\n\n### 5.3. Incremental updates\n\nThe bucket in this walkthrough is static, so there's nothing new to process. But in production - when new images land in your bucket - re-run the same scripts unchanged. `delta=True` in the original `read_storage` call ensures only new files are processed end to end, while the dataset as a whole is updated to `pets_images@1.0.1`:\n\n```\n$ python create_dataset.py   # 500 new images arrived\nSkipping 10,000 unchanged  ·  indexing 500 new\nSaved pets_images@1.0.1  (+500 records)\n\n# Next day:\n\n$ python create_dataset.py\nSkipping 10,000 unchanged  ·  processing 500 new\nSaved pets_images@1.0.2  (+500 records)\n```\n\n## 6. Knowledge Base\n\nDataChain maintains two layers. The **Dataset DB** is the ground truth: schemas, processing state, lineage, the vectors themselves. **The Knowledge Base** is derived from it: structured markdown for humans and agents to read. Because it's derived, it's always accurate. 
The Knowledge Base is stored in `dc-knowledge\u002F`.\n\nAsk the agent to build it (from Claude Code, Codex, or Cursor):\n```bash\nclaude\n```\n\nPrompt:\n```prompt\nBuild a Knowledge Base for my current datasets\n```\n\nThe skill generates the `dc-knowledge\u002F` directory from the Dataset DB, with one file per dataset and one per bucket, as in the directory tree shown in the Quickstart.\n\n\n## 7. AI-Generated Pipelines\n\nThe skill gives the agent data awareness: it reads `dc-knowledge\u002F` to understand what datasets exist, their schemas, which fields can be joined - and the meaning of columns inferred from the code that produced them.\n\nSee section `2. Quickstart: agent-driven pipeline` above. All the steps written manually in sections 4 and 5 could instead be generated by the agent.\n\n\n## 8. Team and cloud: Studio\n\nData context built locally stays local. DataChain Studio makes it shared.\n\n```bash\ndatachain auth login\ndatachain job run --workers 20 --cluster gpu-pool caption.py\n# ✓ Job submitted → studio.datachain.ai\u002Fjobs\u002F1042\n# Resuming from checkpoint (4,218 already done)...\n# Saved oxford-pets-caps@0.0.1  (3,182 processed)\n```\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Fstudio_architecture.svg\" alt=\"DataChain Studio Architecture\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\nStudio adds: shared dataset registry, access control, UI for video\u002FDICOM\u002FNIfTI\u002Fpoint clouds, lineage graphs, reproducible runs.\n\nBring Your Own Cloud - all data and compute stay in your infrastructure. AWS, GCP, Azure, on-prem Kubernetes.\n\n→ [studio.datachain.ai](https:\u002F\u002Fstudio.datachain.ai)\n\n## 9. Contributing\n\nContributions are very welcome. To learn more, see the [Contributor Guide](https:\u002F\u002Fdocs.datachain.ai\u002Fcontributing).\n\n## 10. 
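Community and Support\n\n- [Report an issue](https:\u002F\u002Fgithub.com\u002Fdatachain-ai\u002Fdatachain\u002Fissues) if you encounter any problems\n- [Docs](https:\u002F\u002Fdocs.datachain.ai\u002F)\n- [Email](mailto:support@datachain.ai)\n- [Twitter](https:\u002F\u002Ftwitter.com\u002Fdatachain_ai)\n","DataChain is a context layer for working with unstructured data, supporting dataset versioning and querying for data types such as images, video, documents, and tables. The project is written in Python and provides a compute engine that supports parallel and distributed file processing, async I\u002FO, checkpoint recovery, and incremental updates. Its Dataset DB provides Pydantic schema definitions, versioning, file pointers, and automatic lineage tracking, enabling sub-second filtering, joins, and similarity search over hundreds of millions of records. DataChain also provides a knowledge base and an agent harness for AI agent workflows, so that tools such as Claude Code can directly understand and operate on the user's data. The tool suits enterprises and research institutions that need to manage and use large-scale unstructured data efficiently, and it is especially suited to building AI-based data processing pipelines.",2,"2026-06-11 03:41:55","high_star"]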