[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74156":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":15,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74156,"dataclaw","peteromallet\u002Fdataclaw","peteromallet","Agent harness to publish your history from Claude Code et al. as Huggingface datasets. ","",null,"Python",2093,235,9,7,0,1,28,3,67.42,"MIT License",false,"main",true,[],"2026-06-12 04:01:13","# DataClaw\n\n> **This is a performance art project.** Anthropic built their models on the world's freely shared information, then introduced increasingly [dystopian data policies](https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fdetecting-and-preventing-distillation-attacks) to stop anyone else from doing the same with their data - pulling up the ladder behind them. DataClaw lets you throw the ladder back down. The dataset it produces is yours to share.\n\nTurn your Claude Code, Codex, and other coding-agent conversation history into structured data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset.\n\n![DataClaw](dataclaw.jpeg)\n\nEvery export is tagged **`dataclaw`** on Hugging Face. 
Together, they may someday form a growing [distributed dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets?other=dataclaw) of real-world human-AI coding collaboration.\n\n## Download for Mac\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fpeteromallet\u002Fdataclaw\u002Freleases\u002Flatest\u002Fdownload\u002FDataClaw-macOS-Apple-Silicon.dmg\">\n    \u003Cimg alt=\"Download DataClaw for Apple Silicon Macs\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDownload%20for%20Mac-Apple%20Silicon-111111?style=for-the-badge&logo=apple&logoColor=white\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fpeteromallet\u002Fdataclaw\u002Freleases\u002Flatest\">\n    \u003Cimg alt=\"View GitHub Releases\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FView%20Releases-GitHub-0969da?style=for-the-badge&logo=github&logoColor=white\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\nDataClaw ships as a Mac menu-bar app for Apple Silicon MacBooks and desktop Macs. Download the DMG, drag `DataClaw.app` to Applications, then launch it from Applications or Spotlight.\n\nThe current GitHub DMG is unsigned while Apple Developer ID signing is being set up, so macOS may show a Gatekeeper warning on first launch. The app bundles the DataClaw sidecar, so Mac users do not need to install Python, PyInstaller, or the CLI separately.\n\n## Install Options\n\n### Mac app\n\n1. [Download DataClaw for Mac](https:\u002F\u002Fgithub.com\u002Fpeteromallet\u002Fdataclaw\u002Freleases\u002Flatest\u002Fdownload\u002FDataClaw-macOS-Apple-Silicon.dmg).\n2. Open the DMG and drag `DataClaw.app` to Applications.\n3. Launch DataClaw from Applications or Spotlight.\n\nAll release assets are also available on [GitHub Releases](https:\u002F\u002Fgithub.com\u002Fpeteromallet\u002Fdataclaw\u002Freleases\u002Flatest). 
Intel Mac users can use the CLI install for now.\n\nOnce Apple signing secrets are configured, releases can be signed, notarized, and configured for in-app updates through GitHub Releases.\n\n### CLI only\n\nUse this if you want the terminal workflow or you are asking a coding agent to run DataClaw for you:\n\n```bash\npip install -U dataclaw\n```\n\n## Give this to your agent\n\nPaste this into Claude Code, Codex, or any coding agent:\n\n```\nHelp me export my Claude Code, Codex, and other coding-agent conversation history to Hugging Face using DataClaw.\nInstall it, then walk me through the process.\n\nSTEP 1 - INSTALL\n  pip install -U dataclaw\n  If that fails: git clone https:\u002F\u002Fgithub.com\u002Fpeteromallet\u002Fdataclaw.git \u002Ftmp\u002Fdataclaw && pip install -U \u002Ftmp\u002Fdataclaw\n  If that also fails, ask the user where the source is.\n\nSTEP 2 - INSTALL SKILL\n  Skill support is currently only available for Claude Code.\n  dataclaw update-skill claude\n  For other agentic tools, skip this step and do not improvise a custom flow - follow the instructions in DataClaw's output on each step, especially next_steps and next_command.\n\nSTEP 3 - PREP\n  dataclaw prep\n  Every dataclaw command outputs next_steps in its JSON - follow them through the entire flow.\n\nSTEP 3A - CHOOSE SOURCE SCOPE (REQUIRED BEFORE EXPORT)\n  Ask the user explicitly which source scope to export: a supported source key such as claude or codex, or all.\n  dataclaw config --source all\n  Do not export until source scope is explicitly confirmed.\n\nSTEP 3B - CHOOSE PROJECT SCOPE (REQUIRED BEFORE EXPORT)\n  dataclaw list --source all\n  Send the FULL project\u002Ffolder list to the user in a message (name, source, sessions, size, excluded).\n  Ask which projects to exclude.\n  dataclaw config --exclude \"project1,project2\" OR dataclaw config --confirm-projects\n  Do not export until folder selection is explicitly confirmed.\n\nSTEP 3C - SET REDACTED STRINGS\n  Ask the 
user what additional strings should always be redacted, such as company names, client names, domains, internal URLs, or secrets that regex might miss.\n  dataclaw config --redact \"string1,string2\"\n  dataclaw config --redact-usernames \"user1,user2\"\n  Only add these after explicit user confirmation.\n\nSTEP 4 - EXPORT LOCALLY\n  dataclaw export --no-push --output dataclaw_export.jsonl\n\nSTEP 5 - REVIEW AND CONFIRM (REQUIRED BEFORE PUSH)\n  Review PII findings and apply excludes\u002Fredactions as needed.\n  Full name is requested for an exact-name privacy scan against the export.\n  If the user declines sharing full name, use --skip-full-name-scan and attest the skip reason.\n  dataclaw confirm --full-name \"THEIR FULL NAME\" --attest-full-name \"...\" --attest-sensitive \"...\" --attest-manual-scan \"...\"\n\nSTEP 6 - PUBLISH (ONLY AFTER EXPLICIT USER APPROVAL)\n  dataclaw export --publish-attestation \"User explicitly approved publishing to Hugging Face.\"\n  Never publish unless the user explicitly says yes.\n\nIF ANY COMMAND FAILS DUE TO A SKIPPED STEP:\n  Restate the 6-step checklist above and resume from the blocked step (do not skip ahead).\n\nIMPORTANT: Never run bare `hf auth login` when automating this with an agent - always use `--token`.\nIMPORTANT: Always export with --no-push first and review for PII before publishing.\n```\n\n## Manual usage (without an agent)\n\n```bash\n# STEP 1 - INSTALL\npip install -U dataclaw\nhf auth login --token YOUR_TOKEN\n\n# STEP 3 - PREP\ndataclaw prep\ndataclaw config --repo username\u002Fmy-personal-codex-data\n\n# STEP 3A - CHOOSE SOURCE SCOPE\ndataclaw config --source all  # REQUIRED: choose a supported source key or all\n\n# STEP 3B - CHOOSE PROJECT SCOPE\ndataclaw list --source all  # Present full list and confirm folder scope before export\ndataclaw config --exclude \"personal-stuff,scratch\"  # or: dataclaw config --confirm-projects\n\n# STEP 3C - SET REDACTED STRINGS\ndataclaw config --redact-usernames 
\"my_github_handle,my_discord_name\"\ndataclaw config --redact \"my-domain.com,my-secret-project\"\n\n# STEP 4 - EXPORT LOCALLY\ndataclaw export --no-push\n\n# STEP 5 - REVIEW AND CONFIRM\ndataclaw confirm \\\n  --full-name \"YOUR FULL NAME\" \\\n  --attest-full-name \"Asked for full name and scanned export for YOUR FULL NAME.\" \\\n  --attest-sensitive \"Asked about company\u002Fclient\u002Finternal names and private URLs; none found or redactions updated.\" \\\n  --attest-manual-scan \"Manually scanned 20 sessions across beginning\u002Fmiddle\u002Fend and reviewed findings.\"\n\n# Or: if user declines sharing full name\ndataclaw confirm \\\n  --skip-full-name-scan \\\n  --attest-full-name \"User declined to share full name; skipped exact-name scan.\" \\\n  --attest-sensitive \"Asked about company\u002Fclient\u002Finternal names and private URLs; none found or redactions updated.\" \\\n  --attest-manual-scan \"Manually scanned 20 sessions across beginning\u002Fmiddle\u002Fend and reviewed findings.\"\n\n# STEP 6 - PUBLISH\ndataclaw export --publish-attestation \"User explicitly approved publishing to Hugging Face.\"\n```\n\nStep 2 (INSTALL SKILL) is omitted in manual usage.\n\n### Commands\n\n| Command | Description |\n|---------|-------------|\n| `dataclaw status` | Show current stage and next steps |\n| `dataclaw prep` | Discover projects, check HF auth, output JSON |\n| `dataclaw prep --source \u003Csource\\|all>` | Prep with an explicit source scope |\n| `dataclaw list` | List all projects with exclusion status |\n| `dataclaw list --source \u003Csource\\|all>` | List projects for a specific source scope |\n| `dataclaw config` | Show current config |\n| `dataclaw config --repo user\u002Fmy-personal-codex-data` | Set HF repo |\n| `dataclaw config --source \u003Csource\\|all>` | REQUIRED source scope selection (examples include `claude`, `codex`, and others) |\n| `dataclaw config --exclude \"a,b\"` | Add excluded projects (appends) |\n| `dataclaw config --redact 
\"str1,str2\"` | Add strings to always redact (appends) |\n| `dataclaw config --redact-usernames \"u1,u2\"` | Add usernames to anonymize (appends) |\n| `dataclaw config --confirm-projects` | Mark project selection as confirmed |\n| `dataclaw export --no-push` | Export locally only (always do this first) |\n| `dataclaw export --source \u003Csource\\|all> --no-push` | Export a chosen source scope locally |\n| `dataclaw confirm --full-name \"NAME\" --attest-full-name \"...\" --attest-sensitive \"...\" --attest-manual-scan \"...\"` | Scan for PII, run exact-name privacy check, verify review attestations, unlock pushing |\n| `dataclaw confirm --skip-full-name-scan --attest-full-name \"...\" --attest-sensitive \"...\" --attest-manual-scan \"...\"` | Skip exact-name scan when user declines sharing full name (requires skip attestation) |\n| `dataclaw export --publish-attestation \"...\"` | Export and push (requires `dataclaw confirm` first) |\n| `dataclaw export --all-projects` | Include everything (ignore exclusions) |\n| `dataclaw export --no-thinking` | Exclude extended thinking blocks |\n| `dataclaw jsonl-to-yaml [input.jsonl]` | Convert an export JSONL file to human-readable YAML |\n| `dataclaw diff-jsonl --old old.jsonl --new new.jsonl` | Structurally diff two export JSONL files and write YAML |\n| `dataclaw update-skill claude` | Install\u002Fupdate the dataclaw skill for Claude Code |\n\nSet `DATACLAW_WORKERS` to control the worker count used by parallel operations such as `export`, `confirm`, and `diff-jsonl`.\n\n## What gets exported\n\n- User messages - Including voice transcripts and images\n- Assistant responses\n- Assistant thinking - Opt out with `--no-thinking`\n- Tool calls - Tool name, inputs, outputs\n- Token usage - Input\u002Foutput tokens per session\n- Metadata - Model name, git branch, timestamps\n\n### Privacy & Redaction\n\nDataClaw applies multiple layers of protection:\n\n1. 
Username redaction - Your OS username + any configured usernames replaced with stable hashes\n2. Secret redaction - Regex patterns catch JWT tokens, API keys (Anthropic, OpenAI, HF, GitHub, AWS, etc.), database passwords, private keys, Discord webhooks, and more\n3. Entropy analysis - Long high-entropy strings in quotes are flagged as potential secrets\n4. Email redaction - Regex pattern catches email addresses\n5. Custom redaction - You can configure additional strings to redact\n6. Tool call redaction - Tool inputs and outputs are redacted with the same standard as regular messages\n\n**This is NOT foolproof.** Always review your exported data before publishing.\nAutomated redaction cannot catch everything - especially service-specific\nidentifiers, third-party PII, or secrets in unusual formats.\n\nWe recommend converting the exported jsonl into human-readable yaml using `dataclaw jsonl-to-yaml`,\nthen use tools such as [trufflehog](https:\u002F\u002Fgithub.com\u002Ftrufflesecurity\u002Ftrufflehog) and [gitleaks](https:\u002F\u002Fgithub.com\u002Fgitleaks\u002Fgitleaks) to scan it.\nYou can also compare the exported jsonl with a previous baseline using `dataclaw diff-jsonl`.\n\nTo help improve redaction, report issues: https:\u002F\u002Fgithub.com\u002Fpeteromallet\u002Fdataclaw\u002Fissues\n\n### Data schema\n\nEach line in `conversations.jsonl` is one session:\n\n```json\n{\n  \"session_id\": \"abc-123\",\n  \"project\": \"my-project\",\n  \"model\": \"claude-opus-4-6\",\n  \"git_branch\": \"main\",\n  \"start_time\": \"2025-06-15T10:00:00+00:00\",\n  \"end_time\": \"2025-06-15T10:30:00+00:00\",\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"Fix the login bug\",\n      \"content_parts\": [\n        {\"type\": \"image\", \"source\": {\"type\": \"base64\", \"media_type\": \"image\u002Fpng\", \"data\": \"...\"}}\n      ],\n      \"timestamp\": \"...\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"I'll investigate 
the login flow.\",\n      \"thinking\": \"The user wants me to look at...\",\n      \"tool_uses\": [\n          {\n            \"tool\": \"bash\",\n            \"input\": {\"command\": \"grep -r 'login' src\u002F\"},\n            \"output\": {\n              \"text\": \"src\u002Fauth.py:42: def login(user, password):\",\n              \"raw\": {\"stderr\": \"\", \"interrupted\": false}\n            },\n            \"status\": \"success\"\n          }\n        ],\n      \"timestamp\": \"...\"\n    }\n  ],\n  \"stats\": {\n    \"user_messages\": 5, \"assistant_messages\": 8,\n    \"tool_uses\": 20, \"input_tokens\": 50000, \"output_tokens\": 3000\n  }\n}\n```\n\n`messages[].content_parts` is optional and preserves structured user content such as attachments when the source provides them. The canonical human-readable user text remains in `messages[].content`.\n\n`tool_uses[].output.raw` is optional and preserves extra structured tool-result fields when the source provides them. The canonical human-readable result text remains in `tool_uses[].output.text`.\n\nEach HF repo also includes a `metadata.json` with aggregate stats.\n\n## Finding datasets on Hugging Face\n\nAll repos are tagged `dataclaw`.\n\n- **Browse all:** [huggingface.co\u002Fdatasets?other=dataclaw](https:\u002F\u002Fhuggingface.co\u002Fdatasets?other=dataclaw)\n- **Load one:**\n  ```python\n  from datasets import load_dataset\n  ds = load_dataset(\"alice\u002Fmy-personal-codex-data\", split=\"train\")\n  ```\n- **Combine several:**\n  ```python\n  from datasets import load_dataset, concatenate_datasets\n  repos = [\"alice\u002Fmy-personal-codex-data\", \"bob\u002Fmy-personal-codex-data\"]\n  ds = concatenate_datasets([load_dataset(r, split=\"train\") for r in repos])\n  ```\n\nThe auto-generated HF README includes:\n- Model distribution (which models, how many sessions each)\n- Total token counts\n- Project count\n- Last updated timestamp\n\n## Contributing\n\n**Missing data:** If you found any data not 
exported, please report an issue. You can ask your coding agent to analyze the data, export it in this repo, and open a PR.\n\n**Better schema:** If you need to clean the data and want to propose a better schema, feel free to open an issue.\n\n**New provider:** If you use a new coding agent, you can ask it to read this repo and export its data as a new provider. Take the Claude Code and Codex parsers as examples because they are the best maintained. When you finish, ask the following questions:\n- Did you follow the schema above? Currently you are free to add custom fields in `messages[].content_parts` and `tool_uses[].output.raw`.\n- Did you export all data, especially:\n  - tool call inputs and outputs\n  - long inputs and outputs that may be saved somewhere else\n  - binary content (may be encoded as base64) such as images, in both user messages and tool calls. We do not apply the anonymizer to binary content\n  - subagents\n- Does the coding agent automatically delete old sessions? If so, how can this be prevented?\n\n## Code Quality\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"scorecard.png\" alt=\"Code Quality Scorecard\">\n\u003C\u002Fp>\n\n## License\n\nMIT\n","DataClaw is a tool that converts conversation history from coding agents such as Claude Code and Codex into structured data and publishes it to Hugging Face as datasets. It parses session logs, automatically redacts sensitive information and personally identifiable information (PII), and uploads the result to Hugging Face as a ready-to-use dataset with a single command. The project is written in Python and offers two installation options, a Mac menu-bar app and a CLI, which simplifies the workflow. It suits developers who want to share their AI-assisted coding history or who are interested in building a distributed dataset of AI interactions.",2,"2026-06-11 03:49:04","high_star"]