[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79293":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":44,"readmeContent":45,"aiSummary":46,"trendingCount":16,"starSnapshotCount":16,"syncStatus":47,"lastSyncTime":48,"discoverSource":49},79293,"duckle","SouravRoy-ETL\u002Fduckle","SouravRoy-ETL","Local-first ETL\u002FELT studio: a drag-and-drop visual pipeline designer that compiles to SQL and runs on DuckDB. Tiny desktop app, no servers, git-friendly workspaces.","",null,"Rust",389,28,6,11,0,9,146,319,70,92.39,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43],"data-engineering","data-integration","data-pipeline","data-quality","desktop-app","drag-and-drop","duckdb","elt","embedded","etl","local-first","react","rust","sql","tauri","typescript","vector-database","2026-06-12 04:01:24","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"docs\u002Fassets\u002Fhero.svg\" alt=\"Duckle\" width=\"100%\"\u002F>\n\n\u003Ch3>The local-first data studio with a built-in AI assistant.\u003C\u002Fh3>\n\n\u003Cp>\u003Cb>Duckle\u003C\u002Fb> is an open-source desktop ETL \u002F ELT studio. Drag a pipeline onto the canvas, describe what you need in plain English to \u003Cb>Duckie\u003C\u002Fb> (the on-device AI assistant), and execute at native speed through DuckDB. 290+ connectors, 50+ transforms, a built-in scheduler, and a chat assistant that runs entirely on your CPU. Ships as a ~30 MB desktop app. 
No cloud, no servers, no lock-in.\u003C\u002Fp>\n\n\u003Cp>\n\u003Cimg alt=\"status\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fstatus-beta-3b82f6\"\u002F>\n\u003Cimg alt=\"license\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT%20OR%20Apache--2.0-blue\"\u002F>\n\u003Cimg alt=\"platforms\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fplatforms-Windows%20%C2%B7%20macOS%20%C2%B7%20Linux-2b6cb0\"\u002F>\n\u003Cimg alt=\"rust\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRust-000000?logo=rust&logoColor=white\"\u002F>\n\u003Cimg alt=\"tauri\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTauri%202-24C8DB?logo=tauri&logoColor=white\"\u002F>\n\u003Cimg alt=\"react\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FReact%2019-20232A?logo=react&logoColor=61DAFB\"\u002F>\n\u003Cimg alt=\"typescript\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTypeScript-3178C6?logo=typescript&logoColor=white\"\u002F>\n\u003Cimg alt=\"duckdb\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDuckDB-FFF000?logo=duckdb&logoColor=black\"\u002F>\n\u003Cimg alt=\"stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSouravRoy-ETL\u002Fduckle?style=social\"\u002F>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n---\n\n## Quick links\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd valign=\"top\" width=\"25%\">\n\n**Get started**\n\n- [What is Duckle?](#what-is-duckle)\n- [Quickstart (60 s)](#quickstart-60-seconds)\n- [Download \u002F Install](#download--install)\n- [Build from source](#build-from-source)\n- [Run your first pipeline](#run-your-first-pipeline)\n\n\u003C\u002Ftd>\n\u003Ctd valign=\"top\" width=\"25%\">\n\n**Use the product**\n\n- [Meet Duckie (AI)](#meet-duckie---the-local-ai-pipeline-assistant)\n- [How to use Duckle](#how-to-use-duckle)\n- [Recipes \u002F examples](#recipes-and-examples)\n- [In-app Git (GitHub\u002FGitLab)](#git-integration-github--gitlab)\n- [Workspace + Git 
flow](#workspace-and-git-flow)\n- [Schedules](#schedules-and-triggers)\n- [Connection management](#connection-management)\n- [Context variables](#context-variables)\n\n\u003C\u002Ftd>\n\u003Ctd valign=\"top\" width=\"25%\">\n\n**Reference**\n\n- [Capabilities matrix](#capabilities)\n- [Sources](#sources-74-available)\n- [Transforms](#transforms-126-available)\n- [Sinks](#sinks-58-available)\n- [Data quality](#data-quality-12-available)\n- [Custom code](#custom-code-7-available)\n- [Control flow](#control-flow-14-available)\n- [Advanced settings](#advanced-settings-per-node)\n- [Engines](#engines)\n- [Configuration](#configuration)\n\n\u003C\u002Ftd>\n\u003Ctd valign=\"top\" width=\"25%\">\n\n**Resources**\n\n- [Architecture](#architecture)\n- [Clean data for AI](#clean-data-before-it-reaches-your-ai)\n- [Performance tips](#performance-tips)\n- [FAQ](#faq)\n- [Troubleshooting](#troubleshooting)\n- [CI \u002F CD](#ci--cd)\n- [Status](#status)\n- [Roadmap](#roadmap)\n- [Contributing](#contributing)\n- [License](#license)\n- [Releases](https:\u002F\u002Fgithub.com\u002FSouravRoy-ETL\u002Fduckle\u002Freleases)\n- [Roadmap doc](docs\u002Froadmap.md)\n- [Contributing doc](CONTRIBUTING.md)\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n---\n\n## What is Duckle?\n\nA visual data pipeline studio that runs on your laptop. Drag sources, transforms, validators, and sinks onto a canvas. Wire them together. Press **Run**. Duckle compiles the graph to SQL and executes it through a real columnar engine, with live previews, generated SQL on every node, and zero hidden state.\n\nThree things make Duckle different from the heavyweights and the toy ETL tools:\n\n1. **An AI assistant that ships in the box.** Describe the pipeline you want in English; Duckie writes the JSON and drops it onto the canvas. The model runs locally - no API key, no telemetry, no cloud round-trip.\n2. 
**290+ connectors at install time.** Files, lakehouses, SQL databases, warehouses, NoSQL, vector DBs, streaming brokers, SaaS REST\u002FGraphQL APIs, even FTP and IMAP - working today, not coming-soon.\n3. **A self-contained binary you can audit.** ~30 MB download. Engines install on first launch. Workspaces are plain files in a folder you choose. Diff them, branch them, ship them.\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"docs\u002Fassets\u002Fflow.svg\" alt=\"Sources flow through 50+ transforms into files, databases, object storage, vector stores, and AI\" width=\"100%\"\u002F>\n\u003C\u002Fdiv>\n\n---\n\n## Meet Duckie - the local AI pipeline assistant\n\n> Describe what you need. Duckie writes the pipeline.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"docs\u002Fassets\u002Freal-life-screenshot\u002Fduckie-ai-assistant.png\" alt=\"Duckie AI Assistant chat panel generating a CSV-to-Parquet pipeline from a natural-language description\" width=\"100%\"\u002F>\n\u003C\u002Fp>\n\nThe sidebar on the right is **Duckie AI Assistant** - powered by **Qwen 2.5 Coder 1.5B** running through **llama.cpp**, downloaded once (~1.1 GB) and then run entirely on your CPU. Ask in plain English; Duckie streams back a valid Duckle pipeline definition. One click drops it onto the canvas, ready to inspect, tweak, and run.\n\n| | |\n|---|---|\n| **Truly local** | The Qwen model runs as a `llama-server` subprocess on `127.0.0.1`. No API keys. No network calls. Disconnect your wifi and it keeps working. |\n| **Streamed responses** | Tokens arrive as they're generated, with a blinking caret in the bubble. No \"wait 20 seconds for the spinner to vanish\" UX. |\n| **One-click insert** | When Duckie produces a JSON pipeline, an **Insert into canvas** button appears. The graph populates with positioned nodes, wired edges, and the props the model chose. 
|\n| **Bring-your-own-model option** | The chat plumbing is the same OpenAI-compatible HTTP interface used by `xf.ai.llm` \u002F `xf.ai.embed` connectors. Point `baseUrl` at Ollama, llama.cpp, Cohere, OpenAI, Voyage - anything that speaks the OpenAI shape. |\n| **Sandboxed** | The model has no fs \u002F net \u002F tool access. It can only emit text - your pipeline JSON. |\n\n---\n\n## Why Duckle is different\n\n| | |\n|---|---|\n| **Visual, never opaque** | The canvas compiles to SQL you can read, and every node has a live preview tab. No black box. |\n| **Local-first AI** | An assistant that runs on your laptop without an API key. Your prompts, your data, your machine. |\n| **Compact binary, no bundled DB** | ~30 MB app. DuckDB downloads on first launch with a guided step. AI engine is opt-in. |\n| **Native speed** | Execution runs through DuckDB: vectorized, columnar, local. A clean-and-export job that crawls in a spreadsheet finishes in milliseconds. |\n| **Git-friendly by design** | Pipelines, connections, contexts, and routines persist as plain files in a folder you pick. Diff them, branch them, review them. |\n| **290+ connectors that work** | Files, databases, warehouses, lakehouses, object stores, SaaS APIs, NoSQL, streaming brokers, vector DBs, FTP, IMAP, SMTP. Each is covered by tests. |\n| **Honest about scope** | Single-machine and embedded by design. Built to make local and small-team data work fast, not to replace a distributed warehouse. |\n| **60 UI languages** | Topbar, palette, chat assistant, properties panel, and common dialogs ship localized. 
English, Spanish, Chinese (Simplified + Traditional), Hindi, Arabic, Portuguese (Brazil), Bengali, Russian, Japanese, Punjabi, German, Korean, French, Vietnamese, Telugu, Marathi, Turkish, Tamil, Urdu, Persian, Polish, Italian, Ukrainian, Indonesian, Thai, Dutch, Hebrew, Swedish, Greek, Czech, Hungarian, Romanian, Filipino, Malay, Norwegian, Danish, Finnish, Catalan, Bulgarian, Slovak, Croatian, Serbian, Slovenian, Lithuanian, Latvian, Estonian, Khmer, Burmese, Sinhala, Nepali, Swahili, Afrikaans, Welsh, Irish, Icelandic, Albanian, Azerbaijani, Mongolian, Kazakh. RTL (Arabic, Hebrew, Persian, Urdu) supported. Switch languages from the topbar globe. |\n| **Open source** | Dual-licensed MIT OR Apache-2.0. Yours to use, fork, and extend. |\n\n---\n\n## Status\n\nDuckle is in **public beta**. The visual designer, the DuckDB execution engine, the scheduler, the cloud connectors, and the Duckie AI assistant all work today and are covered by 170+ integration tests across Linux, macOS, and Windows. The catalog is still growing and APIs may evolve before 1.0, but the day-to-day surface is stable enough for real work.\n\n**Scope, stated plainly:** Duckle is a single-machine, embedded studio. If you outgrow one box, point Duckle's output at the system that scales (a warehouse, an object store, a lakehouse). 
It will not pretend to be a cluster.\n\nThe component palette ships **313 nodes** so the roadmap is visible in the product itself:\n\n- **292 available** run on the DuckDB engine today\n- **5 preview** are configurable in the designer (drag, wire, set properties); execution is being wired engine-by-engine\n- **16 planned** are reserved in the palette but not yet executable - see [`docs\u002Froadmap.md`](docs\u002Froadmap.md)\n\n---\n\n## Screenshots\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Freal-life-screenshot\u002F1.png\" alt=\"The Duckle visual designer with a CSV to Filter to Parquet pipeline\" width=\"100%\"\u002F>\n  \u003Cbr\u002F>\n  \u003Csub>The visual designer: source -> filter -> sink, with the generated SQL one click away on every node.\u003C\u002Fsub>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Freal-life-screenshot\u002F2.png\" alt=\"Component palette and schema autodetect\" width=\"49%\"\u002F>\n  \u003Cimg src=\"docs\u002Fassets\u002Freal-life-screenshot\u002F3.png\" alt=\"Parquet sink configuration in dark theme\" width=\"49%\"\u002F>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Csub>Left: component palette with one-click schema autodetect. Right: sink configuration with write modes, compression, and partitioning, in dark theme.\u003C\u002Fsub>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Freal-life-screenshot\u002Fduckie-ai-assistant.png\" alt=\"Duckie AI Assistant generating a Duckle pipeline from a natural-language description\" width=\"100%\"\u002F>\n  \u003Cbr\u002F>\n  \u003Csub>Duckie AI Assistant: describe the pipeline, get a one-click insert.\u003C\u002Fsub>\n\u003C\u002Fp>\n\n---\n\n## Capabilities\n\nDuckle is not a CSV tool with extras. 
It reads a broad set of formats and sources, ships a deep transform library, and writes to files, databases, object storage, vector DBs, message buses, and email.\n\n### Sources (74 available)\n\n| Group | Connectors | Status |\n|---|---|---|\n| **Files** | CSV, TSV, Parquet, JSON, JSONL \u002F NDJSON, Excel (.xlsx), YAML, TOML, Fixed-width (mainframe \u002F banking positional dumps), XML (slash-separated rowPath), Apache Avro (.avro \u002F .ocf, pure-Rust) | Available |\n| **Geospatial files** | GeoJSON, Shapefile, GeoPackage, KML, GPX, GML via the `spatial` extension | Available (lazy-loaded) |\n| **Lakehouse table formats** | Apache Iceberg, Delta Lake, DuckLake | Available |\n| **Embedded databases** | SQLite (read tables), DuckDB (read tables or run a query) | Available |\n| **Network relational DBs** | PostgreSQL, MySQL, MariaDB, CockroachDB | Available (live CI for PG + MySQL) |\n| **Network relational DBs** | SQL Server (TDS), Oracle (Instant Client at runtime), ClickHouse (HTTP API) | Available |\n| **Network relational DBs** | IBM DB2, generic JDBC | Planned |\n| **Object storage** | Amazon S3, Google Cloud Storage, Azure Blob, HTTP(S), MinIO, Cloudflare R2, Backblaze B2 | Available (live CI for MinIO) |\n| **Cloud warehouses** | MotherDuck, Snowflake (SQL API + PAT\u002FJWT), BigQuery, Redshift (postgres ATTACH), Databricks SQL (Statement Execution + chunk follow), Azure Synapse (TDS), **DuckDB Quack** (May 2026 remote protocol - HTTP on :9494, SECRET-based token auth) | Available |\n| **Streaming** | Apache Kafka \u002F Redpanda (pure-Rust `rskafka`), NATS JetStream, GCP Pub\u002FSub (REST + auto-ack), RabbitMQ (`lapin` AMQP), AWS Kinesis (HTTP + SigV4 - no AWS SDK) | Available |\n| **Streaming** | Pulsar, Event Hubs, multi-shard Kinesis | Planned |\n| **APIs and SaaS (REST)** | Salesforce, HubSpot, Pipedrive, Zendesk, Intercom, Stripe, QuickBooks, Xero, Shopify, Notion, Airtable, Asana, Trello, ClickUp, Monday.com, GitHub, GitLab, Linear, Jira, Slack, 
Discord, Telegram, Twilio, Mailchimp, SendGrid, Segment - thin pre-configured wrappers over `src.rest` \u002F `src.graphql` | Available |\n| **APIs (protocols)** | OData v4 (follows `@odata.nextLink`), SOAP \u002F generic XML APIs (XML response parsing with namespace local-name match) | Available |\n| **NoSQL and search** | MongoDB (official driver), Cassandra \u002F ScyllaDB (CQL), Elasticsearch \u002F OpenSearch (from+size + search_after), Redis (SCAN + GET), CouchDB (`_all_docs`), DynamoDB (HTTP + SigV4 - no AWS SDK; auto-unwraps typed attributes) | Available |\n| **Vector \u002F AI databases** | pgvector (postgres ATTACH), Qdrant (`\u002Fpoints\u002Fscroll`), Weaviate (`\u002Fv1\u002Fobjects`), Milvus (`\u002Fv1\u002Fvector\u002Fquery`) | Available |\n| **Vector \u002F AI databases** | Pinecone (no list-all-vectors API), Chroma, LanceDB | Preview |\n| **File transfer** | FTP \u002F FTPS (pure-Rust `suppaftp` with glob filter and base64-content per file) | Available |\n| **Mailbox** | IMAP (rustls TLS, `mail-parser`) - basic auth today, OAuth (gmail \u002F o365) on the roadmap | Available |\n| **Webhook listener** | Binds `127.0.0.1:port`, collects N inbound HTTP requests with a timeout, parses JSON-object \u002F JSON-array bodies into rows | Available |\n| **Desktop** | System clipboard (pure-Rust `arboard`, auto-detects JSON-array shape) | Available |\n| **Repos** | Git (commit log or file tree from a local working copy; shells out to system `git` CLI) | Available |\n\n### Transforms (126 available)\n\n| Group | Operations |\n|---|---|\n| **Fields** | Map (visual row mapper), Project \u002F Select, Cast, Rename, Add \u002F Drop \u002F Reorder Column, Coalesce, UUID v4 |\n| **Rows** | Filter (visual or raw SQL, with reject port), Distinct, Sample, Top N \u002F Limit, Sort, Skip, Top N per Group, Forward Fill, Backward Fill, Constant Fill |\n| **Aggregate** | Group By, Rollup, Cube, Count, Window Aggregate, Cumulative, Approx Quantile (t-digest), Approx Count 
Distinct (HyperLogLog) |\n| **Join** | Inner, Left, Right, Full Outer, Cross, Lookup, Semi, Anti, Spatial Join |\n| **Set operations** | Union, Union All, Intersect, Except \u002F Minus |\n| **Window** | Row Number, Rank, Dense Rank, Lead, Lag, First Value, Last Value, NTile |\n| **Strings** | Regex Replace, Regex Extract, Regex Match, Split, Concat, Trim, Case Change, Length, Substring, Format, Hash (md5 \u002F sha1 \u002F sha256), IP Parse, URL Parse, Text Similarity (Levenshtein \u002F Jaro-Winkler \u002F Jaccard), Base64, Pad, Text Match |\n| **Date \u002F Time** | Parse, Format, Extract Part, Date Diff \u002F Add, Truncate, Timezone Convert, Time Bin, Current Timestamp, Epoch Convert |\n| **Numeric** | Round, Modulo, Absolute, Logarithm, Power, Square Root, Bucketize, Z-Score, Clamp, Sign |\n| **JSON \u002F nested** | Parse, Stringify, Flatten, JSONPath Extract, Merge Objects, Array Aggregate |\n| **Array** | Explode \u002F Unnest, Collect List, Element At, Contains, Distinct, Length |\n| **Pivot \u002F shape** | Pivot, Unpivot, Denormalize, Normalize, Transpose |\n| **CDC \u002F SCD** | Diff Detect, SCD Type 1, SCD Type 2 (valid_from \u002F valid_to \u002F is_current), Merge \u002F Upsert, Row Hash (md5 \u002F sha1 \u002F sha256 fingerprint), Audit Stamp (`_loaded_at` \u002F `_loaded_date` \u002F `_source` \u002F `_batch_id`) |\n| **AI \u002F Search** | **Vector Similarity Search** (cosine \u002F L2 \u002F inner product over FLOAT[N] via `vss`), **Full-Text Search** (BM25 via `fts`), **Embeddings** (OpenAI-compatible `\u002Fv1\u002Fembeddings`), **LLM Transform** (per-row chat completion with `{column}` templates), **Classify** (LLM-backed, normalizes to UNKNOWN), **Text Chunker** (RAG-ready, pure local), **PII Redact** (regex - emails \u002F phones \u002F SSNs \u002F cards), **Semantic Dedupe** (cosine over precomputed embeddings) |\n| **Geospatial** | Spatial Distance (ST_Distance), Spatial Buffer (ST_Buffer), Spatial Intersects (ST_Intersects) |\n| 
**Debug** | Log Rows, Assert (hard-fail on SQL predicate violation) |\n\n> **All 6 AI transforms ship today.** Three need a model API (LLM, Classify, Embeddings) and ride the apiKey-in-props pattern; three are pure-local (Chunk, PII Redact, Dedupe).\n\n### Data quality (12 available)\n\nValidators split their input: passing rows continue on the main port, failures route to a **reject** port you can sink, count, or inspect.\n\n| Component | Behavior |\n|---|---|\n| **Not-Null Check** | Pass rows with no nulls in the chosen columns |\n| **Range Check** | Pass rows inside a numeric range (inclusive or exclusive) |\n| **Regex Match** | Pass rows whose column fully matches a pattern |\n| **Uniqueness Check** | Pass the first row per key; route duplicates to reject |\n| **Schema Validate** | Reject rows where any expected column is null |\n| **Column Profile** | Per-column stats (count, null %, distinct, min \u002F max, quartiles) via `SUMMARIZE` |\n| **Describe** | Column names + types of the input |\n| **Histogram** | Value frequencies for one column, most-frequent first |\n| **Standardize** | Trim + case-normalize + collapse inner whitespace, in place |\n| **Fuzzy Deduplicate** | Keep the first row per near-duplicate cluster |\n| **Record Match** | Self-join: emit pairs of rows above a similarity threshold |\n| **Address Cleanse** | Address parsing \u002F normalization (planned - needs external lib) |\n\n### Custom code (7 available)\n\n| Capability | What it does |\n|---|---|\n| **Inline SQL** | Write a `SELECT`; the upstream node is exposed as `input`, result runs as a real materialized stage |\n| **SQL Template** | Parameterized SQL with `${context.var}` substitution |\n| **SQL Routines** | Reusable, named SQL saved in the workspace |\n| **Shell** | Run any shell command; emits `{stdout, stderr, exit_code, duration_ms}`. Platform-aware default shell. Optional `timeoutMs` kills the child. |\n| **WebAssembly UDF** | Per-row WASM transform via pure-Rust `wasmi`. 
Sandboxed (no fs \u002F net \u002F env). Works with any WASM toolchain (Rust, AssemblyScript, C, TinyGo). |\n| **JavaScript UDF** | Per-row JS transform via pure-Rust `boa` interpreter. Sandboxed. Define a `transform(row)` function. |\n| **Python \u002F Rust UDFs** | Embedded-language stages (planned) |\n\n### Sinks (58 available)\n\n| Group | Connectors | Status |\n|---|---|---|\n| **Files** | CSV, TSV, Parquet (ZSTD), JSON, JSONL \u002F NDJSON, Excel (.xlsx), YAML, TOML, XML (configurable wrappers), Avro (schema inferred from first row). Parquet + CSV support Hive-partitioned writes | Available |\n| **Geospatial files** | GeoJSON, GeoPackage, Shapefile, KML, GPX via GDAL | Available (lazy-loaded) |\n| **Lakehouse** | Apache Iceberg (full table layout), DuckLake | Available |\n| **Embedded databases** | SQLite, DuckDB | Available |\n| **Network relational DBs** | PostgreSQL, MySQL, MariaDB, CockroachDB - modes: **overwrite**, **append**, **truncate**, **upsert** (ON CONFLICT \u002F ON DUPLICATE KEY) | Available (live CI for PG + MySQL) |\n| **Network relational DBs** | SQL Server \u002F Azure Synapse (TDS, multi-row VALUES batched), Oracle (Instant Client; INSERT ALL), ClickHouse (HTTP JSONEachRow) | Available |\n| **Network relational DBs** | IBM DB2, generic JDBC | Planned |\n| **Object storage** | S3, GCS, Azure Blob via DuckDB `httpfs` (MinIO \u002F R2 \u002F B2 via endpoint) | Available |\n| **Cloud warehouses** | MotherDuck, Snowflake (PAT or JWT RS256), BigQuery, Redshift, Databricks SQL, Azure Synapse, **DuckDB Quack** (concurrent writers to remote DuckDB via the May 2026 protocol) | Available |\n| **HTTP APIs** | REST (POST\u002FPUT\u002FPATCH batched JSON-array), Webhook (one POST per row), GraphQL mutations | Available |\n| **Email (SMTP)** | Per-row SMTP send via pure-Rust `lettre` + rustls. Plain text v1; HTML + attachments follow. 
| Available |\n| **NoSQL** | MongoDB (insert_many batched), Cassandra \u002F ScyllaDB (CQL), Elasticsearch \u002F OpenSearch (`_bulk` NDJSON), Redis (pipelined SET) | Available |\n| **NoSQL** | DynamoDB | Planned |\n| **Streaming** | Kafka \u002F Redpanda (`rskafka`), NATS JetStream, GCP Pub\u002FSub (REST + OAuth2), RabbitMQ (`lapin`) | Available |\n| **Streaming** | Pulsar, Kinesis | Planned |\n| **Vector \u002F AI databases** | pgvector, Pinecone (`\u002Fvectors\u002Fupsert`), Qdrant (`\u002Fpoints` PUT), Weaviate (`\u002Fv1\u002Fbatch\u002Fobjects`), Milvus (`\u002Fv1\u002Fvector\u002Finsert`) | Available |\n| **Vector \u002F AI databases** | Chroma, LanceDB | Preview (need vendor SDK) |\n\n### Control flow (14 available)\n\n| Component | What it does |\n|---|---|\n| **Replicate \u002F Tee** | Send the same data to multiple downstream outputs |\n| **Merge Streams** | Concatenate multiple input streams (UNION ALL) |\n| **Switch \u002F Conditional Split** | Route rows to `case_1..N` outputs by boolean (first match wins); `default` for unmatched |\n| **Wait \u002F Delay** | Sleep `N ms \u002F s \u002F min \u002F h` before passing rows through |\n| **Throttle** | Inter-stage delay derived from a rows-per-second target |\n| **Checkpoint** | Pass rows through and also write a parquet snapshot to a path |\n| **Dead Letter Queue** | Terminal sink for rejected rows (JSON \u002F CSV \u002F Parquet) |\n| **Run Pipeline** | Inline-execute another pipeline file (`ctl.runpipeline`) |\n| **Iterate** | Run a sub-pipeline N times with `${ITER_INDEX}` substitution |\n| **For Each** | Run a sub-pipeline once per input row with `${ITER_ITEM_\u003CFIELD>}` substitution |\n| **Try \u002F Catch** | Install a fallback sub-pipeline if the wrapped stage fails |\n| **Retry** | Per-stage retry policy (configure on Advanced tab) |\n| **Schedule** | Cron \u002F interval \u002F file-watch triggers via the orchestration crate |\n\n### Advanced settings (per-node)\n\nEvery node has an 
**Advanced** tab with fields the engine honours at run time:\n\n| Field | What it does |\n|---|---|\n| **Retry attempts** | Total tries on failure (1 = no retry). Sleeps `backoff * attempt` ms between attempts. |\n| **Retry backoff (ms)** | Inter-attempt sleep, linearly scaled by attempt index. |\n| **Memory limit (MB)** | `PRAGMA memory_limit` applied to this stage only. |\n| **Log row count** | Print the post-stage rowcount to the run output. |\n\n### Orchestration and workspace\n\n| Capability | What it does |\n|---|---|\n| **Run feedback** | Streaming run events light nodes up stage by stage, with per-node row counts, real mid-query cancel, and run history. |\n| **Schedules** | Cron, fixed-interval, and file-watch triggers, driven by an in-process scheduler. |\n| **Context variables** | Per-environment variables; bind any field to one via a Manual \u002F Context dropdown, or reference `${var}` inline. Resolved at run time. |\n| **Cloud credentials** | Saved S3 \u002F GCS \u002F Azure connections become DuckDB SECRETs; cloud reads \u002F writes go through `httpfs`. S3-compatible endpoints (MinIO \u002F R2 \u002F B2) supported via `ENDPOINT` + `URL_STYLE`. |\n| **Workspace** | Pipelines, connections, contexts, documents, and routines persist as plain JSON and Markdown files in a folder you choose. |\n\n---\n\n## Clean data before it reaches your AI\n\nModels inherit the quality of their inputs. RAG indexes, embedding stores, and training sets quietly accumulate duplicates, nulls, malformed rows, mixed encodings, and inconsistent schemas. 
Duckle is built to scrub that data before it lands in a vector store:\n\n- **Deduplicate** with exact Distinct, Uniqueness, and **Fuzzy Deduplicate** (Jaro-Winkler \u002F Levenshtein); use **Record Match** to find near-duplicate pairs with a similarity score\n- **Semantic dedupe** with `xf.ai.dedupe` over a precomputed embedding column\n- **Profile + describe** every column up front (Column Profile, Describe, Histogram) so issues surface before they reach a model\n- **Validate and filter** malformed, empty, or out-of-range records and route failures to a reject port\n- **Normalize** types, encodings, casing, and null handling across messy sources (Standardize, Cast, regex \u002F string transforms)\n- **Redact PII** (emails, phones, SSNs, credit cards) via `xf.ai.pii` before embedding\n- **Chunk + embed** long text via `xf.ai.chunk` -> `xf.ai.embed` for RAG indexing\n- **Classify** rows with an LLM (`xf.ai.classify` constrains the model to one of N user-supplied categories)\n- **Retrieve with both halves of hybrid search**, locally, no model API required: **Vector Similarity Search** (cosine \u002F L2 \u002F inner product) and **Full-Text Search** (BM25)\n- **Land it in your store** - pgvector ships, and **Pinecone**, **Qdrant**, **Weaviate**, **Milvus** all have working sinks that POST batches through each vendor's HTTP API\n\n---\n\n## Engines\n\nDuckle ships a thin shell and installs its engines on first launch.\n\n| Engine | Role | Status |\n|---|---|---|\n| **DuckDB** | Default execution engine: analytics, file formats, cloud reads, SQL pushdown. Tracking **v1.5.3** (latest stable). | Working |\n| **Duckie AI Assistant** | Local chat assistant via **llama.cpp** + **Qwen 2.5 Coder 1.5B GGUF**. Downloads ~1.1 GB; runs entirely offline once installed. Managed as a `llama-server` subprocess exposing an OpenAI-compatible API on `127.0.0.1`. 
| Installable |\n| **SlothDB** | Alternate embedded analytical engine ([SouravRoy-ETL\u002Fslothdb](https:\u002F\u002Fgithub.com\u002FSouravRoy-ETL\u002Fslothdb)), installed the same way and selectable per pipeline. | Installable |\n| **Native** | In-process Rust streaming \u002F incremental engine. | Planned |\n\n### First-launch extension pre-fetch\n\nWhen the installer downloads the DuckDB CLI it also pre-fetches the extensions Duckle uses, with per-extension progress, so the first time you touch a Postgres source or an Iceberg table there is no surprise network hop mid-pipeline:\n\n`httpfs` (S3 \u002F GCS \u002F HTTP), `azure` (Azure Blob native), `sqlite`, `postgres`, `mysql`, `excel`, `iceberg`, `delta`, `ducklake`, `vss`, `fts`.\n\n`spatial` is lazy-loaded (~50 MB GDAL bundle) - it installs on first use of a geospatial source\u002Fsink to keep the initial download small.\n\n---\n\n## Download \u002F Install\n\nPick the binary for your OS from the [latest release](https:\u002F\u002Fgithub.com\u002FSouravRoy-ETL\u002Fduckle\u002Freleases\u002Ftag\u002Fv0.1.0-hotfix):\n\n| OS | Asset | How to run |\n|---|---|---|\n| **Windows** | `Duckle-windows-x64.exe` | Double-click. Unsigned binary - Windows SmartScreen will warn the first time; click \"More info\" -> \"Run anyway\". |\n| **macOS** (Apple Silicon) | `Duckle-macos-arm64` | `chmod +x Duckle-macos-arm64 && .\u002FDuckle-macos-arm64`. Right-click -> Open the first time to bypass Gatekeeper. |\n| **Linux** (x86_64) | `Duckle-linux-x64` | `chmod +x Duckle-linux-x64 && .\u002FDuckle-linux-x64`. Requires WebKitGTK 4.1 (`libwebkit2gtk-4.1-0` on Debian \u002F Ubuntu). |\n\nThe binary is ~30 MB (Linux ~30, macOS ~24, Windows ~28). On first launch you'll be guided through downloading two engines into your app-data directory:\n\n| Engine | Size | Required? 
| What it powers |\n|---|---|---|---|\n| **DuckDB CLI** | ~30 MB + extensions | **Yes** - cannot run pipelines without it | Every source \u002F transform \u002F sink that runs as SQL |\n| **Duckie AI Assistant** | ~1.1 GB (llama-server + Qwen 2.5 Coder 1.5B GGUF) | Optional | The chat sidebar that generates pipelines from natural language |\n\nApp-data location:\n- Windows: `%APPDATA%\\io.duckle.app\\engines\\`\n- macOS: `~\u002FLibrary\u002FApplication Support\u002Fio.duckle.app\u002Fengines\u002F`\n- Linux: `~\u002F.config\u002Fio.duckle.app\u002Fengines\u002F`\n\nDelete the `engines\u002F` folder if you ever want to force a fresh install.\n\n---\n\n## Quickstart (60 seconds)\n\n1. **Download** the binary for your OS (see [Download \u002F Install](#download--install) above) - or [build from source](#build-from-source).\n2. **Launch it.** First run shows the setup modal:\n   - Click **Install** on DuckDB (required, takes ~30 s).\n   - Optionally click **Install** on Duckie AI Assistant (~1.1 GB, takes 5-10 min on average broadband).\n3. **Pick a workspace folder.** Pipelines, connections, context variables, and routines live there as plain files.\n4. **Build a pipeline two ways:**\n   - **Drag + wire**: drag a **CSV source** in, point it at [`samples\u002Forders.csv`](samples\u002Forders.csv), hit **Autodetect schema**. Drag a **Filter**, wire it up. Drag a **Parquet sink** with an output path. Press **Run**, watch the nodes light up.\n   - **Ask Duckie**: click the **Sparkles** icon (top-right of the toolbar), type *\"read orders.csv, filter where status = 'paid', write to paid.parquet\"*. When Duckie streams back a pipeline, click **Insert into canvas**.\n5. **Inspect.** Click any node to see its generated SQL in the **Plan** tab and a live row sample in the **Preview** tab.\n\nThat's a real, native ETL pipeline built and run in under a minute. 
CSV is just the easiest first node; swap in Parquet, JSON, S3, Snowflake, MongoDB, or Stripe the same way.\n\n---\n\n## Run your first pipeline\n\nA worked example using the bundled `samples\u002Forders.csv` data.\n\n### 1. Add a source\n\n- Open the **Components** sidebar (left). Click **Sources -> Files -> CSV**.\n- Drag it onto the canvas.\n- In the right-side Properties panel:\n  - **Path**: browse to `samples\u002Forders.csv`\n  - Click **Autodetect schema** - the **Schema** tab fills in column types from the file, the **Preview** tab shows the first 20 rows.\n\n### 2. Add a transform\n\n- **Components -> Transforms -> Rows -> Filter**. Drag onto canvas.\n- Wire the CSV source's `main` output port to the Filter's `main` input.\n- In Properties:\n  - **Predicate**: `status = 'paid'` (you can write raw SQL or use the visual builder)\n  - Filter has two output ports: `pass` (rows matching) and `reject` (rows that don't).\n\n### 3. Add a sink\n\n- **Components -> Sinks -> Files -> Parquet**.\n- Wire Filter's `pass` port to the Parquet sink.\n- **Path**: `paid_orders.parquet`. **Write mode**: `overwrite`. **Compression**: `zstd`.\n\n### 4. Run it\n\n- Press **Run** in the toolbar. Nodes light up in execution order; row counts appear under each.\n- Open the **Output** tab (bottom panel) to see per-stage timing.\n- Click any node to inspect generated SQL in **Plan** + sampled rows in **Preview**.\n\n### 5. Iterate\n\n- Add a **Group By** before the sink to aggregate. Re-run. Sub-second on small data.\n- Cancel mid-run with the **Stop** button - the DuckDB process is killed cleanly.\n- Save your work: **Cmd\u002FCtrl-S** writes a JSON pipeline file to your workspace folder.\n\n---\n\n## How to use Duckle\n\nA wider tour of the workflow.\n\n| Step | What you do | Where to look |\n|---|---|---|\n| **1. Sources** | Drag a source, point it at a file \u002F DB \u002F cloud URL \u002F SaaS endpoint. Click **Autodetect schema** to read columns + a sample. 
| [Sources reference](#sources-74-available) |\n| **2. Transforms** | Wire transforms to source output ports. Configure in the Properties panel. **Preview** tab shows live rows; **Plan** tab shows generated SQL. | [Transforms reference](#transforms-126-available) |\n| **3. Data quality** | Drop in a validator (Not-Null, Range, Regex, Uniqueness). Passing rows continue on the main port; failures route to the **reject** port. | [Data quality reference](#data-quality-12-available) |\n| **4. Sinks** | Finish with a sink (file, DB, cloud, vector DB, message bus, email). Set write mode (overwrite, append, truncate, upsert). | [Sinks reference](#sinks-58-available) |\n| **5. Run** | Press **Run** to execute on DuckDB. Nodes light up stage by stage; **Output** + **Console** show row counts, timing, errors. Stop button kills mid-run. | [Run feedback](#orchestration-and-workspace) |\n| **6. Ask Duckie** | For anything you can describe in English, the AI assistant can sketch a pipeline. Iterate by editing the graph or asking follow-ups. | [Meet Duckie](#meet-duckie---the-local-ai-pipeline-assistant) |\n| **7. Reuse** | Save Connections, Context variables, and SQL Routines in the workspace; reference `${context.var}` in any field. Everything persists as plain files. | [Workspace and Git flow](#workspace-and-git-flow) |\n| **8. Schedule** | Attach a cron, interval, or file-watch trigger to run a pipeline automatically. | [Schedules and triggers](#schedules-and-triggers) |\n\n---\n\n## Recipes and examples\n\nReady-to-adapt patterns. Each one is a few nodes you wire on the canvas (or ask Duckie to sketch).\n\n### CSV cleanup\n\n> \"Read orders.csv, drop nulls, deduplicate by order_id, write to orders_clean.parquet\"\n\n```\nsrc.csv -> qa.not_null -> qa.uniqueness -> snk.parquet\n```\n\nSet `qa.not_null` to the columns that must be present; set `qa.uniqueness` to `order_id`. 
Rejected rows go to a `snk.csv` on the `reject` port for inspection.\n\n### Postgres -> Snowflake nightly load\n\n> \"Read all rows from Postgres `events`, upsert into Snowflake table `analytics.events` on `event_id`\"\n\n```\nsrc.postgres -> snk.snowflake (mode=upsert, conflict=event_id)\n```\n\nAttach a `ctl.schedule` with cron `0 2 * * *` to run nightly at 02:00.\n\n### S3 -> partitioned Parquet\n\n> \"Read all .json.gz files in `s3:\u002F\u002Flogs\u002F2026\u002F*\u002F*.json.gz`, parse, write Hive-partitioned by `event_date`\"\n\n```\nsrc.s3 (glob, autodetect json.gz)\n  -> xf.derive (event_date = CAST(ts AS DATE))\n  -> snk.parquet (path=out\u002F, partitionBy=event_date, mode=overwrite_or_ignore)\n```\n\n### RAG ingestion\n\n> \"Chunk our docs, embed with OpenAI, dedupe near-identicals, store in pgvector\"\n\n```\nsrc.s3 (markdown files)\n  -> xf.ai.chunk (chunkSize=1500, overlap=150)\n  -> xf.ai.pii (redact)\n  -> xf.ai.embed (model=text-embedding-3-small, baseUrl=https:\u002F\u002Fapi.openai.com)\n  -> xf.ai.dedupe (threshold=0.95)\n  -> snk.pgvector (table=docs)\n```\n\n### Slack channel digest\n\n> \"Pull yesterday's Slack messages from #support, classify by sentiment, email a summary\"\n\n```\nsrc.slack (channels.history with oldest=yesterday)\n  -> xf.ai.classify (categories=positive,negative,neutral)\n  -> xf.aggregate (group by sentiment, count)\n  -> snk.email (to=oncall@..., subject=Daily Support Digest)\n```\n\n### Webhook -> S3 archive\n\n> \"Receive 100 webhooks, archive each one as JSON in S3\"\n\n```\nsrc.webhook (port=8080, maxRequests=100, timeoutMs=300000)\n  -> snk.s3 (path=s3:\u002F\u002Farchive\u002Fevents\u002F, format=jsonl, partitionBy=event_date)\n```\n\n### Git commit-log analytics\n\n> \"Build a dashboard of who's been committing what in the last 30 days\"\n\n```\nsrc.git (mode=log, maxRows=10000)\n  -> xf.filter (date > current_date - INTERVAL '30 days')\n  -> xf.aggregate (group by author_email, count)\n  -> snk.csv 
(path=author-stats.csv)\n```\n\nMore examples live in [`samples\u002F`](samples) - drop the pipeline files into a workspace and open them.\n\n---\n\n## Git integration (GitHub + GitLab)\n\n> Push, pull, branch, and watch CI from inside Duckle. No terminal required.\n\nClick the **Git icon** in the topbar to open the workspace Git panel. Talend-style integration with GitHub and GitLab, built on the system `git` CLI (no FFI, no embedded git library):\n\n| Feature | What it does |\n|---|---|\n| **Status snapshot** | Current branch, ahead\u002Fbehind counts, list of modified \u002F staged \u002F untracked \u002F conflicted files |\n| **Stage all + commit** | One-click `git add -A && git commit -m \"...\"` with your message |\n| **Push \u002F Pull** | `git push` and `git pull --ff-only` against `origin`. The button stays disabled when there's nothing to push |\n| **Branch list, switch, create** | Lists local branches; click to switch; create new branches inline |\n| **Remote URL config** | Add or change `origin` URL from inside the panel - auto-detects GitHub vs GitLab from the host |\n| **PAT-prompt fallback** | First tries `git push` using your system credential helper (GitHub CLI, osxkeychain, manager-core). On a 401, prompts for a Personal Access Token, saves it AES-encrypted in `\u003Cworkspace>\u002F.duckle\u002Fsecrets\u002Fgit.json` (auto-gitignored), retries with the token injected into the HTTPS URL |\n| **CI build badge in topbar** | Polls GitHub Actions or GitLab CI every 30 s for the latest pipeline on your current branch. Shows green \u002F red \u002F yellow \u002F gray. 
Click to open the build in your browser |\n\n**Workflow.** Workspaces are plain folders (see [Workspace and Git flow](#workspace-and-git-flow)) - any standard Git workflow works:\n\n```\nCreate \u002F clone -> open in Duckle -> edit pipelines -> commit + push -> \nPR \u002F MR -> CI runs your pipeline tests -> merge -> pull\n```\n\nYou can do the entire push \u002F pull \u002F merge loop without leaving Duckle. Heavy operations (interactive rebase, conflict resolution, log archaeology) still live in your terminal or external Git tool - the panel is designed for the everyday flow, not as a full Git replacement.\n\n**Provider detection.** The remote URL host determines which CI API the badge polls:\n\n| Provider | CI source | API |\n|---|---|---|\n| `github.com` | GitHub Actions | `GET \u002Frepos\u002F{owner}\u002F{repo}\u002Factions\u002Fruns` |\n| `gitlab.com` or self-hosted GitLab | GitLab CI | `GET \u002Fapi\u002Fv4\u002Fprojects\u002F{id}\u002Fpipelines` |\n| Other \u002F bitbucket | (no CI badge for now) | - |\n\nThe badge uses the same PAT you saved for pushes - no separate auth step.\n\n---\n\n## Workspace and Git flow\n\nA workspace is a folder you pick on first launch. 
Everything you build lives there as plain text:\n\n```\nmy-workspace\u002F\n  pipelines\u002F\n    orders_etl.pipeline.json     # the node graph\n    nightly_load.pipeline.json\n  connections\u002F\n    prod-postgres.connection.json # saved DB credentials (encrypted)\n    snowflake-analytics.connection.json\n  contexts\u002F\n    dev.context.json              # variables for dev environment\n    prod.context.json\n  routines\u002F\n    cleanse-addresses.sql         # reusable SQL snippets\n  documents\u002F\n    runbook.md                    # plain-Markdown docs\n  schedules.json                  # all scheduled runs in this workspace\n  run-history\u002F\n    orders_etl\u002F                   # one folder per pipeline\n      2026-05-25T14-30-00.json    # one file per run\n```\n\n**Git-friendly by design.** Every file is human-readable JSON, SQL, or Markdown. Standard workflows work:\n\n```bash\ngit init my-workspace && cd my-workspace\ngit add . && git commit -m \"Initial pipelines\"\n\n# Pull a teammate's update\ngit pull --rebase\n\n# Push your changes\ngit push\n\n# Branch for a risky migration\ngit checkout -b feature\u002Fupsert-mode\n# ...edit pipelines in Duckle...\ngit diff       # readable JSON diffs\ngit push -u origin feature\u002Fupsert-mode\n# open PR \u002F MR\n```\n\n**Sensitive values** in connections get encrypted with a workspace-local key (`workspace\u002F.duckle\u002Fkeys\u002F`). Don't commit that folder - add `**\u002F.duckle\u002Fkeys\u002F` to `.gitignore`. The connection JSON files themselves hold only ciphertext for those values, so they are safe to commit.\n\n---\n\n## Schedules and triggers\n\nPipelines can run on cron, fixed interval, or file-watch triggers. 
Configure these in the **Schedule panel** (toolbar -> Schedule icon), not as graph nodes.\n\n| Trigger type | Config | Example |\n|---|---|---|\n| **Cron** | Standard 5-field cron expression with optional timezone | `0 2 * * *` (every day at 2 AM) |\n| **Interval** | `every N {seconds, minutes, hours, days}` | `every 15 minutes` |\n| **File watch** | Watch a directory for new\u002Fchanged files matching a glob | `\u002Finbox\u002F*.csv` |\n| **Manual** | Run-on-demand only (the default) | - |\n\nSchedules persist to `workspace\u002Fschedules.json` and execute via the in-process scheduler crate. They survive app restarts but require Duckle to be running.\n\nFor headless \u002F always-on schedules, the planned approach is to have a system cron \u002F systemd timer \u002F Windows Scheduled Task invoke:\n\n```bash\nduckle run --workspace ~\u002Fdata --pipeline orders_etl\n```\n\n(CLI run mode is not available yet: it is a planned 1.0 feature, tracked in [docs\u002Froadmap.md](docs\u002Froadmap.md).)\n\n---\n\n## Connection management\n\nSaved connections become DuckDB secrets at runtime so credentials never leak into the pipeline JSON.\n\n| Type | Stored fields | Used by |\n|---|---|---|\n| **PostgreSQL \u002F MySQL \u002F etc.** | host, port, user, password, database, ssl mode | `src.postgres`, `snk.postgres`, ... |\n| **Snowflake** | account, user, role, warehouse, PAT or JWT private key | `src.snowflake`, `snk.snowflake` |\n| **S3 \u002F GCS \u002F Azure** | access key, secret, region (or service-account JSON) | All cloud sources\u002Fsinks via `httpfs` |\n| **MotherDuck \u002F Databricks \u002F BigQuery** | token, workspace URL | Respective sources\u002Fsinks |\n| **Generic REST \u002F SaaS** | base URL, auth scheme (Bearer \u002F API key \u002F Basic), token, custom headers | All REST aliases |\n\nConnections live in `workspace\u002Fconnections\u002F` as JSON. 
The token\u002Fpassword field is encrypted with the workspace key; the rest is plain text.\n\nTo use a connection in a pipeline, the Properties panel of any compatible source\u002Fsink shows a **Connection** dropdown - pick one and the fields auto-fill.\n\n---\n\n## Context variables\n\nBind any field to a context variable that resolves at run time. Useful for `dev` vs `prod`, per-environment paths, secrets injected from CI, etc.\n\nIn a context file (`workspace\u002Fcontexts\u002Fprod.context.json`):\n\n```json\n{\n  \"name\": \"prod\",\n  \"vars\": {\n    \"DB_HOST\": \"db.internal.acme.com\",\n    \"S3_BUCKET\": \"acme-prod-data\",\n    \"BATCH_SIZE\": \"10000\"\n  }\n}\n```\n\nIn the Properties panel of any node, switch a field from **Manual** to **Context** and pick `DB_HOST`. Or inline-reference one with `${DB_HOST}` in a string field.\n\nPick the active context from the topbar's **Context** dropdown. Switch contexts and re-run without editing the pipeline.\n\n---\n\n## Build from source\n\n**Prerequisites**\n\n- [Rust](https:\u002F\u002Frustup.rs\u002F) (stable)\n- [Node.js](https:\u002F\u002Fnodejs.org\u002F) 18+ and npm\n- [`cargo-tauri`](https:\u002F\u002Ftauri.app\u002F) CLI: `cargo install tauri-cli --version \"^2\"`\n- Platform webview dependencies per the [Tauri prerequisites](https:\u002F\u002Ftauri.app\u002Fstart\u002Fprerequisites\u002F). 
WebView2 is preinstalled on Windows 10 and 11.\n\n**Clone and install**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FSouravRoy-ETL\u002Fduckle\ncd duckle\nnpm --prefix frontend install\n```\n\n**Run in development** (hot-reloading frontend plus the native shell):\n\n```bash\ncargo tauri dev\n```\n\n**Build a release binary:**\n\n```bash\n# The --features custom-protocol flag is required: without it, tauri-codegen\n# embeds the dev URL instead of the bundled frontend.\ncargo build --release --manifest-path apps\u002Fdesktop\u002FCargo.toml --features custom-protocol\n```\n\nOutputs land in `target\u002Frelease\u002Fduckle` (or `duckle.exe`). The engine is not statically linked: DuckDB downloads at first launch, which is why the build is fast and the binary is tiny.\n\n**Run the tests:**\n\n```bash\ncargo test                                                          # workspace unit + plan tests\nDUCKLE_DUCKDB_BIN=\u002Fpath\u002Fto\u002Fduckdb cargo test -p duckle-duckdb-engine # full integration suite\n```\n\n---\n\n## Architecture\n\n```\nduckle\u002F\n  apps\u002Fdesktop\u002F         Tauri 2 shell: Tauri commands, engine installer, llama runtime, window\n  frontend\u002F             React 19 + Vite + TypeScript: the designer UI + chat panel\n  crates\u002F\n    duckdb-engine\u002F      Compiles the node graph to SQL and drives the DuckDB CLI\n    slothdb-engine\u002F     SlothDB adapter\n    scheduler\u002F          Cron \u002F interval \u002F file-watch triggers\n    metadata\u002F           Schema and type model\n    plugin-sdk\u002F         Connector \u002F inspector traits\n    connectors\u002F         Source and sink connectors\n    runtime, workflow-engine, transform-engine, stream-engine, execution-core\n```\n\n- The **frontend** (React with [@xyflow\u002Freact](https:\u002F\u002Freactflow.dev\u002F)) is the visual designer; it talks to the Rust core over Tauri commands.\n- **duckdb-engine** topologically sorts the graph, lowers each node into 
SQL, and executes by shelling out to the downloaded DuckDB CLI. Non-sink nodes materialize as tables so later stages can reference them; sinks become `COPY ... TO` statements; cancel kills the process. No statically linked database, so the binary stays small.\n- **Duckie** is a `llama-server` subprocess on `127.0.0.1` exposing an OpenAI-compatible chat-completions API. The chat panel streams from it via SSE. The model is sandboxed: no fs, no net, no tools - it can only emit text.\n- **Everything persists** to the workspace folder you choose, as plain JSON and Markdown files.\n\n---\n\n## Configuration\n\nA few knobs you can set without touching code.\n\n| Setting | Where | Effect |\n|---|---|---|\n| **Theme** | Topbar sun\u002Fmoon toggle | Light \u002F dark, persisted to `localStorage` |\n| **Workspace** | Topbar workspace pill -> Switch | Change the folder Duckle reads\u002Fwrites to |\n| **Active engine** | Topbar engine selector | DuckDB (default) or SlothDB - per-pipeline |\n| **Active context** | Topbar context dropdown | Switches which context variables resolve at run time |\n| **AI Assistant baseURL** | `xf.ai.llm` \u002F `xf.ai.embed` \u002F `xf.ai.classify` props | Point at any OpenAI-compatible endpoint (default: Duckie's local llama-server) |\n| **Per-stage retry** | Properties panel -> Advanced tab | Total attempts + linear-scaled backoff per stage |\n| **Per-stage memory cap** | Properties panel -> Advanced tab | `PRAGMA memory_limit` applied just to that stage |\n| **DuckDB extensions** | Pre-fetched at install; lazy-loaded for `spatial` | See [First-launch extension pre-fetch](#first-launch-extension-pre-fetch) |\n| **Env var `RUST_LOG`** | Before launching the binary | `RUST_LOG=debug duckle.exe` to see verbose engine logs |\n| **Env var `DUCKLE_DUCKDB_BIN`** | Before running engine tests | Points the integration test suite at a DuckDB CLI |\n\n---\n\n## Performance tips\n\nA few patterns that consistently produce sub-second runs at small \u002F 
medium data scale, and tractable runs at warehouse scale.\n\n| Tip | Why |\n|---|---|\n| **Use Parquet, not CSV, for intermediate steps** | Columnar + compressed; DuckDB reads only the columns the next stage needs. CSV is fine for source \u002F sink at the edges. |\n| **Push filters as early as possible** | `xf.filter` early in the graph compiles to a `WHERE` that runs at scan time, not a post-scan filter. |\n| **Use the `vss` + `fts` indexes** | Vector + full-text search hit DuckDB extensions directly. Faster than the alternative of pulling data out and indexing in Python. |\n| **Avoid per-row API calls when batch APIs exist** | `xf.ai.embed` batches up to 100 inputs per request; `snk.rest` defaults to one batched request. Per-row patterns (`xf.ai.llm`, `snk.webhook`) are slower by design - use them when you actually need per-row behavior. |\n| **Cap heavy aggregates with the per-stage memory limit** | Properties panel -> Advanced -> Memory limit (MB) prevents one big GROUP BY from blowing through all of RAM. |\n| **Use `ctl.checkpoint` for long-running pipelines** | A checkpoint stage writes a Parquet snapshot to a path you choose, so a future run can resume from there with `src.parquet`. |\n| **Disable `xf.debug.log` in prod** | Logging rows is per-row I\u002FO; fine for dev, costly at scale. |\n| **Sort once at the end, not in the middle** | `xf.sort` is a global sort; doing it once before the sink avoids re-sorting downstream. |\n\n---\n\n## FAQ\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Is Duckle free? What's the license?\u003C\u002Fb>\u003C\u002Fsummary>\n\nYes, free + open source. Dual-licensed **MIT OR Apache-2.0**. You can use it commercially, fork it, sell what you build with it. No usage limits, no telemetry.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Does Duckle send my data anywhere?\u003C\u002Fb>\u003C\u002Fsummary>\n\nNo. The app runs entirely on your machine. 
The engines (DuckDB, llama.cpp) are downloaded from official upstream releases on first launch and then run locally. The only network calls Duckle makes on your behalf are the ones your pipelines explicitly make (e.g. a `src.s3` reading from your S3 bucket, or `xf.ai.embed` if you configure it to hit OpenAI).\n\nDuckie AI Assistant runs **fully offline** once the model is downloaded.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>How much data does this work well on?\u003C\u002Fb>\u003C\u002Fsummary>\n\nDuckDB is excellent on data that fits on one machine - tens of GB on a laptop, hundreds on a workstation. Beyond that, point Duckle's output at a warehouse \u002F lakehouse that scales horizontally. Duckle is honest about being single-machine.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Do I need DuckDB installed first?\u003C\u002Fb>\u003C\u002Fsummary>\n\nNo - Duckle downloads it for you on first launch. The download is ~30 MB and includes the most-used extensions (httpfs, postgres, mysql, iceberg, delta, vss, fts, etc.) so the first time you touch a Postgres source there's no mid-pipeline network pause.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>How big is the binary, exactly?\u003C\u002Fb>\u003C\u002Fsummary>\n\nAbout 24-30 MB depending on platform (Linux 30 MB, macOS 24 MB, Windows 28 MB). The engines aren't statically linked - DuckDB (~50 MB with extensions) and the Duckie LLM (~1.1 GB for the Qwen GGUF) both download on first launch with a guided installer into your app-data folder. So the actual app download stays small, updates stay fast, and engines update independently.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Can I use OpenAI \u002F Cohere \u002F Voyage instead of the local Duckie?\u003C\u002Fb>\u003C\u002Fsummary>\n\nYes. The AI transforms (`xf.ai.embed`, `xf.ai.llm`, `xf.ai.classify`) accept a `baseUrl` prop. 
Point it at any OpenAI-compatible `\u002Fv1\u002F...` endpoint, supply an `apiKey`, and Duckle uses that instead. The local Duckie chat panel is hardwired to localhost; the pipeline AI transforms are configurable.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Where does my pipeline data live?\u003C\u002Fb>\u003C\u002Fsummary>\n\nIn the workspace folder you pick on first launch (see [Workspace and Git flow](#workspace-and-git-flow)). Pipelines are plain JSON files you can commit to Git, diff, branch, and review.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Can multiple people collaborate on the same workspace?\u003C\u002Fb>\u003C\u002Fsummary>\n\nVia Git, yes - check the workspace into a repo and use standard branch\u002FPR flows. Duckle does not have a real-time multiplayer mode (single-machine by design).\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Can I run pipelines headlessly \u002F from CI?\u003C\u002Fb>\u003C\u002Fsummary>\n\nCLI run mode (`duckle run --workspace ~\u002Fdata --pipeline orders_etl`) is on the 1.0 roadmap. Today you can run pipelines via the desktop app on a schedule, or by importing the engine crate (`duckle-duckdb-engine`) into your own Rust binary.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Is the Duckie AI assistant any good?\u003C\u002Fb>\u003C\u002Fsummary>\n\nFor 90% of common pipelines (read source -> simple transforms -> sink), yes - the Qwen 2.5 Coder model is tuned for structured-JSON generation. For long, complex pipelines you'll likely want to iterate: describe the first half, click **Insert into canvas**, then ask for the next half. You can also swap the model: point `xf.ai.llm`'s `baseUrl` at GPT-4 or Claude for more capable pipeline drafting.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Does the Duckie panel need internet after install?\u003C\u002Fb>\u003C\u002Fsummary>\n\nNo. 
Once `llama-server` and the Qwen GGUF are downloaded into your app-data directory, Duckie runs fully offline. Tested by turning off Wi-Fi and asking it for a pipeline - works fine.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Why DuckDB and not Polars \u002F Apache Spark \u002F X?\u003C\u002Fb>\u003C\u002Fsummary>\n\nDuckDB's SQL surface is wide enough to express most ETL work, it's vectorized and fast on a laptop, it has first-class Iceberg\u002FDelta\u002FParquet readers, and its extension model lets us add vector + full-text + Postgres ATTACH without code changes. Polars is great but doesn't ship the cloud\u002Fformat\u002Fextension breadth we need; Spark is a great cluster engine but overkill for the local-first niche we're in.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>How do I contribute a new connector?\u003C\u002Fb>\u003C\u002Fsummary>\n\nSee the [Contributing](#contributing) section and `crates\u002Fduckdb-engine\u002Fsrc\u002Fplan.rs` (planner branch) + `crates\u002Fduckdb-engine\u002Fsrc\u002Flib.rs` (executor). The shortest path: copy an existing connector with similar shape (e.g. `src.rabbit` for a streaming source, `src.dynamodb` for an HTTP+auth API), adapt, add a test, flip the palette tile.\n\n\u003C\u002Fdetails>\n\n---\n\n## Troubleshooting\n\n| Symptom | Likely cause | Fix |\n|---|---|---|\n| **Window opens but content shows \"localhost refused to connect\"** | Release binary built without `--features custom-protocol` (the v0.0.7 bug) | Rebuild with `cargo build --release --features custom-protocol` per [Build from source](#build-from-source). The release workflow already passes this flag. 
|\n| **\"DuckDB CLI not found\"** on Run | First-launch installer was skipped or interrupted | Open the engine setup modal from the toolbar; click Install on DuckDB |\n| **\"Couldn't download Duckie AI Assistant (HTTP 404)\"** | Pinned llama.cpp build temporarily unavailable from upstream | Bump `LLAMACPP_BUILD` in `apps\u002Fdesktop\u002Fsrc\u002Fengine_manager.rs` to a recent stable, rebuild |\n| **Linux: app won't launch, missing libwebkit** | WebKitGTK 4.1 isn't installed | `sudo apt install libwebkit2gtk-4.1-0` (Debian\u002FUbuntu) or your distro's equivalent |\n| **macOS: \"App can't be opened because Apple cannot check it\"** | Gatekeeper, unsigned binary | Right-click the binary -> Open -> Open Anyway |\n| **Pipeline runs but a connector errors with \"extension not loaded\"** | Lazy-loaded extension (e.g. `spatial`) downloaded mid-run and failed | Run `duckdb :memory: -c \"INSTALL spatial; LOAD spatial;\"` from a terminal to pre-install; relaunch Duckle |\n| **Chat panel says \"AI engine not registered\"** | Old version of Duckle before AI shipped (pre-v0.0.10) | Update to latest release |\n| **Duckie generates a pipeline but Insert doesn't put anything on the canvas** | Active pipeline tab has been closed; nothing to insert into | Open a pipeline (or create a new one) before clicking Insert |\n| **MotherDuck \u002F Snowflake auth fails** | Token expired, or PAT lacks the role you're trying to use | Regenerate in the vendor UI; paste into the Connection in Duckle |\n| **Postgres `ATTACH` says \"could not connect\"** | Local SSL mode mismatch | Connection -> Advanced -> set SSL mode to `disable` for localhost \u002F `require` for production |\n| **AI tests skip with no failure** | `DUCKLE_DUCKDB_BIN` isn't set | `export DUCKLE_DUCKDB_BIN=\u002Fpath\u002Fto\u002Fduckdb` before `cargo test` |\n\nIf you see something not listed, please [open an issue](https:\u002F\u002Fgithub.com\u002FSouravRoy-ETL\u002Fduckle\u002Fissues) with steps to reproduce + the relevant 
log line.\n\n---\n\n## CI \u002F CD\n\nDuckle's CI pipeline runs on **both GitHub and GitLab** - the project mirrors to both. Push \u002F pull-request \u002F merge-request \u002F tag events all trigger builds.\n\n| Trigger | GitHub Actions | GitLab CI |\n|---|---|---|\n| **Push to main or feature branch** | `.github\u002Fworkflows\u002Fci.yml` | `.gitlab-ci.yml` (`test` + `desktop-build` stages) |\n| **Pull request \u002F merge request** | `.github\u002Fworkflows\u002Fci.yml` | `.gitlab-ci.yml` (same stages, `rules:` gate on MR events) |\n| **Tag `v*`** | `.github\u002Fworkflows\u002Frelease.yml` | `.gitlab-ci.yml` (`release` stage; uploads binaries to GitLab Releases) |\n\nWhat each pipeline does:\n\n1. **Frontend** - `npm ci` + `npm run build` (type-check + bundle)\n2. **Rust test matrix** - `cargo test --workspace` on Linux + macOS + Windows\n3. **Live-service integration tests** - PostgreSQL + MySQL + MinIO services spun up via Docker, real connector code runs against them\n4. **Desktop release-build smoke check** - `cargo build --release --features custom-protocol` then grep the binary for the embedded frontend JS chunk (catches the v0.0.7-class \"binary loads devUrl\" bug at PR time)\n5. **Format + clippy** - informational (does not block merge)\n6. **On tag**: build the Duckle binary on all three OSes, upload as release assets\n\nSee [`.github\u002Fworkflows\u002F`](.github\u002Fworkflows\u002F) and [`.gitlab-ci.yml`](.gitlab-ci.yml) for the exact steps. The two pipelines are kept feature-equivalent so contributors can fork to either platform.\n\n### Releasing a new version\n\n```bash\n# 1. Bump version in apps\u002Fdesktop\u002Ftauri.conf.json\n# 2. Commit\ngit commit -am \"Release: bump to vX.Y.Z\"\n# 3. Tag + push\ngit tag vX.Y.Z\ngit push origin main vX.Y.Z\n# Both GitHub Actions and GitLab CI pick up the tag and build the\n# release artifacts automatically. 
Once green, the draft release on\n# GitHub gets the binaries uploaded; un-draft + mark Latest with:\ngh release edit vX.Y.Z --draft=false --latest\n```\n\n---\n\n## Roadmap\n\nA complete planned-component breakdown lives in [`docs\u002Froadmap.md`](docs\u002Froadmap.md). Highlights:\n\n- [ ] **Multi-shard Kinesis** and **Pulsar** streaming (Pulsar blocked on `protoc` at build time)\n- [ ] **Apache ORC** read \u002F write (blocked on the Arrow version conflict between `orc-rust` and our workspace pin)\n- [ ] **SFTP** source (blocked on the `aws-lc-sys` NASM build dep in `russh`)\n- [ ] **OAuth-heavy SaaS** (Google Sheets, Excel Online, full Salesforce OAuth, Gmail \u002F O365 IMAP)\n- [ ] **Embedded Python \u002F Rust** code stages (current code.* family: SQL, Shell, JavaScript, WebAssembly all ship)\n- [ ] **Hosted documentation site**\n- [ ] **Plugin marketplace** via the connector SDK\n- [ ] **In-process Native engine** - a Rust streaming \u002F incremental executor as an alternative to shelling out to the DuckDB CLI\n\n---\n\n## Contributing\n\nContributions, issues, and ideas are welcome. Duckle is young and there is a lot of green field. Open an issue to discuss a change before a large PR, match the existing code style, and keep changes focused. Run `cargo test` and `npm --prefix frontend run build` before submitting. 
See [CONTRIBUTING.md](CONTRIBUTING.md).\n\n---\n\n## License\n\nLicensed under either of **MIT** or **Apache-2.0** at your option.\n\n---\n\n\u003Cdiv align=\"center\">\n\u003Csub>Built with Rust, Tauri, React, and DuckDB by \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSouravRoy-ETL\">Sourav Roy\u003C\u002Fa>\u003C\u002Fsub>\n\u003C\u002Fdiv>\n\n\u003C!-- Suggested GitHub topics: etl, elt, data-engineering, data-pipeline, duckdb, rust, tauri, react, typescript, local-first, embedded, drag-and-drop, data-cleaning, vector-database, ai, ai-assistant, llm, llama-cpp, qwen, desktop-app, no-code, low-code, sql, pipeline-builder -->\n","Duckle is a local-first ETL\u002FELT studio: a drag-and-drop visual pipeline designer that compiles to SQL and runs on DuckDB. Core features include 290+ connectors, 50+ transforms, a built-in scheduler, and Duckie, an AI assistant that runs entirely on the local CPU. The project is written in Rust and builds its desktop app with Tauri and React. The app is small (about 30 MB) and needs no cloud services or servers, which suits privacy-conscious work that still calls for efficient data processing, such as data integration and analysis for individual developers and small teams.",2,"2026-06-11 03:57:39","CREATED_QUERY"]