[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79942":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":43,"readmeContent":44,"aiSummary":45,"trendingCount":16,"starSnapshotCount":16,"syncStatus":46,"lastSyncTime":47,"discoverSource":48},79942,"arnio","im-anishraj\u002Farnio","im-anishraj","C++ accelerated data quality toolkit for Python: CSV parsing, cleaning, schema validation, profiling, and pandas integration.","https:\u002F\u002Farniolib.vercel.app\u002F",null,"Python",90,403,3,248,0,1,7,49.02,"MIT License",false,"main",true,[25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42],"cpp","csv","csv-parser","data-cleaning","data-engineering","data-quality","data-science","good-first-issue","gssoc","gssoc-2026","gssoc26","high-performance","pandas","pybind11","pypi","python","python-library","schema-validation","2026-06-12 04:01:25","\u003Cdiv align=\"center\">\n\n\u003Cbr>\n\n\u003Cpicture>\n  \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"final-icon-dark.svg\">\n  \u003Cimg alt=\"Arnio\" src=\"final-icon-light.svg\" width=\"280\">\n\u003C\u002Fpicture>\n\n\u003Cbr>\u003Cbr>\n\n### Fast data preparation for the Python data stack.\n\n\u003Cbr>\n\n**Arnio** is a compiled C++ data preparation engine for messy CSV and pandas workflows.\u003Cbr>\nIt parses, infers types, strips whitespace, deduplicates, validates, and profiles data —\u003Cbr>\nthen hands clean results back to the tools you already use.\u003Cbr>\nUse Arnio _before_ and _alongside_ pandas, NumPy, 
scikit-learn, DuckDB, and Arrow.\n\n\u003Cbr>\n\n\u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Farnio\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Farnio?style=flat-square&logo=pypi&logoColor=white&labelColor=0d1117&color=3572A5\" alt=\"PyPI\">\u003C\u002Fa>&nbsp;\n\u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Farnio\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Farnio?style=flat-square&logo=python&logoColor=white&labelColor=0d1117&color=3572A5\" alt=\"Python\">\u003C\u002Fa>&nbsp;\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fim-anishraj\u002Farnio\u002Factions\u002Fworkflows\u002Fci.yml\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fim-anishraj\u002Farnio\u002Fci.yml?branch=main&label=CI&style=flat-square&logo=github&labelColor=0d1117&color=2ea44f\" alt=\"CI\">\u003C\u002Fa>&nbsp;\n\u003Ca href=\"https:\u002F\u002Fcodecov.io\u002Fgh\u002Fim-anishraj\u002Farnio\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fcodecov\u002Fc\u002Fgithub\u002Fim-anishraj\u002Farnio?style=flat-square&logo=codecov&labelColor=0d1117&color=2ea44f\" alt=\"Coverage\">\u003C\u002Fa>&nbsp;\n\u003Ca href=\"LICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-blue?style=flat-square&labelColor=0d1117\" alt=\"MIT\">\u003C\u002Fa>&nbsp;\n\u003Ca href=\"https:\u002F\u002Fgssoc.girlscript.tech\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGSSoC-2026-ff6b35?style=flat-square&labelColor=0d1117\" alt=\"GSSoC 2026\">\u003C\u002Fa>&nbsp;\n\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FxsEw7r78M\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join%20Community-5865F2?style=flat-square&logo=discord&logoColor=white&labelColor=0d1117\" alt=\"Join Discord\">\u003C\u002Fa>\n[![PyPI 
Downloads](https:\u002F\u002Fstatic.pepy.tech\u002Fpersonalized-badge\u002Farnio?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https:\u002F\u002Fpepy.tech\u002Fprojects\u002Farnio)\n\n\u003Cbr>\u003Cbr>\n\n```bash\npip install arnio\n```\n\nColab install smoke test: **[COLAB_SMOKE_TEST.md](COLAB_SMOKE_TEST.md)**\n\n\u003Cbr>\n\n\u003Ca href=\"#-quickstart\">Quickstart\u003C\u002Fa>&ensp;·&ensp;\u003Ca href=\"#-integrations\">Integrations\u003C\u002Fa>&ensp;·&ensp;\u003Ca href=\"#-why-arnio-exists\">Why Arnio\u003C\u002Fa>&ensp;·&ensp;\u003Ca href=\"#%EF%B8%8F-architecture\">Architecture\u003C\u002Fa>&ensp;·&ensp;\u003Ca href=\"#-benchmarks\">Benchmarks\u003C\u002Fa>&ensp;·&ensp;\u003Ca href=\"#-community\">Community\u003C\u002Fa>&ensp;·&ensp;\u003Ca href=\"#-contribute\">Contribute\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## ⚡ Quickstart\n\nIf you're new to Arnio, the example below demonstrates a simple first-run workflow for loading, cleaning, and preparing CSV data before converting it back into a pandas DataFrame.\nThe workflow starts by loading a CSV dataset into an Arnio frame for preprocessing and cleaning.\n\n```python\nimport arnio as ar\n\n# Load CSV directly through C++ — no Python parsing overhead\nframe = ar.read_csv(\"messy_sales_data.csv\")\n\n# Strict mode (default) fails on inconsistent row widths\nframe = ar.read_csv(\"messy_sales_data.csv\", mode=\"strict\")\n\n# Permissive mode fills missing trailing values with nulls\nframe = ar.read_csv(\"messy_sales_data.csv\", mode=\"permissive\")\n```\n\nEach pipeline step applies a specific transformation such as trimming whitespace, normalizing text formatting, handling missing values, and removing duplicate rows.\n\n\n```python\n# Declare what clean data looks like — arnio handles the rest\nclean = ar.pipeline(frame, [\n    (\"strip_whitespace\",),\n    (\"normalize_case\", {\"case_type\": \"lower\"}),\n    (\"fill_nulls\", 
{\"value\": 0.0, \"subset\": [\"revenue\"]}),\n    (\"drop_nulls\",),\n    (\"drop_duplicates\",),\n])\n```\n\nAfter preprocessing is complete, the cleaned result can be converted back into a standard pandas DataFrame for further analysis or integration with existing workflows.\n\n```python\n# Out comes a standard pandas DataFrame — use it like you always have\ndf = ar.to_pandas(clean)\n\n# Use copy=True when you need defensive pandas-owned buffers\nsafe_df = ar.to_pandas(clean, copy=True)\n```\n\n### Dry Run Validation\n\nUse `dry_run=True` to validate pipeline configuration and\nstep execution without returning transformed output.\n\n```python\nar.pipeline(\n    frame,\n    [\n        (\"drop_nulls\",),\n    ],\n    dry_run=True,\n)\n```\n\nNeed step timings for debugging? Opt in without changing the default pipeline return type:\n\n```python\nclean, metadata = ar.pipeline(\n    frame,\n    [(\"strip_whitespace\",), (\"drop_duplicates\",)],\n    return_metadata=True,\n)\n\nprint(metadata[\"step_timings\"])\nprint(metadata[\"applied_steps\"])\nprint(metadata[\"row_counts\"])\n```\n\n## Quick Example\n\n```python\nimport arnio\n\nframe = arnio.read_csv(\"sample.csv\")\n\n# Preview first 5 rows\nframe.preview(5)\n\n# Generate and view scannable summary statistics\nprint(frame.describe())\n```\n\n### Pipeline validation behavior\n\nPipeline step specifications are validated before execution begins.\n\nMalformed step tuples, invalid kwargs structures, or unknown step names fail early before any pipeline steps execute.\n\n```python\nar.pipeline(\n    frame,\n    [\n        (\"strip_whitespace\",),\n        (\"bad_step\", \"oops\", \"extra\"),\n    ],\n)\n```\n\nThis prevents partial pipeline execution when later pipeline steps are invalid.\n\n### from_dict support\n\nYou can build an `ArFrame` directly from a dictionary of equal-length columns, which is useful for small inline datasets that you want to 
pass into a pipeline.\n\n```python\nimport arnio as ar\n\ndata = {\"name\": [\"Alice\", \"Bob\"], \"age\": [25, 30]}\n\nframe = ar.from_dict(data)\n# or\nframe = ar.ArFrame.from_dict(data)\n```\n\nAlready working with a pandas `DataFrame`? Arnio can also be integrated directly into an existing pandas workflow without changing your current data-processing approach:\n\n```python\nimport pandas as pd\nimport arnio as ar\n\ndf = pd.read_csv(\"messy_sales_data.csv\")\n\nclean_df = df.arnio.clean([\n    (\"strip_whitespace\",),\n    (\"normalize_case\", {\"case_type\": \"lower\"}),\n    (\"drop_duplicates\",),\n])\n\nreport = clean_df.arnio.profile()\n```\n## Cross-field validation rules\n\nPass a `rules` list to `Schema` for checks that span multiple columns.\nEach rule receives the full pandas `DataFrame` and must return a\n`list[ValidationIssue]` — an empty list means the rule passed.\n\n```python\nimport arnio as ar\n\ndef end_after_start(df):\n    return [\n        ar.ValidationIssue(\n            column=\"end_date\",\n            rule=\"cross_field\",\n            message=\"end_date must be >= start_date\",\n            row_index=int(i) + 1,\n        )\n        for i, row in df.iterrows()\n        if row[\"end_date\"] \u003C row[\"start_date\"]\n    ]\n\nschema = ar.Schema(\n    {\"start_date\": ar.String(), \"end_date\": ar.String()},\n    rules=[end_after_start],\n)\n\nresult = schema.validate(ar.read_csv(\"events.csv\"))\nprint(result.passed)\n```\n> **Row index convention:** `ValidationIssue.row_index` values are **1-based** and\n> count data rows only. The header row is excluded. 
`row_index=1` is the first data\n> row in the file.\n\n## Schema diff reports\n\nUse `diff_schema()` to compare expected and observed data contracts across\ndatasets, releases, or generated schemas.\n\n```python\nimport arnio as ar\n\nexpected = ar.Schema({\n    \"id\": ar.Int64(nullable=False, unique=True),\n    \"email\": ar.Email(nullable=False),\n})\n\nobserved = ar.Schema({\n    \"id\": ar.Int64(nullable=False),\n    \"created_at\": ar.DateTime(format=\"%Y-%m-%d\"),\n})\n\ndiff = ar.diff_schema(expected, observed)\nprint(diff.summary())\nprint(diff.to_markdown())\n```\n\n## CI data contracts (GitHub Actions)\n\nIf you want to **block schema drift** or **invalid rows** in pull requests, see\n`DATA_CONTRACT_CI.md` for an **inert copy-paste** GitHub Actions workflow example.\n\nExample contract files are included under `examples\u002Fcontracts\u002F`.\n\n### Select specific columns\n\nUse `select_columns()` to create a new `ArFrame` with only the required columns before converting to pandas.\n\n```python\nselected = ar.select_columns(frame, [\"name\", \"revenue\"])\n\nprint(selected.columns)\n# ['name', 'revenue']\n```\n\n- Preserves the requested column order.\n- Returns a new `ArFrame`.\n- Raises `ValueError` if any requested column does not exist.\n- Raises `TypeError` if `columns` is not a sequence of strings.\n\n### Handling missing values\n\nArnio supports configuring which strings are treated as null during CSV parsing using the `null_values` parameter in `read_csv` and `scan_csv`. By default, Arnio preserves its existing behavior and treats only empty cells as null. 
Custom matching is case-insensitive and applies to cell values only (not headers).\n\n```python\n# Default behavior: empty cells are null\nframe = ar.read_csv(\"data.csv\")\n\n# Provide a custom list of sentinels (overrides the empty-cell default)\nframe = ar.read_csv(\"data.csv\", null_values=[\"\", \"MISSING\", \"UNKNOWN\"])\n\n# Disable null sentinel handling completely\nframe = ar.read_csv(\"data.csv\", null_values=[])\n```\n\n### Handling decimal separators\n\nUse `decimal_separator` when numeric CSV data uses a separator other than\nthe default dot. This is explicit by design: Arnio does not auto-detect decimal\nformats because a comma can also be the CSV delimiter.\n\n```python\n# Semicolon-delimited CSV with unquoted European decimals\nframe = ar.read_csv(\"prices.csv\", delimiter=\";\", decimal_separator=\",\")\n\n# Comma-delimited CSV still needs quoted comma-decimal values\nframe = ar.read_csv(\"prices.csv\", decimal_separator=\",\")\n```\n\nThe default remains `decimal_separator=\".\"`, so existing dot-decimal files keep\ntheir current behavior. If you also use `thousands_separator`, it must differ\nfrom `decimal_separator`.\n\n### Handling invalid UTF-8 bytes\n\nUse `encoding_errors` to control how invalid UTF-8 bytes are handled during CSV parsing.\n\n```python\n# Raise an error on invalid UTF-8 bytes (default)\nframe = ar.read_csv(\n    \"data.csv\",\n    encoding_errors=\"strict\",\n)\n\n# Replace invalid bytes with the Unicode replacement character (�)\nframe = ar.read_csv(\n    \"data.csv\",\n    encoding_errors=\"replace\",\n)\n\n# Ignore invalid bytes completely\nframe = ar.read_csv(\n    \"data.csv\",\n    encoding_errors=\"ignore\",\n)\n```\nSupported values:\n\n- `\"strict\"` (default)\n- `\"replace\"`\n- `\"ignore\"`\n> Every step above executes in C++. 
Your Python code is a _configuration_ — not the execution engine.\n\n> Explore more in the **[examples\u002F](.\u002Fexamples\u002F)** folder — ready-to-run recipes for sales, customers, survey, logs, and finance datasets.\n\n\u003Cbr>\n\n### Security note: CSV formula injection\n\nArnio preserves cell values when reading CSV files. It does not rewrite strings that\nbegin with spreadsheet formula prefixes such as `=`, `+`, `-`, or `@`.\n\nIf you export Arnio-cleaned data back to CSV and expect users to open that file in\nExcel, Google Sheets, LibreOffice, or another spreadsheet application, treat\nuntrusted text fields as potentially executable spreadsheet formulas. Before\nexporting, escape or neutralize formula-like strings in user-controlled columns,\nfor example by prefixing a single quote or another project-approved escape marker.\n\nThis is especially important for customer names, notes, comments, imported form\nfields, and any other free-text values that may come from outside your trust\nboundary. Arnio focuses on parsing, validation, profiling, and cleanup; final CSV\nexport policy should stay explicit in the application that writes the file.\n\n\u003Cbr>\n\n## Error Handling\n\n### `read_csv` and `scan_csv`\n\n| Input | Raises | Message |\n|:---|:---|:---|\n| File not found | `CsvReadError` | `Cannot open file: \u003Cpath>` |\n| Zero-byte file | `CsvReadError` | `CSV file is empty: '\u003Cpath>'` |\n| Blank header line | `CsvReadError` | `CSV header contains an empty column name` |\n| Binary \u002F NUL bytes | `CsvReadError` | `CSV input contains NUL bytes and appears to be binary or corrupted` |\n\n### Schema Validation\n\n`ar.validate()` returns a `ValidationResult`; it does not raise for validation failures. Check `result.passed` and `result.issues` for `dtype` or `required_column` rule violations.\n\n`validate()` currently operates on a single in-memory `ArFrame`. Chunked validation via `read_csv_chunked()` iterators is not yet supported directly. 
Validate each chunk individually or materialize the data before validation when working with streamed\u002Fchunked inputs.\n\n### Pipeline Step Errors\n\nUnknown step names raise `UnknownStepError` before execution begins.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>📸 Peek at a 100 GB file without loading it\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\n`scan_csv` reads only the header + a sample to infer the schema. Zero data loaded.\n\n```python\n# Pass sample_size to control how many rows are evaluated for type inference\nschema = ar.scan_csv(\"100GB_file.csv\", sample_size=500)\n# {'id': 'int64', 'name': 'string', 'is_active': 'bool', 'revenue': 'float64'}\n```\n\nUseful for exploring datasets before committing memory.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>📄 Read JSON Lines (JSONL \u002F NDJSON) files\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\n`read_jsonl` parses one JSON object per line into an ArFrame. Blank lines are skipped, missing keys become nulls, and mixed-type columns are coerced to string — the same rules as `from_pandas`.\n\n```python\n# events.jsonl\n# {\"user\": \"alice\", \"score\": 9.5, \"active\": true}\n# {\"user\": \"bob\",   \"score\": 8.1, \"active\": false}\n\nframe = ar.read_jsonl(\"events.jsonl\")\n\n# Limit rows\nframe = ar.read_jsonl(\"large.jsonl\", nrows=1000)\n\n# Non-UTF-8 encoding\nframe = ar.read_jsonl(\"data.ndjson\", encoding=\"latin-1\")\n\n# Plug straight into the cleaning pipeline\nclean = ar.pipeline(frame, [(\"strip_whitespace\",), (\"drop_nulls\",)])\n```\n\nRaises `ar.JsonlReadError` with the 1-based line number if a line contains invalid JSON.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>📦 Export to Parquet for columnar analytics pipelines\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\n`write_parquet` exports an ArFrame to a Parquet file via pyarrow.  
Install the optional extra first:\n\n```bash\n# Quote the requirement so shells such as zsh do not glob-expand the brackets\npip install \"arnio[parquet]\"\n```\n\n```python\n# Basic export\nar.write_parquet(frame, \"output.parquet\")\n\n# Choose compression codec: \"snappy\" (default), \"gzip\", \"zstd\", \"brotli\", \"none\"\nar.write_parquet(frame, \"output.parquet\", compression=\"zstd\")\n\n# Control row group size for large files\nar.write_parquet(frame, \"output.parquet\", row_group_size=50_000)\n\n# .pq extension also accepted\nar.write_parquet(frame, \"output.pq\")\n```\n\nRaises `ImportError` with an install hint if pyarrow is not available.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>👀 Preview rows without pandas conversion or full-column Python list materialization\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\n`preview()` reads only the first `n` rows directly from the C++ frame — no pandas conversion triggered.\n\n```python\nframe = ar.read_csv(\"huge_file.csv\")\n\nprint(frame.preview())      # first 5 rows (default)\nprint(frame.preview(n=10))  # first 10 rows\n```\n\nRaises `ValueError` for invalid `n` (zero, negative, or non-integer).\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>💰 Financial Decimal Support\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\n`arnio` accepts columns of Python `decimal.Decimal` objects.\n\n* **Behavior**: Python `Decimal` objects are automatically preserved as high-precision strings during serialization\u002Fbinding to prevent floating-point precision loss.\n* **Caveat**: When reading back into pandas, `to_pandas()` returns these as string (`object` dtype) columns. 
You will need to explicitly cast them back to `Decimal` objects on the resulting DataFrame if you want to resume exact math.\n\nExample:\n\n```python\nfrom decimal import Decimal\n\nimport pandas as pd\n\nimport arnio as ar\n\ndf = pd.DataFrame({\n    \"price\": [Decimal(\"19.99\"), Decimal(\"29.95\")]\n})\n\nframe = ar.from_pandas(df)  # Decimal values safely preserved as exact strings\nresult = ar.to_pandas(frame)\n# result[\"price\"] will be string objects [\"19.99\", \"29.95\"]\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>🧩 Add custom steps without touching C++\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\nRegister any Python function as a pipeline step. It receives a `DataFrame`, returns a `DataFrame`.\n\n```python\ndef remove_outliers(df, column=\"revenue\", threshold=100_000):\n    return df[df[column] \u003C= threshold]\n\nar.register_step(\"remove_outliers\", remove_outliers)\nar.register_step(\"team:drop_nulls\", remove_outliers)  # namespaced custom step\n\n# Use builtin: for an explicit built-in step, and your own prefixes\n# like team: or plugin_name: to avoid name collisions.\n\n# Introspect built-in and custom step names without reaching into internals.\nprint(ar.list_steps())\n\n# Opt in to a context object only when you need execution metadata.\ndef capture_context(df, context=None):\n    print(context.step_name, context.step_index, context.total_steps)\n    return df\n\n# Now use it in any pipeline alongside native C++ steps\nclean = ar.pipeline(frame, [\n    (\"builtin:strip_whitespace\",),\n    (\"remove_outliers\", {\"column\": \"revenue\", \"threshold\": 50000}),\n    (\"drop_duplicates\",),\n])\n```\n\nNeed to inspect the built-in kwargs a step accepts before assembling a pipeline?\n\n```python\nsignatures = ar.get_builtin_step_signatures()\nprint(list(signatures[\"drop_nulls\"].parameters))  # [\"subset\"]\nprint(list(signatures[\"filter_rows\"].parameters))  # [\"column\", \"op\", \"value\"]\n```\n\nNeed to 
restore the registry back to built-in steps only during tests?\n\n```python\nar.reset_steps()\n\nprint(ar.list_steps())\n# Only built-in steps remain\n```\n\nCustom steps run through a pandas↔ArFrame conversion bridge. Prototype in Python, then optionally migrate hot paths to C++ for full speed.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>🔄 Custom Step Overwrite Policy\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\nBy default, trying to register a custom step with a name that is already taken by another custom Python step will raise a `ValueError` to prevent silent overwriting.\n\nTo intentionally replace an existing custom **Python** step, pass `overwrite=True`:\n\n```python\ndef custom_logging(df):\n    print(\"Running step v1\")\n    return df\n\nar.register_step(\"log_data\", custom_logging)\n\n# This will succeed and safely overwrite the original logic\ndef custom_logging_v2(df):\n    print(\"Running step v2\")\n    return df\n\nar.register_step(\"log_data\", custom_logging_v2, overwrite=True)\n```\n> Note: Built-in C++ pipeline steps (like \"drop_nulls\") can never be overwritten, even if overwrite=True is explicitly supplied.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>✂️ Slice rows with head() and tail()\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\n\n`head()` and `tail()` return the first or last `n` rows as a new `ArFrame`.\n```python\nframe = ar.read_csv(\"data.csv\")\n\nframe.head()     # first 5 rows (default)\nframe.head(10)   # first 10 rows\nframe.tail(3)    # last 3 rows\n\n# n larger than row count returns all rows safely\nframe.head(1000)\n\n# n=0 returns an empty ArFrame\nframe.head(0)\n```\n\nRaises `ValueError` for negative or boolean `n`.\n\u003C\u002Fdetails>\n\n### Pipeline verbose diagnostics\n\nEnable lightweight pipeline diagnostics with `verbose=True`:\n\n```python\nresult = ar.pipeline(\n    frame,\n    [\n        (\"strip_whitespace\",),\n        (\"drop_nulls\",),\n    ],\n    
verbose=True,\n)\n```\n\nThis logs step execution order, execution path, elapsed time,\nand row-count changes through the `arnio` logger.\n\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 🔗 Integrations\n\nArnio is designed to make the rest of the Python data stack more productive,\nnot to replace it.\n\n| Workflow | How Arnio helps |\n|:---|:---|\n| **pandas** | Clean, validate, and profile messy `DataFrame`s through `df.arnio`. |\n| **NumPy** | Prepare typed numeric data before array\u002Fmodeling workflows. |\n| **scikit-learn** | Use Arnio cleaning as a preprocessing layer before model training. |\n| **DuckDB \u002F Arrow** | Validate and prepare data before analytics and columnar exchange. Export an `ArFrame` to a `pyarrow.Table` via `ar.to_arrow(frame)`. |\n| **notebooks** | Inspect quality issues and cleaning suggestions before analysis. |\n\n### DuckDB registration\n\nUse `ar.register_duckdb(frame, conn, \"table_name\")` to register an `ArFrame` directly as a DuckDB relation without writing pandas conversion glue yourself. DuckDB is an optional dependency — install it with `pip install duckdb` when needed.\n\nFor development and CI, `pip install -e \".[dev]\"` includes DuckDB, so the integration test module in `tests\u002Ftest_integrations_duckdb.py` runs in the default test job.\n\n```python\nimport duckdb\nimport arnio as ar\n\nframe = ar.read_csv(\"data.csv\")\nconn = duckdb.connect()\nar.register_duckdb(frame, conn, \"my_table\")\nresult = conn.execute(\"SELECT * FROM my_table\").fetchdf()\n```\n\n### Row-dropping and schema-change behavior\n\n`ArnioCleaner` enforces a strict transformer contract by default:\n\n- **Row-count-changing steps** (`drop_nulls`, `drop_duplicates`,\n  `filter_rows`, `keep_rows_with_nulls`) are **rejected by default** with a\n  clear `ValueError`. 
To allow row-count changes, pass `allow_row_count_change=True`.\n\n- **Column schema-changing steps** (`rename_columns`, `drop_columns`,\n  `drop_constant_columns`, `combine_columns`) are **rejected by default**\n  and allowed only when `allow_schema_changes=True` is passed.\n\n```python\n# Schema-preserving (default strict mode)\ncleaner = ArnioCleaner(\n    steps=[\n        (\"strip_whitespace\",),\n        (\"fill_nulls\", {\"value\": 0}),\n    ],\n)\n\n# Opt-in column schema changes\ncleaner = ArnioCleaner(\n    steps=[\n        (\"rename_columns\", {\"old_name\": \"new_name\"}),\n        (\"drop_constant_columns\",),\n    ],\n    allow_schema_changes=True,\n)\n```\n\n### Pandas accessor\n\n```python\ndf = pd.read_csv(\"raw_customers.csv\")\n\nclean_df = df.arnio.clean(drop_duplicates=True)\nquality = clean_df.arnio.profile()\nvalidation = clean_df.arnio.validate({\n    \"email\": ar.Email(nullable=False),\n    \"user_code\": ar.Regex(r\"^USR-\\d{4}$\", nullable=False),\n    \"age\": ar.Int64(nullable=True, min=0),\n    \"score\": ar.Custom(\"positive\"),\n})\n```\n\nThis keeps pandas as the analysis tool while Arnio handles the preparation,\nquality, and validation layer.\n\n> Product direction: **[PROJECT_DIRECTION.md](PROJECT_DIRECTION.md)**\n\n## 📘 Examples\n\nThese examples demonstrate how Arnio integrates with the Python data ecosystem.\n\nThey follow a simple workflow:\n\n**clean\u002Fvalidate data with Arnio → analyze with other tools**\n\n### 🔹 Interoperability Examples\n\n- **Arnio + pandas**\n  Clean and normalize messy tabular data using Arnio, then analyze it using pandas.\n  Run:\n```bash\n  python examples\u002Farnio_with_pandas.py\n```\n\n- **Arnio + NumPy**\n  Prepare numeric data safely using Arnio, then perform computations using NumPy.\n  Run:\n```bash\n  python examples\u002Farnio_with_numpy.py\n```\n\n- **Arnio + scikit-learn**\n  Prepare messy data with Arnio, then train a model with scikit-learn.\n  Run:\n```bash\n  python 
examples\u002Farnio_with_sklearn.py\n```\n\n- **Arnio + DuckDB**\n  Clean data with Arnio, then run SQL queries using DuckDB.\n  Run:\n```bash\n  python examples\u002Farnio_with_duckdb.py\n```\n\n- **Arnio + Arrow**\n  Export ArFrame to pyarrow.Table using ``ar.to_arrow()`` for zero-copy interop with Arrow-native tools.\n  Run:\n```bash\n  python examples\u002Farnio_with_arrow.py\n```\n\n\n\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 🔍 Why Arnio exists\n\nEvery data project starts the same way:\n\n```python\ndf = pd.read_csv(\"data.csv\")              # 💥 RAM spike — entire file as raw strings\ndf.columns = df.columns.str.strip()        # Why is this not automatic?\ndf[\"name\"] = df[\"name\"].str.strip()        # Python loop over every cell\ndf[\"name\"] = df[\"name\"].str.lower()        # Another Python loop\ndf = df.dropna()                           # Another pass\ndf = df.drop_duplicates()                  # Another pass\n```\n\nSix lines. Four full-data passes. All in interpreted Python. 
This is fine for a Jupyter demo — but it doesn't scale, it doesn't compose, and it definitely doesn't belong in production.\n\n**Arnio intercepts this entire pattern.** It moves the preparation layer into a predictable pipeline, accelerates supported operations in C++, and gives you clean data for pandas, NumPy, scikit-learn, DuckDB, or notebooks.\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"50%\">\n\n### Without Arnio\n```python\ndf = pd.read_csv(path)\ndf.columns = df.columns.str.strip()\nfor col in str_cols:\n    df[col] = df[col].str.strip()\n    df[col] = df[col].str.lower()\ndf = df.dropna(subset=[\"revenue\"])\ndf = df.drop_duplicates()\n# 6+ lines, multiple passes, pure Python\n```\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\">\n\n### With Arnio\n```python\nframe = ar.read_csv(path)\ndf = ar.to_pandas(ar.pipeline(frame, [\n    (\"strip_whitespace\",),\n    (\"normalize_case\", {\"case_type\": \"lower\"}),\n    (\"drop_nulls\", {\"subset\": [\"revenue\"]}),\n    (\"drop_duplicates\",),\n]))\n# Declarative. Single pipeline. C++ execution.\n```\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 🏗️ Architecture\n\nArnio is not a pandas wrapper. 
It's a separate runtime with its own data model.\n\n```mermaid\nflowchart LR\n  subgraph python[\"Your Python Code\"]\n    PY[\"frame = ar.read_csv('data.csv')\\nclean = ar.pipeline(frame, [...])\\ndf = ar.to_pandas(clean)\"]\n  end\n\n  python -->|\"pybind11 boundary\"| cpp\n\n  subgraph cpp[\"C++ Runtime (_arnio_cpp)\"]\n    direction TB\n    CSV[\"CsvReader\\n• RFC 4180\\n• BOM strip\\n• Type inference\\n• Quoted fields\"]\n    FRAME[\"Frame \u002F Column\\n• Columnar\\n• std::variant\\n• Bool null masks\\n• O(1) column lookup\"]\n    CLEAN[\"Cleaning Engine\\n• drop_nulls\\n• fill_nulls\\n• drop_dupes\\n• strip_ws\\n• normalize\\n• rename\u002Fcast\"]\n    CSV --> FRAME --> CLEAN\n  end\n\n  cpp -->|\"to_pandas() → zero-copy NumPy buffer (numerics\u002Fbools)\"| OUT[\"pandas DataFrame\"]\n```\n\n### Design decisions that matter\n\n| Decision | What it means |\n|:---|:---|\n| **Columnar storage** | Data lives in typed `std::vector`s — `vector\u003Cint64_t>`, `vector\u003Cdouble>`, `vector\u003Cstring>` — not rows of variants. Cache-friendly and SIMD-ready. |\n| **Boolean null masks** | Nulls are tracked in a separate `vector\u003Cbool>`, keeping data vectors dense. No sentinel values, no NaN tricks. |\n| **Two-pass CSV read** | Pass 1 infers types across all rows. Pass 2 parses values directly into the correct typed column. No string→object→cast overhead. |\n| **Zero-copy bridge** | `to_pandas()` exposes C++ memory directly via NumPy's buffer protocol where supported. Numeric columns preserve the fast zero-copy path by default, while `copy=True` requests defensive pandas-owned buffers. |\n| **Step registry** | Built-in and native steps use the C++ core via `_STEP_REGISTRY`; Python-backed built-ins dispatch through `_PYTHON_STEP_REGISTRY`; custom user-defined steps follow the same Python registry path. Adding a new cleaning primitive is a single function + one registry entry. 
|\n\n> Full architecture documentation: **[ARCHITECTURE.md](ARCHITECTURE.md)**\n> API reference guide: **[Arnio API Reference](.\u002FAPI_REFERENCE.md)**\n\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 🏎️ Benchmarks\n\n> **Reference environment**: Ubuntu, Python 3.12, synthetic messy CSV inputs.\u003Cbr>\n> **Reproduce**: `make benchmark` — generates deterministic tall and wide datasets and runs both engines.\n\nTo reproduce the published numbers from a fresh checkout:\n\n```bash\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npython -m pip install -U pip\npython -m pip install -e .\npython benchmarks\u002Fgenerate_data.py\npython benchmarks\u002Fbenchmark_vs_pandas.py\n```\n\n`benchmarks\u002Fgenerate_data.py` uses deterministic NumPy seeds, so every run creates the same `benchmarks\u002Fbenchmark_1m.csv` tall input and `benchmarks\u002Fbenchmark_wide.csv` wide input. The benchmark then executes three pandas runs and three arnio runs for each case, printing average wall-clock time from `time.perf_counter()` and peak Python allocation from `tracemalloc`. For cleaner comparisons, close other memory-heavy processes and run the script from the repository root after installing the same Python, pandas, NumPy, compiler, and arnio commit you want to compare.\n\nExpected output format:\n\n```text\nTall CSV (1,000,000 rows x 12 columns)\nMetric                     pandas        arnio\n────────────────────────────────────────────\nExec Time (avg)       4.73s         5.75s\nPeak RAM               211MB         212MB\nSpeed: 0.8x | RAM: -1% reduction\n\nWide CSV (5,000 rows x 256 columns)\nMetric                     pandas        arnio\n────────────────────────────────────────────\nExec Time (avg)       ...s          ...s\nPeak RAM              ...MB         ...MB\nSpeed: ...x | RAM: ...% reduction\n```\n\nSmall differences are expected across CPUs, operating systems, compilers, Python builds, and pandas\u002FNumPy versions. 
If you share benchmark results in an issue or PR, include your OS, Python version, CPU model, pandas\u002FNumPy versions, arnio commit, and the full command output so maintainers can compare like for like.\n\n**Arnio is near memory parity in the reference benchmark** while replacing ad-hoc Python string loops with a compiled, declarative pipeline. Validate memory and speed on your own workload. The execution time gap is a known, active optimization target — the current `drop_duplicates` and `strip_whitespace` implementations use unoptimized row-key serialization.\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd>✅ \u003Cb>What's already won\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>🎯 \u003Cb>What's being optimized\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\n\n- Native C++ parsing eliminates Python memory spikes\n- Columnar storage matches pandas' internal efficiency\n- Declarative API eliminates `.apply()` spaghetti\n- Zero-copy bridge for numeric conversions\n\n\u003C\u002Ftd>\n\u003Ctd>\n\n- `drop_duplicates` — replace string serialization with hash-based comparisons\n- `strip_whitespace` — in-place mutation instead of copy-on-write\n- Parallel column processing via `std::thread`\n- **[Help close the gap →](https:\u002F\u002Fgithub.com\u002Fim-anishraj\u002Farnio\u002Fissues)**\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003Cbr>\n\n### 🧠 Auto Clean Memory Benchmark\n\nTo measure the peak memory and execution time of the `auto_clean` pipeline using realistic dataset sizes:\n\n```bash\npython benchmarks\u002Fbenchmark_auto_clean_memory.py --rows 100000\n```\n\nThis script generates a reproducible synthetic dataset with mixed column types (strings, ints, floats, booleans, nulls, and duplicates) and measures:\n- `ar.read_csv` performance\n- `ar.auto_clean(mode=\"safe\")` performance (low-risk cleanup like whitespace trimming)\n- `ar.auto_clean(mode=\"strict\")` performance (includes type casting and deduplication)\n\nThe dataset is 
regenerated deterministically unless `--reuse-file` is provided.\nEach `auto_clean` benchmark run reloads the dataset to avoid mutation or caching effects between runs.\n\nOptions:\n- `--repeat N` runs each operation multiple times and reports average (and min\u002Fmax range).\n- `--seed N` changes the deterministic dataset seed.\n- `--reuse-file` reuses an existing dataset file instead of regenerating it.\n- `--keep-file` keeps the generated CSV (otherwise it is removed at the end).\n\nExpected output format:\n\n```text\nOperation                    Time(s)     Peak Py(MiB)\n--------------------------------------------------------------------\nar.read_csv           0.042 (0.041-0.044)    4.52 (4.50-4.60)\nar.auto_clean(safe)   0.012 (0.011-0.013)    0.15 (0.14-0.16)\nar.auto_clean(strict) 0.035 (0.034-0.036)    1.20 (1.18-1.22)\n--------------------------------------------------------------------\nTotal avg (Read+Strict)       0.077             4.52\n```\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 🧰 Cleaning primitives\n\nMost operations below run natively in C++. 
Currently, `filter_rows`, `replace_values` and `standardize_missing_tokens` run via the Python (pandas) backend and may be optimized in C++ later.\n\n| Primitive | What it does | Example |\n|:---|:---|:---|\n| `drop_nulls` | Remove rows with null\u002Fempty values | `ar.drop_nulls(frame, subset=[\"age\"])` |\n| `drop_columns` | Remove selected columns while preserving the remaining order | `frame = ar.drop_columns(frame, [\"debug_col\"])` |\n| `drop_empty_columns` | Remove columns whose values are all null\u002Fempty | `frame = ar.drop_empty_columns(frame)` |\n| `keep_rows_with_nulls` | Keep only rows that contain at least one null | `ar.keep_rows_with_nulls(frame, subset=[\"age\"])` |\n| `validate_columns_exist` | Fail early when required columns are missing | `ar.validate_columns_exist(frame, [\"age\"])` |\n| `filter_rows` | Filter rows using comparison operators | `ar.filter_rows(frame, column=\"age\", op=\">\", value=18)` |\n| `fill_nulls` | Replace nulls with a scalar | `ar.fill_nulls(frame, 0, subset=[\"revenue\"])` |\n| `drop_duplicates` | Deduplicate rows (first\u002Flast\u002Fnone) | `ar.drop_duplicates(frame, keep=\"first\")` |\n| `drop_constant_columns` | Remove columns with only one unique value | `ar.drop_constant_columns(frame)` |\n| `clip_numeric` | Clip numeric values to lower and\u002For upper bounds | `ar.clip_numeric(frame, lower=0, upper=100)` |\n| `coalesce_columns` | Select the first non-null value from a list of columns | `ar.coalesce_columns(frame, subset=[\"phone\", \"mobile\"], output_column=\"contact\")` |\n| `combine_columns` | Combine multiple columns into a single output column | `ar.combine_columns(frame, subset=[\"first\", \"last\"], separator=\" \", output_column=\"name\")` |\n| `strip_whitespace` | Trim leading\u002Ftrailing spaces from strings | `ar.strip_whitespace(frame)` |\n| `standardize_missing_tokens` | Replace common missing-value strings with NaN | `ar.standardize_missing_tokens(frame)` |\n| `normalize_case` | Force 
lower\u002Fupper\u002Ftitle case | `ar.normalize_case(frame, case_type=\"title\")` |\n| `rename_columns` | Rename columns via mapping | `ar.rename_columns(frame, {\"old\": \"new\"})` |\n| `cast_types` | Cast column types with `errors=\"raise\"`, `\"coerce\"`, or `\"ignore\"` | `ar.cast_types(frame, {\"age\": \"int64\"}, errors=\"raise\")` |\n| `round_numeric_columns` | Round numeric columns (non-numeric columns in subset ignored safely) | `ar.round_numeric_columns(frame, decimals=2)` |\n| `replace_values` | Replace values using a mapping (column or whole-frame). Handles `None`\u002F`NaN`. | `ar.replace_values(frame, {\"active\": \"A\", \"inactive\": \"I\"}, column=\"status\")` |\n| `clean` | Convenience shorthand | `ar.clean(frame, drop_nulls=True)` |\n| `safe_divide_columns` | Divide one column by another, handling zero\u002Fnull denominators | `ar.safe_divide_columns(frame, numerator=\"revenue\", denominator=\"cost\", output_column=\"ratio\")` |\n| `drop_columns_matching` | Drop columns whose names match a regex pattern | `ar.drop_columns_matching(frame, pattern=\"^temp_\")` |\n| `trim_column_names` | Strip leading\u002Ftrailing whitespace from column names | `ar.trim_column_names(frame)` |\n| `select_columns` | Return a new frame containing only selected columns | `ar.select_columns(frame, [\"id\", \"name\"])` |\n| `slugify_column_names` | Normalise column names to snake_case | `ar.slugify_column_names(frame)` |\n\n#### `ArFrame.select_dtypes` — type-based column selection\n\nReturns a **new `ArFrame`** containing only the columns whose dtype matches the filter. 
Raises `ValueError` if no columns match.\n\n```python\nframe = ar.read_csv(\"data.csv\")\n\n# Keep only numeric columns\nnumeric = frame.select_dtypes(include=[\"int64\", \"float64\"])\n\n# Drop string columns\nwithout_strings = frame.select_dtypes(exclude=\"string\")\n```\n\n**Valid dtype strings:** `\"int64\"`, `\"float64\"`, `\"string\"`, `\"bool\"`, `\"null\"`\n\n- At least one of `include` or `exclude` must be given — raises `ValueError` otherwise.\n- `include` and `exclude` must not overlap — raises `ValueError` if they share a dtype.\n- Unknown dtype strings raise `ValueError` with a list of valid options.\n- Raises `ValueError` when no columns match (never returns an empty frame silently).\n- Column order in the result always matches the original frame.\n\nThe cleaning primitives listed above can also be composed into a **pipeline**:\n\n```python\nclean = ar.pipeline(frame, [\n    (\"validate_columns_exist\", {\"columns\": [\"name\", \"city\", \"revenue\"]}),\n    (\"drop_columns\", {\"columns\": [\"debug_notes\"]}),\n    (\"strip_whitespace\",),\n    (\"standardize_missing_tokens\",),\n    (\"normalize_case\", {\"case_type\": \"lower\"}),\n    (\"fill_nulls\", {\"value\": \"unknown\", \"subset\": [\"city\"]}),\n    (\"drop_duplicates\", {\"keep\": \"first\"}),\n])\n```\n\n### Winsorize outliers\n\n`winsorize_outliers()` clips extreme numeric values using lower and upper quantiles. Non-numeric columns are ignored unless explicitly selected in `subset`.\n\n```python\nframe = ar.read_csv(\"data.csv\")\n\nresult = ar.winsorize_outliers(\n    frame,\n    lower=0.05,\n    upper=0.95,\n)\n```\n\nIt can also be used inside `ar.pipeline()` as `(\"winsorize_outliers\", {\"lower\": 0.05, \"upper\": 0.95})`.\n\n### 🔁 Replace values\n\nUse `replace_values` to substitute values using a mapping. It works as a pipeline step (Python backend) and can operate on a single column or the whole frame when `column` is omitted. 
It also understands null semantics: using `None` (or `np.nan`) as a mapping key targets existing nulls, and mapping a value to `None` creates real nulls.\n\nColumn-specific example:\n\n```python\nclean = ar.pipeline(frame, [\n    (\"replace_values\", {\"mapping\": {\"active\": \"A\", \"inactive\": \"I\"}, \"column\": \"status\"}),\n])\n```\n\nWhole-frame example (no `column`):\n\n```python\nclean = ar.pipeline(frame, [\n    (\"replace_values\", {\"mapping\": {None: \"MISSING\", \"active\": \"A\", \"inactive\": \"I\"}}),\n])\n```\n\nDirect API:\n\n```python\nframe2 = ar.replace_values(frame, {\"active\": \"A\", \"inactive\": \"I\"})\n```\n\n### 🔎 Filter rows inside pipelines\n\nUse `filter_rows` to keep only rows matching a condition.\n\n```python\nclean = ar.pipeline(frame, [\n    (\"filter_rows\", {\n        \"column\": \"revenue\",\n        \"op\": \">=\",\n        \"value\": 1000\n    }),\n])\n```\n\nSupported operators:\n\n- `>`\n- `\u003C`\n- `>=`\n- `\u003C=`\n- `==`\n- `!=`\n\nWorks with:\n\n- integers\n- floats\n- strings\n- booleans\n\n### 🔎 Isolate rows with null values\n\nUse `keep_rows_with_nulls` to audit incomplete data — keep only rows that have at least one null.\n\n```python\nframe = ar.read_csv(\"data.csv\")\n\n# Keep all rows that have at least one null anywhere\nnulls = ar.keep_rows_with_nulls(frame)\n\n# Keep rows where specifically 'age' or 'score' is null\nnulls = ar.keep_rows_with_nulls(frame, subset=[\"age\", \"score\"])\n\n# Works inside a pipeline too\nresult = ar.pipeline(frame, [\n    (\"keep_rows_with_nulls\", {\"subset\": [\"age\"]}),\n])\n```\n\nUseful for data auditing — inspect what's missing before deciding how to fill or drop.\n\n### Boolean string normalization\n\n```python\nclean = ar.parse_bool_strings(frame)\n```\n\nThis normalizes values such as `\"yes\"`, `\"no\"`, `\"true\"`, `\"false\"`, `\"y\"`, `\"n\"`, `\"1\"`, and `\"0\"` into boolean values while preserving unsupported values unchanged.\n\nColumns containing both 
parsed boolean values and unsupported string values may round-trip as strings because of ArFrame column typing semantics.\n\n\u003Cbr>\n\n### 🔢 Safe column division\n\nDivide one column by another while handling division by zero and null denominators explicitly:\n\n```python\nresult = ar.safe_divide_columns(\n    frame,\n    numerator=\"revenue\",\n    denominator=\"cost\",\n    output_column=\"ratio\",\n    fill_value=0.0,  # used when denominator is zero or null\n)\n```\n\n> When the denominator is **zero or null**, the result is replaced with `fill_value` (default `0.0`) instead of raising an error or producing `NaN`\u002F`Inf`.\n\n---\n\n\u003Cbr>\n\n## 📊 Pandas Dtype Support Matrix\n\nThis table helps users understand which pandas dtypes and workflows are fully supported, partially supported, unsupported, or planned.\n\nIf a dtype is partially supported, users may need conversion before processing. Unsupported dtypes should raise clear errors where applicable.\n\n| Pandas Dtype | Support Status | Notes \u002F Fix Hints |\n|---|---|---|\n| `int64` \u002F `Int64` | ✅ Supported | Fully supported with native C++ columnar storage. Nulls mapped to `pd.NA`. |\n| `float64` \u002F `Float64` | ✅ Supported | Fully supported with zero-copy conversion. Nulls mapped to `np.nan` or `pd.NA`. |\n| `bool` \u002F `boolean` | ✅ Supported | Native booleans supported with C++ backing. Nulls mapped to `pd.NA`. |\n| `string` \u002F `string[python]` | ✅ Supported | Native string extension type. Recommended for text. Nulls mapped to `pd.NA`. |\n| `object` (strings \u002F scalars) | ✅ Supported | Handled as text or coerced to common type if mixed. |\n| `object` (nested \u002F lists \u002F dicts) | ❌ Unsupported | Nested structures not allowed in flat columnar storage. Raises `TypeError`. |\n| `category` | ❌ Unsupported | Raises `TypeError` with fix hint. 
Convert to string: `df[\"col\"].astype(str)` |\n| `datetime64[ns]` \u002F timezone-aware | ❌ Unsupported | Raises `TypeError` with fix hint. Use `df[\"col\"].astype(str)` or string timestamps. |\n| `timedelta64[ns]` | ❌ Unsupported | Raises `TypeError` with fix hint. Use `df[\"col\"].dt.total_seconds()`. |\n| `complex64` \u002F `complex128` | ❌ Unsupported | Raises `TypeError` with fix hint. Split into real\u002Fimag columns or convert to strings. |\n\n### Notes\n\n- **Zero-copy Optimization**: Numeric columns (`int64`, `float64`) are optimized for fast zero-copy conversion between C++ and pandas where supported.\n- **Defensive Buffers**: Pass `copy=True` to `to_pandas()` when downstream pandas code needs defensive pandas-owned column buffers.\n- **Boolean Buffers**: Boolean conversion is copied because `std::vector\u003Cbool>` cannot be exposed as a zero-copy NumPy buffer.\n- **Null Handling**: Columns with null masks are automatically converted to pandas nullable Extension dtypes (`Int64`, `BooleanDtype`, `StringDtype`).\n- **Index Drop**: pandas DataFrame indexes are currently not preserved during `from_pandas()` conversion; converted frames receive a default `RangeIndex` when converted back via `to_pandas()`.\n- **Validation**: Attempting to convert any unsupported type will raise a clear, user-friendly `TypeError` detailing the column name and how to fix\u002Fpreprocess it.\n\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 🧠 Data quality engine\n\nArnio now includes built-in dataset understanding before you analyze in pandas.\n\n```python\nreport = ar.profile(frame)\nprint(report.summary())\n\nsuggestions = ar.suggest_cleaning(frame)\nclean = ar.pipeline(frame, suggestions)\n```\n\nFor production data contracts:\n\n```python\n# Register a custom validator once, then reference it by name in any schema\nar.register_validator(\"positive\", lambda v: v > 0)\n\nschema = ar.Schema({\n    \"id\": ar.Int64(nullable=False, unique=True),\n    \"email\": ar.Email(nullable=False),\n 
   \"phone\": ar.PhoneNumber(nullable=False),\n\n    \"user_type\": ar.String(nullable=False),\n\n    # country becomes required when user_type == \"international\"\n    \"country\": ar.String(\n        nullable=True,\n        required_if=(\"user_type\", \"international\"),\n    ),\n\n    # CurrencyCode validates 3-letter uppercase formats (e.g., USD, EUR, INR).\n    \"currency\": ar.CurrencyCode(),\n\n    # LanguageCode validates lowercase ISO 639-1 language codes (e.g., en, hi, fr).\n    \"language\": ar.LanguageCode(),\n\n    # TimeZone validates IANA timezone identifiers (e.g., Asia\u002FKolkata).\n    \"timezone\": ar.TimeZone(),\n\n    \"username\": ar.String(min_length=3, max_length=20),\n    \"user_code\": ar.Regex(r\"^USR-\\d{4}$\", nullable=False),\n    \"revenue\": ar.Custom(\"positive\", nullable=True, required_if=(\"user_type\", \"merchant\")),\n    \"signup_date\": ar.Date(nullable=False),\n    \"created_at\": ar.DateTime(nullable=False, format=\"%Y-%m-%d\"),\n\n})\n\nresult = ar.validate(frame, schema)\n\nif not result.passed:\n    summary = result.summary()\n    print(summary[\"issues_by_rule\"])\n    print(summary[\"issues_by_column\"])\n    print(summary[\"issues_by_column_and_rule\"])\n    print(result.to_pandas())\n    print(result.to_markdown(max_issues=10))\n```\n\nIn this example, `country` becomes required only when\n`user_type == \"international\"`.\n\n`Date` validates strict YYYY-MM-DD calendar dates.\n\n### Numeric string compatibility hints\n\nValidation messages indicate when string values appear safely convertible\nto numeric dtypes.\n\n```python\nimport arnio as ar\nimport pandas as pd\n\nframe = ar.from_pandas(\n    pd.DataFrame(\n        {\n            \"age\": [\"1\", \"2\", \"3\"],\n        }\n    )\n)\n\nschema = ar.Schema(\n    {\n        \"age\": ar.Int64(),\n    }\n)\n\nresult = ar.validate(frame, schema)\n\nprint(result.issues[0].message)\n# Column 'age' has dtype 'string'; expected 'int64'.\n# Values appear safely convertible to 'int64'\n```\n\n### Phone number 
validation\n\n`PhoneNumber()` validates common international and formatted phone number strings.\n\n```python\nschema = ar.Schema({\n    \"phone\": ar.PhoneNumber(nullable=False),\n})\n\nresult = ar.validate(frame, schema)\nprint(result.passed)\n```\n\nAccepted examples include:\n- `+1-555-123-4567`\n- `+91 9876543210`\n- `5551234567`\n\n### Warning-only validation\n\n```python\nschema = ar.Schema(\n    {\n        \"age\": ar.Int64(\n            min=18,\n            severity=\"warning\",\n        )\n    }\n)\n\nresult = ar.validate(frame, schema)\n\nprint(result.passed)  # True\nprint(result.issue_count)  # Warning issues are still reported\n```\n\nWarning-level issues remain visible in validation results without failing the overall validation status.\n\n### URL validation\n\n`URL()` validates that values are well-formed URLs. By default, both `http` and `https` schemes are accepted.\n\n```python\nschema = ar.Schema({\n    \"website\": ar.URL(nullable=False),\n})\nresult = ar.validate(frame, schema)\nprint(result.passed)\n```\n\nUse `allowed_schemes` to restrict which URL schemes are valid:\n\n```python\n# https only\nschema = ar.Schema({\n    \"website\": ar.URL(allowed_schemes=[\"https\"]),\n})\n\n# multiple schemes\nschema = ar.Schema({\n    \"endpoint\": ar.URL(allowed_schemes=[\"https\", \"ftp\"]),\n})\n```\n\nAny URL with a scheme not in `allowed_schemes` will fail validation.\n\n### Schema JSON round-trips\n\n```python\nschema = ar.Schema(\n    {\n        \"id\": ar.String(nullable=False),\n        \"created_at\": ar.DateTime(format=\"%Y-%m-%dT%H:%M:%S\"),\n    },\n    strict=True,\n    unique=[\"id\"],\n)\n\npayload = schema.to_json()\nrestored = ar.Schema.from_json(payload)\n```\n\nSee [examples\u002Fschema_validation.py](examples\u002Fschema_validation.py) for a complete runnable tutorial covering `Schema`, field types, invalid-row reporting, and `ValidationResult` output.\n\n`ValidationResult.to_markdown()` is useful in CI logs, GitHub comments, or data 
quality reports because it renders a compact validation summary plus a GitHub-friendly issue table.\n\nFor multi-column uniqueness (composite keys):\n\n```python\nschema = ar.Schema({\n    \"user_id\": ar.Int64(nullable=False),\n    \"course_id\": ar.Int64(nullable=False),\n}, unique=[\"user_id\", \"course_id\"])\n\nresult = ar.validate(frame, schema)\n```\n\n\nFor automatic cleaning suggestions based on the profile:\n\n```python\nsuggestions = ar.suggest_cleaning(frame)\n# e.g. [(\"strip_whitespace\", {\"subset\": [\"name\", \"city\"]}),\n#       (\"drop_duplicates\", {\"keep\": \"first\"})]\nclean = ar.pipeline(frame, suggestions)\n```\n\nFor low-risk automatic cleanup in one call:\n\n```python\nclean, report = ar.auto_clean(frame, return_report=True)\n```\n\nFor strict automatic cleanup, inspect type casts before applying them:\n\n```python\nreport = ar.auto_clean(frame, mode=\"strict\", dry_run=True)\ncast_mapping = dict(report.suggestions).get(\"cast_types\")\n\nclean = ar.auto_clean(\n    frame,\n    mode=\"strict\",\n    allow_lossy_casts=True,\n    confirmed_casts=cast_mapping,\n)\n```\n\nThis is the layer pandas does not try to own: profiling, data contracts, row-level validation issues, and preview-gated cleaning suggestions for messy incoming datasets.\n\n\u003Cbr>\n\n### Beginner-friendly auto-clean tutorial\n\nUse this workflow when you receive a small messy dataset and want to inspect what Arnio will change before applying it.\n\n```python\nimport arnio as ar\nimport pandas as pd\n\nraw = pd.DataFrame(\n    {\n        \"order_id\": [1001, 1002, 1002, 1003, 1004],\n        \"customer\": [\" Ishan \", \" Prasoon \", \" Prasoon \", \" Pranay \", \" Dhruv \"],\n        \"city\": [\" Paris \", \"London\", \"London\", \" New York \", \" Tokyo \"],\n    }\n)\n\nframe = ar.from_pandas(raw)\n\nreport = ar.profile(frame)\nsummary = report.summary()\nprint(summary)\n\nsuggestions = ar.suggest_cleaning(frame)\nprint(suggestions)\n# [('strip_whitespace', 
{'subset': ['customer', 'city']}), ('drop_duplicates', {'keep': 'first'})]\n\nsafe = ar.auto_clean(frame)\nstrict = ar.auto_clean(frame, mode=\"strict\")\n```\n\nMessy input:\n\n| order_id | customer | city |\n|:--|:--|:--|\n| 1001 | ` Ishan ` | ` Paris ` |\n| 1002 | ` Prasoon ` | `London` |\n| 1002 | ` Prasoon ` | `London` |\n| 1003 | ` Pranay ` | ` New York ` |\n| 1004 | ` Dhruv ` | ` Tokyo ` |\n\nExpected cleaned output with `mode=\"strict\"`:\n\n| order_id | customer | city |\n|:--|:--|:--|\n| 1001 | Ishan | Paris |\n| 1002 | Prasoon | London |\n| 1003 | Pranay | New York |\n| 1004 | Dhruv | Tokyo |\n\n`mode=\"safe\"` only trims whitespace. Use `mode=\"strict\"` when you also want deterministic built-in cleanup such as exact duplicate removal. If strict mode proposes type casts, run `dry_run=True` first and pass the exact proposed mapping as `confirmed_casts` with `allow_lossy_casts=True`.\n\nSee [examples\u002Fauto_clean_tutorial.py](examples\u002Fauto_clean_tutorial.py) for a runnable version of this walkthrough, and [examples\u002Fschema_validation.py](examples\u002Fschema_validation.py) for a focused validation tutorial.\n\n> For strict mode data-loss risks and safe workflow, see [AUTO_CLEAN_GUIDE.md](AUTO_CLEAN_GUIDE.md).\n\n\u003Cbr>\n\n## Data Quality Reports\n\nArnio provides detailed profiling for datasets via `ar.profile()`. 
To generate the report shown in these examples, the following code was used:\n\n```python\nimport arnio as ar\nimport pandas as pd\n\n# Sample dataset used for these examples\ndata = {\n    \"user_id\": [101, 102, 103, 104],\n    \"email\": [\"test@arnio.ai\", \"invalid-email\", None, \"test@arnio.ai\"],\n    \"score\": [85.5, 90.0, None, 88.2]\n}\ndf = ar.from_pandas(pd.DataFrame(data))\n# Bounded profiling for large datasets (controls how many sample values are kept)\nreport = ar.profile(df, sample_size=5)\nsafe_report = report.to_dict(redact_sample_values=True)\n```\n\n### Profiling privacy and redaction\n\nProfiling helps you understand data, but some report fields can still expose\nreal emails, names, IDs, or other sensitive values. Before you paste output into\nGitHub issues, Slack, public notebooks, or shared logs, check whether you are\nsharing **aggregate statistics only** or **raw\u002Fsample cell values**.\n\n**What is aggregate-only vs may expose raw values**\n\n| Field or export | Aggregate-only? | May expose raw \u002F sample data? 
|\n| --- | --- | --- |\n| `row_count`, `column_count`, `duplicate_rows`, `duplicate_ratio`, `quality_score`, `score_components` | Yes | No |\n| `null_count`, `null_ratio`, `unique_count`, `unique_ratio`, whitespace \u002F empty-string counts | Yes | No |\n| Numeric `min` \u002F `max` \u002F `mean` \u002F `std` \u002F `q25`–`q95` | Yes | Statistics only; small tables can still be identifying |\n| Numeric `iqr`, `outlier_lower_bound`, `outlier_upper_bound`, `outlier_count`, `outlier_ratio` | Yes | Aggregate Tukey-fence summary (thresholds and counts, not which rows are outliers) |\n| `semantic_type`, `suggested_dtype`, `warnings` | Metadata \u002F hints | Can imply PII type (for example email-like), not redaction |\n| `ColumnProfile.sample_values` (in-memory) | No | **Yes** — first *N* non-null values (`sample_size` on `ar.profile()`) |\n| `ColumnProfile.top_values` | Includes counts \u002F ratios | **Yes** — frequent **actual** values (exact or approximate; see below) |\n| `report.to_dict()` | Mixed | **Yes** — includes `sample_values` and `top_values` unless you redact samples |\n| `report.to_dict(redact_sample_values=True)` | Mixed | `sample_values` → `\"[REDACTED]\"` (same list length); `top_values[*].value` → `\"[REDACTED]\"` while counts and ratios remain |\n| `report.to_markdown()`, `report.summary()` | Yes | No raw cell values in output |\n| `report.to_html()` \u002F notebook display of `report` | Partial | **Shows `top_values`** chips; does not list `sample_values`. Use `redact_top_values=True` or `exclude_columns` for safer sharing. |\n| `report.to_pandas()` | Partial | Includes **`top_values`**, not `sample_values` |\n| `ProfileComparison.to_dict()` | Nested profiles | **Yes** — embeds `left_profile` \u002F `right_profile` via default `to_dict()` |\n\nArnio does **not** auto-mask emails, phone numbers, or IDs by column type. 
Use the\ncontrols below for safer sharing.\n\n**Safe sharing practices**\n\n- **JSON logs and artifacts:** `report.to_dict(redact_sample_values=True)` before writing or uploading.\n- **Collect fewer samples:** `ar.profile(frame, sample_size=0)` skips `sample_values` (defaults still apply to `top_values` counts on string columns).\n- **Text summaries for CI or comments:** prefer `report.to_markdown()` or `report.summary()` when you do not need per-value examples.\n- **Notebooks and HTML exports:** use `report.to_html(redact_top_values=True)` to replace every top-value chip label with `[REDACTED]` while preserving counts and ratios. To drop entire sensitive columns from the table, add `exclude_columns=[\"ssn\", \"email\"]`. Avoid saving unredacted `report.to_html()` output for sensitive data.\n- **GitHub bug reports and examples:** use synthetic data (`user@example.com`, `ID-001`), a minimal CSV, and redacted `to_dict()` output — not production dumps.\n- **Pandas export:** `ar.to_pandas(frame)` returns full table data; redaction applies to **quality reports**, not the underlying frame.\n- **Profile comparison:** `ProfileComparison.to_dict()` nests full profiles; build shared artifacts with `profile.to_dict(redact_sample_values=True)` if needed.\n\n```python\nimport arnio as ar\nimport pandas as pd\n\ndf = ar.from_pandas(pd.DataFrame({\n    \"email\": [\"user@example.com\", \"bad-email\", None],\n    \"user_id\": [101, 102, 103],\n}))\nreport = ar.profile(df, sample_size=2)\n\n# Safer JSON for sharing (sample_values and top_values values redacted)\nsafe_json = report.to_dict(redact_sample_values=True)\n\n# Safer HTML export (top-value chip labels replaced with [REDACTED])\nsafe_html = report.to_html(redact_top_values=True)\n# or exclude an entire column from the HTML table:\n# safe_html = report.to_html(redact_top_values=True, exclude_columns=[\"email\"])\n\n# Safer text summary (no sample_values or top_values in output)\nprint(report.to_markdown())\n```\n\nWhen 
`approx_top_values=True`, high-cardinality string columns estimate\n`top_values` from a deterministic sample. Each affected column profile may set\n`top_values_is_approximate`, `top_values_sample_count`, and\n`top_values_sample_ratio`. Counts and ratios are sample-based, but displayed\n**values are still real strings from your data**; treat them like exact `top_values`\nfor privacy.\n\n```python\n# Optional: approximate top values for high-cardinality string columns\nreport = ar.profile(\n    df,\n    approx_top_values=True,\n    approx_top_values_min_unique=1000,\n    approx_top_values_min_ratio=0.2,\n    approx_top_values_sample_size=2000,\n)\n```\n\n### Notebook dashboard (Jupyter \u002F Colab)\n\n`DataQualityReport` includes a notebook-friendly HTML dashboard. In a notebook, simply evaluate `report` in a cell to see a rich, static summary (quality score, duplicates, nulls, warnings, top values, and cleaning suggestions).\n\nIf you want to embed or save the HTML explicitly:\n\n```python\nfrom IPython.display import HTML\n\nHTML(report.to_html())\n# or: report.to_html(file_path=\"data_quality_report.html\")\n```\n\nReport output includes quantiles and an IQR outlier summary for numeric columns.\n\nFor numeric columns with at least four non-null values, Arnio reports `iqr` (`q75 − q25`), Tukey fences `outlier_lower_bound` (`q25 − 1.5×IQR`) and `outlier_upper_bound` (`q75 + 1.5×IQR`), plus `outlier_count` and `outlier_ratio`. A value is counted as an outlier only if it is **strictly less** than the lower bound or **strictly greater** than the upper bound. 
With fewer than four non-null values, quantiles may still appear but IQR\u002Foutlier fields are `null` in JSON.\n\nIllustrative `age` column (not from the `user_id` \u002F `email` \u002F `score` sample below):\n\n```json\n{\n  \"age\": {\n    \"dtype\": \"float64\",\n    \"mean\": 35.2,\n    \"std\": 10.1,\n    \"min\": 18.0,\n    \"max\": 60.0,\n    \"q25\": 27.5,\n    \"q50\": 35.0,\n    \"q75\": 44.0,\n    \"q95\": 57.0,\n    \"iqr\": 16.5,\n    \"outlier_lower_bound\": 2.75,\n    \"outlier_upper_bound\": 68.75,\n    \"outlier_count\": 0,\n    \"outlier_ratio\": 0.0,\n    \"null_count\": 0\n  }\n}\n```\n\n### Compare Profiles\nUse `ar.compare_profiles()` to compare two `DataQualityReport` profiles and flag per-column drift.\n\n```python\nbaseline = ar.profile(ar.read_csv(\"baseline.csv\"))\ncurrent  = ar.profile(ar.read_csv(\"current.csv\"))\n\ncomparison = ar.compare_profiles(baseline, current)\nprint(comparison.drift_report[\"score\"][\"status\"])  # \"ok\", \"warning\", or \"changed\"\nprint(comparison.status_counts)  # {\"ok\": 2, \"warning\": 1, \"changed\": 0}\n```\n\nUse `ar.check_quality_gates()` when profile drift should become a pass\u002Ffail\ndecision for CI, data releases, or monitoring.\n\n```python\nresult = ar.check_quality_gates(\n    baseline,\n    current,\n    max_row_count_delta_ratio=0.10,\n    max_null_ratio_delta=0.05,\n    max_numeric_mean_delta_ratio=0.10,\n)\n\nif not result.passed:\n    print(result.to_markdown())\n    result.raise_for_failures()\n```\n\n> **Scoring Contract:** The `quality_score` starts at 100.0 and subtracts capped penalties for duplicates, nulls, and suggested dtype mismatches. The `score_components` field exposes these penalties as negative values. (Note: Semantic-validity penalties are intentionally out of scope for the current implementation.)\n\n### 1. 
Terminal Representation (Simplified Example)\n*A simplified view of the standard string representation of the report object:*\n\n```text\nDataQualityReport(\n    row_count=4,\n    column_count=3,\n    memory_usage=733,\n    duplicate_rows=0,\n    quality_score=100.0,\n    score_components={},\n    columns={\n        'user_id': ColumnProfile(dtype='int64', semantic_type='identifier', unique_count=4),\n        'email': ColumnProfile(dtype='string', semantic_type='categorical', null_count=1, unique_ratio=0.666667, min=13, max=13, mean=13.0),\n        'score': ColumnProfile(dtype='float64', semantic_type='numeric', null_count=1, mean=87.9, min=85.5, max=90.0, std=1.8493, q25=86.85, q50=88.2, q75=89.1, q95=89.82, iqr=None, outlier_lower_bound=None, outlier_upper_bound=None, outlier_count=None, outlier_ratio=None, warnings=['contains_nulls'])\n    }\n)\n```\n\n### 2. JSON Format (Excerpts from .to_dict())\n*Key fields from the structured JSON export for integration with APIs or dashboards:*\n\n```json\n{\n  \"row_count\": 4,\n  \"column_count\": 3,\n  \"memory_usage\": 733,\n  \"duplicate_rows\": 0,\n  \"duplicate_ratio\": 0.0,\n  \"quality_score\": 100.0,\n  \"score_components\": {},\n  \"columns\": {\n    \"user_id\": {\n      \"dtype\": \"int64\",\n      \"semantic_type\": \"identifier\",\n      \"null_count\": 0,\n      \"unique_ratio\": 1.0\n    },\n    \"email\": {\n      \"dtype\": \"string\",\n      \"semantic_type\": \"categorical\",\n      \"null_count\": 1,\n      \"unique_ratio\": 0.666667,\n      \"min\": 13,\n      \"max\": 13,\n      \"mean\": 13.0,\n      \"warnings\": [\"contains_nulls\"]\n    },\n    \"score\": {\n      \"dtype\": \"float64\",\n      \"semantic_type\": \"numeric\",\n      \"null_count\": 1,\n      \"mean\": 87.9,\n      \"min\": 85.5,\n      \"max\": 90.0,\n      \"std\": 1.8493,\n      \"q25\": 86.85,\n      \"q50\": 88.2,\n      \"q75\": 89.1,\n      \"q95\": 89.82,\n      \"iqr\": null,\n      \"outlier_lower_bound\": null,\n      
\"outlier_upper_bound\": null,\n      \"outlier_count\": null,\n      \"outlier_ratio\": null,\n      \"warnings\": [\"contains_nulls\"],\n      \"histogram\": [\n        {\"bucket_start\": 85.5, \"bucket_end\": 85.95, \"count\": 1, \"ratio\": 0.333333},\n        {\"bucket_start\": 85.95, \"bucket_end\": 86.4, \"count\": 0, \"ratio\": 0.0},\n        {\"bucket_start\": 86.4, \"bucket_end\": 86.85, \"count\": 0, \"ratio\": 0.0},\n        {\"bucket_start\": 86.85, \"bucket_end\": 87.3, \"count\": 0, \"ratio\": 0.0},\n        {\"bucket_start\": 87.3, \"bucket_end\": 87.75, \"count\": 0, \"ratio\": 0.0},\n        {\"bucket_start\": 87.75, \"bucket_end\": 88.2, \"count\": 0, \"ratio\": 0.0},\n        {\"bucket_start\": 88.2, \"bucket_end\": 88.65, \"count\": 1, \"ratio\": 0.333333},\n        {\"bucket_start\": 88.65, \"bucket_end\": 89.1, \"count\": 0, \"ratio\": 0.0},\n        {\"bucket_start\": 89.1, \"bucket_end\": 89.55, \"count\": 0, \"ratio\": 0.0},\n        {\"bucket_start\": 89.55, \"bucket_end\": 90.0, \"count\": 1, \"ratio\": 0.333333}\n      ]\n    },\n    \"city\": {\n      \"dtype\": \"string\",\n      \"semantic_type\": \"categorical\",\n      \"null_count\": 0,\n      \"top_values\": [\n        {\"value\": \"London\", \"count\": 3, \"ratio\": 0.5},\n        {\"value\": \"Paris\", \"count\": 2, \"ratio\": 0.333}\n      ]\n    }\n  },\n  \"suggestions\": [\n    {\n      \"step\": \"cast_types\",\n      \"kwargs\": {\"score\": \"float64\"},\n      \"confidence_score\": 0.95,\n      \"confidence_reason\": \"Column 'score' conforms perfectly to float64 structure.\"\n    }\n  ]\n}\n```\nColumns where a single non-null value represents at least 95% of rows are reported with a `near_constant` warning.\nColumns with a very high ratio of unique values are reported with a `high_cardinality` warning because they may represent identifiers, leakage risk, or modeling hazards.\n\nExample near-constant distribution:\n\n```json\n{\n  \"row_count\": 100,\n  \"top_values\": 
[\n    {\"value\": \"London\", \"count\": 95, \"ratio\": 0.95},\n    {\"value\": \"Paris\", \"count\": 5, \"ratio\": 0.05}\n  ],\n  \"warnings\": [\"near_constant\"]\n}\n```\n\n### 3. Example Summary Table\n*A manually formatted Markdown table representing the core metrics:*\n\n| Metric | Value |\n| :--- | :--- |\n| **Row Count** | 4 |\n| **Column Count** | 3 |\n| **Memory Usage** | 733 bytes |\n| **Duplicates** | 0 (0.0%) |\n| **Quality Score** | 100.0 |\n\u003Cbr>\n\n### Bootstrapping a Schema from a Quality Report\n\nAfter profiling a dataset, you can automatically generate a validation schema\ndirectly from the report:\n\n```python\nimport arnio as ar\n\nframe = ar.from_pandas(df)\nreport = ar.profile(frame)\n\nschema = ar.Schema.bootstrap_from_report(report)\nresult = schema.validate(frame)\n\nprint(result.passed)\nprint(result.summary())\n```\n\nThe inferred schema uses conservative defaults: column dtypes are mapped\ndirectly from the report, and a column is marked `nullable=True` if any\nnull values were observed during profiling.\n\n## 🗺️ Roadmap\n\n| Phase | Focus | Status |\n|:---|:---|:---:|\n| Stable foundations | Cross-platform wheels · CI\u002FCD · PyPI publishing · Google Colab support · release hardening | ✅ Shipped |\n| Current focus | Reliability · contributor workflow · data-stack integrations · public API stability · benchmark baselines | 🔨 Active |\n| Next focus | Broader streaming workflows · richer file-format coverage · reproducible performance comparisons | 📋 Planned |\n| Later focus | Parallel column processing · SIMD string operations · lower-copy native cleaning paths | 💭 Exploring |\n\nBefore expanding the backlog again, maintainers should complete the\n**[Core Stability Sprint](CORE_STABILITY_SPRINT.md)**: install reliability,\ncorrectness hardening, public API stability, benchmark baselines, and PR queue\nhygiene.\n\nThe current release line is tracked in `pyproject.toml` and `CHANGELOG.md`.\nFeature status in this roadmap is 
phase-based so it does not drift behind the\npackage version.\n\n> For CLI command reference and examples, see [CLI_REFERENCE.md](CLI_REFERENCE.md).\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 💬 Community\n\nJoin the **[Arnio Discord Community](https:\u002F\u002Fdiscord.gg\u002FxsEw7r78M)** for quick setup help, contributor onboarding, GSSoC 2026 coordination, feature discussion, and community updates.\n\nDiscord is for fast conversation and support. GitHub remains the source of truth for issue assignment, PR reviews, bugs, roadmap decisions, and releases.\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FxsEw7r78M\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin%20Arnio%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white\" alt=\"Join Arnio Discord\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cbr>\n\n---\n\n\u003Cbr>\n\n## 📚 Documentation\n\n- [Project Direction](PROJECT_DIRECTION.md)\n- [Core Stability Sprint](CORE_STABILITY_SPRINT.md)\n- [Roadmap](ROADMAP.md)\n- [Troubleshooting Guide](docs\u002FTROUBLESHOOTING.md)\n- [Nullable dtype compatibility](docs\u002Fnullable_dtype_compat.md)\n\n## 🤝 Contribute\n\nArnio is a **[GSSoC 2026](https:\u002F\u002Fgssoc.girlscript.tech\u002F)** project with a structured contributor backlog across beginner, intermediate, and advanced tracks.\n\n### You don't need C++ to contribute\n\nMost new features are pure Python pipeline steps:\n\n```python\n# 1. Write a function that takes a DataFrame and returns a DataFrame\ndef remove_special_chars(df, columns=None):\n    cols = columns or df.select_dtypes(\"object\").columns\n    for col in cols:\n        df[col] = df[col].str.replace(r\"[^a-zA-Z0-9\\s]\", \"\", regex=True)\n    return df\n\n# 2. Register it\nar.register_step(\"remove_special_chars\", remove_special_chars)\n\n# 3. Write tests, open a PR. 
That's it.\n```\n\nIf Arnio renames a built-in or registered pipeline step in a future release,\nthe old step name can stay temporarily available and will emit a\n`DeprecationWarning` while routing execution to the new canonical step.\n\n### If you do know C++\n\nThe biggest performance wins are in:\n- **`drop_duplicates`** — replacing `std::ostringstream` row serialization with proper hash-based comparisons\n- **`strip_whitespace`** — converting from copy-on-write to in-place mutation\n- **Parallel column processing** — `std::thread` across independent columns\n\n### Getting started\n\n```bash\n# macOS \u002F Linux\ngit clone https:\u002F\u002Fgithub.com\u002Fim-anishraj\u002Farnio.git && cd arnio\nmake install   # pip install -e \".[dev]\" + pre-commit\nmake test      # pytest with coverage\nmake lint      # ruff + black\n\n# Windows\npython examples\u002Fcheck_env.py\npip install -e \".[dev]\"\npre-commit install\npytest tests\u002F -v\n```\n### Building frames without a CSV\n\nUse `ArFrame.from_records` (also available as `ar.from_records`) to build\nsmall frames inline — useful for tests, quick experiments, or feeding\nhand-crafted data into the pipeline without writing a CSV file.\n\n```python\nimport arnio as ar\n\n# list-of-dicts — column names inferred from keys\nframe = ar.from_records([\n    {\"id\": 1, \"name\": \"alice\", \"score\": 95},\n    {\"id\": 2, \"name\": \"bob\",   \"score\": 88},\n])\n\n# list-of-lists or tuples — columns must be supplied\nframe2 = ar.from_records(\n    [(1, \"alice\", 95), (2, \"bob\", 88)],\n    columns=[\"id\", \"name\", \"score\"],\n)\n```\n\nMissing keys in dict records are filled with `None`. Nested values raise `TypeError`. 
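An empty list raises `ValueError`.\n\n## Type Casting\n\nYou can cast columns to a different data type using the `.astype()","Arnio is a C++-accelerated data quality toolkit for Python. Its main features include CSV parsing, data cleaning, schema validation, data profiling, and pandas integration. Its precompiled C++ engine processes data efficiently: it quickly parses, infers types, strips whitespace, deduplicates, validates, and profiles data, then returns the clean data to tools users already know, such as pandas and NumPy. It suits scenarios that require fast preprocessing and quality assurance for large volumes of CSV data, such as the data preparation stage of a data science project.",2,"2026-06-11 03:58:36","CREATED_QUERY"]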