[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72281":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},72281,"DataFlow","OpenDCAI\u002FDataFlow","OpenDCAI","Easy Data Preparation with latest LLMs-based Operators and Pipelines.","https:\u002F\u002FOpenDCAI.github.io\u002FDataFlow-Doc\u002F",null,"Python",4747,529,186,8,0,160,362,737,480,110.17,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39],"data","data-agent","data-cleaning","data-pipelines","data-processing","data-science","data-synthesis","gradio-interface","llms","operators","quick-data-processing","sglang-bankend","vllm-backend","2026-06-12 04:01:04","# DataFlow\n\n\n\u003Cdiv align=\"center\">\n\n**Generate, Clean, and Prepare LLM Data, All-in-One**\n\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa19865e5-221d-4c12-bb57-17421df87c8a\">\n\n\u003C!-- [![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002FOpenDCAI\u002FDataFlow?style=social)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow) 
-->\n\n[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenDCAI\u002FDataFlow?style=social)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow)\n[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-raw\u002FOpenDCAI\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fissues)\n[![issue resolution](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed-raw\u002Fopendcai\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fissues?q=is%3Aissue%20state%3Aclosed)\n[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-pr-raw\u002FOpenDCAI\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fpulls)\n[![issue resolution](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-pr-closed-raw\u002Fopendcai\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fpulls?q=is%3Apr+is%3Aclosed)\n[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FOpenDCAI\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fgraphs\u002Fcontributors)\n[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frepo-size\u002FOpenDCAI\u002FDataFlow?color=green)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow)\n\n[![PyPI version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fopen-dataflow)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fopen-dataflow\u002F)\n[![PyPI - Python Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fopen-dataflow)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fopen-dataflow\u002F)\n[![PyPI - 
Downloads](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdm\u002Fopen-dataflow?style=flat&logo=python)](https:\u002F\u002Fpypistats.org\u002Fpackages\u002Fopen-dataflow)\n[![Downloads](https:\u002F\u002Fstatic.pepy.tech\u002Fbadge\u002Fopen-dataflow)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fopen-dataflow)\n\n[![Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1haosl2QS4N4HM7u7HvSsz_MnLabxexXl?usp=sharing)\n[![Docker](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocker-Ready-blue?logo=docker)](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fmolyheci\u002Fdataflow)\n[![Documents](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-Click_here-brightgreen?logo=read-the-docs)](https:\u002F\u002FOpenDCAI.github.io\u002FDataFlow-Doc\u002F)\n[![Arxiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTechnical_Report-2512.16676-b31b1b.svg?logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.16676)\n[![Ask DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002FOpenDCAI\u002FDataFlow)\n\n\n[![Discord Online](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1479323317096939551?logo=discord&label=discord&color=%235966F0)](https:\u002F\u002Fdiscord.gg\u002Fe4mKEaFptu)\n[![wechat](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fwechat-brightgreen?logo=wechat&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3c2e5d4d-d1ea-4d8c-9146-ff14e657e857)\n\n\n\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F16045\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F16045\" alt=\"OpenDCAI%2FDataFlow | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\n\u003C!-- ![PyPI - 
Downloads](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdd\u002Fopen-dataflow?style=flat&logo=python)\n![PyPI - Downloads](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdw\u002Fopen-dataflow?style=flat&logo=python) -->\n\n\u003C!-- [![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FOpenDCAI\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fblob\u002Fmain\u002FLICENSE) -->\n\n\u003C!-- [![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002FOpenDCAI\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fcommits\u002Fmain\u002F) -->\n\n\u003C!--[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-raw\u002FOpenDCAI\u002FDataFlow)](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow\u002Fissues) -->\n\n\n\nVisual, low-code pipelines with flexible orchestration across domains and use cases.💪\n\nTurn raw data into high-quality LLM training datasets.🔧\n\n🎉 Get smarter LLMs cheaply — give us a star ⭐ on GitHub for the latest update.\n\n**Beginner-friendly learning resources (continuously updated)**: \n[[🎬 Video Tutorials]](https:\u002F\u002Fspace.bilibili.com\u002F3546929239689711?spm_id_from=333.337.0.0)\n[[📚 Written Tutorials]](https:\u002F\u002Fwcny4qa9krto.feishu.cn\u002Fwiki\u002FI9tbw2qnBi0lEakmmAGclTysnFd)\n\n[简体中文](.\u002FREADME-zh.md) | English\n\n\n\u003C!-- \u003Cimg width=\"1568\" height=\"688\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6d8fd795-7f5b-4c45-b14d-5bbe6bf99766\" \u002F> -->\n\u003C\u002Fdiv>\n\n\n## 📰 0. News\n\n* **[2026-02-02] 🖥️ DataFlow WebUI is now available!**\n  Launch the visual pipeline builder with a single command: `dataflow webui`. Build and run DataFlow pipelines through an intuitive web interface. 
👉 [WebUI Docs](#dfwebui)\n  \u003Cdiv style=\"display: flex; gap: 12px;\">\n    \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb4f172d6-7753-4121-b981-55046a7a9e43\" width=\"45%\" \u002F>\n    \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb2147987-3b1e-4f56-9818-3d5e7440fa58\" width=\"45%\" \u002F>\n  \u003C\u002Fdiv>\n* **[2026-01-20] 🌟 Awesome Works Using DataFlow is now live!**\n  A new section showcasing open-source projects and research built on DataFlow. Contributions are welcome! 👉 [Awesome Works](#awesome-dataflow)\n\n* **[2025-12-19] 🎉 Our DataFlow technical report is now available!**\n  Read and cite our work on arXiv: [https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.16676](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.16676)\n\n* **[2025-11-20] 🤖 Introducing New Data Agents for DataFlow!**\n  Try them out and follow the tutorial on Bilibili: [https:\u002F\u002Fspace.bilibili.com\u002F3546929239689711\u002Flists\u002F6761342?type=season](https:\u002F\u002Fspace.bilibili.com\u002F3546929239689711\u002Flists\u002F6761342?type=season)\n\n* **[2025-06-28] 🎉 DataFlow is officially released!**\n  Our data-centric AI system is now public. Stay tuned for future updates.\n\n\n## 🔍 1. 
What is DataFlow?\n\n\u003C!--  \u003Cimg src=\".\u002Fstatic\u002Fimages\u002Fdataflow_framework.jpg\"> -->\n\n\u003C!--  ![dataflow_framework](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb44db630-754a-44a8-bec7-6d350bf5ed61) -->\n\n\n\nDataFlow is a data preparation and training system designed to **generate, refine, evaluate, and filter** high-quality data for AI from noisy sources (PDF, plain text, low-quality QA). This data improves the performance of large language models (LLMs) in specific domains such as healthcare, finance, legal, and academic research, either through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG systems.\n\nThrough an `operator-based` design, DataFlow turns the entire data cleaning workflow into a reproducible, reusable, and shareable `pipeline`, providing core infrastructure for the Data-Centric AI community. Additionally, we develop an intelligent `DataFlow-agent` capable of dynamically assembling new `pipelines` by recombining existing `operators` or creating new ones on demand.\n\n\u003C!-- Specifically, we are constructing diverse `operators` leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct `pipelines`, collectively forming the comprehensive `DataFlow system`. Additionally, we develop an intelligent `DataFlow-agent` capable of dynamically assembling new `pipelines` by recombining existing `operators` on demand. -->\n![df_overview_final_300](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F57dd0838-6e24-4814-a89a-02ca0667bd5c)\n\n\u003C!-- 🔥 New: DataFlow WebUI is now available! Launch the visual pipeline builder with a single command: `dataflow webui`. Build and run DataFlow pipelines through an intuitive web interface. 👉 [DataFlow-WebUI](#54-webui) -->\n\n## 🔍 2. 
Key Features\n\n### ✅2.1  Ready-to-Use Data Synthesis and Cleaning Pipelines\n- High-Quality Training Data Generation\n  - Text, Math, and Code data generation (see DataFlow-Instruct-10K for results)\n  - Data generation via tools like AgenticRAG and Text2SQL\n- Structured Data Extraction\n  - Large-scale PDF → QA conversion\n  - Large-scale book PDF → Visual-QA conversion\n- Scientific Data Workflow Management\n  - Text2SQL workflow management (Accepted by ICDE 2026)\n  - Math data workflows (Accepted by KDD 2026)\n  \n### ⚙️2.2  Flexible Custom Pipeline Orchestration\n- 10+ core operators define interaction patterns and design principles\n- 100+ pipeline-specific operators available for reuse or reference\n- Full support for creating custom operators: plug-and-play, easily packaged and distributed via GitHub or PyPI\n\n### 🧠2.3  Reproducible, Reusable, and Shareable Data-Centric AI System\n- Data governance algorithms are encapsulated as operator pipelines, enabling reproducibility and fair comparison of different data governance strategies (❤️research-friendly)\n- Easily swap the underlying large models to quickly analyze the relationship between model performance and data quality\n- Built on Python and Git ecosystems for easy distribution, management, and traceability of high-quality, **user-defined** data governance operators and pipelines (❤️enterprise-friendly)\n\n\n## 🛠️ 3. DataFlow Suite \nThe DataFlow Suite provides the essential infrastructure to automate and scale LLM data preparation with the DataFlow main repository. 
It comprises four tightly integrated layers:\n\n- [DataFlow-WebUI](#dfwebui) – An intuitive, visual interface for constructing and managing complex data pipelines through a drag-and-drop operator workflow.\n\n- [DataFlow-Agent](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow-Agent) – An AI-powered assistant that dynamically composes, executes, and optimizes operators and pipelines based on high-level user intent.\n\n- [DataFlow-Ecosystem](#awesome-dataflow) – A modular distribution layer that standardizes operator registration. It enables domain-specific modules (e.g., [DataFlow-MM](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow-MM), DataFlow-AI4S) to contribute extensible libraries under a unified abstraction.\n\n- [RayOrch](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FRayOrch) – A high-performance orchestration layer built on Ray, providing distributed compute scheduling and resource management for massive-scale data tasks.\n\nTogether, these components form a unified, extensible environment that transforms raw data into model-ready intelligence.\n\n## ✅ 4. Why use DataFlow?\nData generation and cleaning are crucial for high-quality models, but for both enterprises and individuals, these tasks are often time-consuming, labor-intensive, and costly. 
**DataFlow provides a one-stop solution to tackle these challenges efficiently.**\nCompared with systems like Nemo-Curator and Data-Juicer, DataFlow offers:\n- **Enhanced Support for Data Synthesis Modules** – Seamlessly integrates text, code, and math data generation pipelines for high-quality training datasets.\n- **PyTorch-like Programming Management** – Clear **Pipeline → Operator → Prompt** hierarchical structure for workflow control.\n- **Principled and Multi-Category Operator Classification** – Operators are systematically organized into multiple functional categories such as **generation, evaluation, filtering, and refinement**, forming a scientifically grounded, multi-dimensional taxonomy that reflects different stages of data preparation and enables precise operator selection and composition.\n- **User-Friendly Design for Easy Debugging and Onboarding** – Simplified workflow patterns that reduce the learning curve and accelerate experimentation.\n\n\n## 🔧 5. How do operators work?\nDataFlow operators are designed with **simplicity and clarity** in mind.\n\nOperators take structured inputs (JSON, JSONL, CSV) and write their processed results back as structured outputs.\nEach operator encapsulates a specific data processing task, providing a clean and consistent API that is easy to understand and integrate. The PyTorch-like design makes them intuitive and ready to use, allowing you to quickly build, combine, and customize pipelines without dealing with complex boilerplate code.\n\nFor more details, refer to the [Operator Documentation](https:\u002F\u002Fopendcai.github.io\u002FDataFlow-Doc\u002Fzh\u002Fapi\u002Fhome\u002F). 
Below is a minimal example demonstrating how to invoke the `PromptedGenerator` operator: \n\n![dataflow_operator](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd79a0d8b-09ef-457e-af8b-85af0d03b73d)\n\nExample input data (json\u002Fjsonl-style):\n\n```json\n\u002F\u002F input.json\n[\n  {\"problem\": \"What is 17 + 25?\"},\n  {\"problem\": \"If x = 3, compute 2x^2 + 1.\"}\n]\n```\n\nOperator invocation code:\n\n```python\nfrom dataflow.operators.core_text import PromptedGenerator\nfrom dataflow.utils.storage import FileStorage\nfrom dataflow.serving import APILLMServing_request\n\n# set input file to global storage class\nstorage = FileStorage(first_entry_file_name=\".\u002Finput.json\",)\n\n# configure LLM serving (e.g., OpenAI API)\n# api key needs to be set via `export DF_API_KEY=sk-xxx`\nllm_serving = APILLMServing_request(\n    api_url=\"https:\u002F\u002Fapi.openai.com\u002Fv1\u002Fchat\u002Fcompletions\",\n)\n\nprompted_generator = PromptedGenerator(\n    llm_serving=llm_serving,  # pre-configured LLM backend\n    system_prompt=\"Please solve this math problem.\"\n)\n\nprompted_generator.run(\n    storage=storage.step(),       # data management (details omitted)\n    input_key=\"problem\",          # read from this column\n    output_key=\"solution\"         # write to this column\n)\n```\nAfter running, the operator writes the generated results to the `output_key` column. For example, the output data (json\u002Fjsonl-style) becomes:\n\n```json\n\u002F\u002F dataflow_step1.json\n[\n    {\"problem\":\"What is 17 + 25?\",\"solution\":\"42\"},\n    {\"problem\":\"If x = 3, compute 2x^2 + 1.\",\"solution\":\"19\"}\n]\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Ch2>🛠️ 6. 
Pipelines (Click to expand)\u003C\u002Fh2>\u003C\u002Fsummary>\n\n### 🔧 6.1 Ready-to-Use Pipelines\n\nThe current pipelines in DataFlow are as follows:\n\n- [📝 **Text Pipeline**](https:\u002F\u002Fopendcai.github.io\u002FDataFlow-Doc\u002Fen\u002Fguide\u002Ftextpipeline): Mine question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.\n  - ![dataflow_text_pipeline](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F34e3aef2-ba4f-4997-9127-9d21fdb2dede)\n  - [[HuggingFace🤗 demo input &amp; output for **Text Pipeline**]](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpen-Dataflow\u002Fdataflow-demo-Text)\n- [🧠 **Reasoning Pipeline**](https:\u002F\u002Fopendcai.github.io\u002FDataFlow-Doc\u002Fen\u002Fguide\u002Freasoningpipeline\u002F#_2-question-handling): Enhances existing question–answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.\n  - ![dataflow_reasoning_pipeline](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffef5829b-3991-4dcb-99ad-d61d95c982ea)\n  - [[HuggingFace🤗 demo input &amp; output for **Reasoning Pipeline**]](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpen-Dataflow\u002Fdataflow-demo-Reasonning)\n- [🗃️ **Text2SQL Pipeline**](https:\u002F\u002Fopendcai.github.io\u002FDataFlow-Doc\u002Fen\u002Fguide\u002Ftext2sqlpipeline\u002F): Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.\n  - ![dataflow_text2sql_pipeline](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fbae9914e-851b-4502-8696-291d6c1b8824)\n  - [[HuggingFace🤗 demo input &amp; output for **Text2SQL Pipeline**]](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpen-Dataflow\u002Fdataflow-demo-Text2SQL)\n- [📚 **Knowledge Base Cleaning 
Pipeline**](https:\u002F\u002Fopendcai.github.io\u002FDataFlow-Doc\u002Fen\u002Fguide\u002Fr51ooua8\u002F): Extract and structure knowledge from unorganized sources like tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.\n  - ![dataflow_KnowledgeBaseClean_pipeline](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6f21e682-ec10-42af-b5e2-8fec2929eeae)\n- [🤖 **Agentic RAG Pipeline**](https:\u002F\u002Fopendcai.github.io\u002FDataFlow-Doc\u002Fen\u002Fguide\u002Fagenticrag_pipeline\u002F): Identify and extract QA pairs from existing QA datasets or knowledge bases that require external knowledge to answer, for use in downstream training of Agentic RAG tasks.\n  - ![dataflow_agenticRAG_pipeline](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F65e80dca-f286-495b-abb7-804b3fc34a53)\n\n### ⚙️ 6.2 Flexible Operator Pipelines\n\nIn this framework, operators are grouped into categories such as Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, which together support data processing and evaluation. Please refer to the [documentation](https:\u002F\u002FOpenDCAI.github.io\u002FDataFlow-Doc\u002F) for details.\n\n### 🤖 6.3 Agent Guided Pipelines\n\n\u003C!-- Building on top of this, we also provide the -->\n\n- **DataFlow Agent**: An intelligent assistant that performs data analysis, writes custom `operators`, and automatically orchestrates them into `pipelines` based on specific task objectives.\n\n  - ![dataflow_agent_pipeline](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffe0776fa-55bd-49cd-bfe6-06ad377f62bb)\n  - [[HuggingFace🤗 demo input &amp; output for **DataFlow Agent**]](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpen-Dataflow\u002Fdataflow-demo-Agent)\n\n\u003C!-- ### 3.1 Text Pipeline\n![](.\u002Fstatic\u002Fimages\u002Fdemo_reasoning.png) -->\n\n\u003C\u002Fdetails>\n\n\n## ⚡ 7. 
Quick Start\n\n### 🛠️ 7.1 Environment Setup and Installation\n> DataFlow supports Python>=3.10 and has been tested on Windows, Linux, and macOS with Python 3.10, 3.11, and 3.12.\n\nPlease use the following commands for environment setup and installation👇\n\nWe recommend using [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) to install DataFlow, as it speeds up installation.\n\n```shell\npip install uv\nuv pip install open-dataflow\n```\n\nIf you want to use your own GPU for local inference, please use:\n\n```shell\npip install uv\nuv pip install \"open-dataflow[vllm]\"\n```\n\nAfter installation, you can use the following command to check if dataflow has been installed correctly:\n\n```shell\ndataflow -v\n```\n\nIf installed correctly, you should see:\n\n```log\nopen-dataflow codebase version: 1.0.0\n        Checking for updates...\n        Local version:  1.0.0\n        PyPI newest version:  1.0.0\nYou are using the latest version: 1.0.0.\n```\n\n#### 🐳 7.2 Docker Installation (Alternative)\n\nWe also provide a **Dockerfile** for easy deployment and a **pre-built Docker image** for immediate use.\n\n##### Option 1: Use Pre-built Docker Image\n\nYou can directly pull and use our pre-built Docker image:\n\n```shell\n# Pull the pre-built image\ndocker pull molyheci\u002Fdataflow:cu124\n\n# Run the container with GPU support\ndocker run --gpus all -it molyheci\u002Fdataflow:cu124\n\n# Inside the container, verify installation\ndataflow -v\n```\n\n##### Option 2: Build from Dockerfile\n\nAlternatively, you can build the Docker image from the provided Dockerfile:\n\n```shell\n# Clone the repository (HTTPS)\ngit clone https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow.git\n# Or use SSH\n# git clone git@github.com:OpenDCAI\u002FDataFlow.git\n\ncd DataFlow\n\n# Build the Docker image\ndocker build -t dataflow:custom .\n\n# Run the container\ndocker run --gpus all -it dataflow:custom\n\n# Inside the container, verify installation\ndataflow -v\n```\n\n> **Note**: The Docker image 
includes CUDA 12.4.1 support and comes with vLLM pre-installed for GPU acceleration. Make sure you have [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html) installed to use GPU features.\n\n### 🚀 7.3 Quick Start with Google Colab\nYou can start your first DataFlow translation project directly on Google Colab.\nBy following the provided guidelines, you can seamlessly scale from a simple translation example to more complex DataFlow pipelines.\n\n👉 [Start DataFlow with Google Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1haosl2QS4N4HM7u7HvSsz_MnLabxexXl?usp=sharing)\n\n\n### 📖 7.4 Reference Project Documentation\n\nFor detailed **usage instructions** and a **getting started guide**, please visit our [DataFlow Documentation](https:\u002F\u002FOpenDCAI.github.io\u002FDataFlow-Doc\u002F).\n\n[![Documents](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-Click_here-brightgreen?logo=read-the-docs)](https:\u002F\u002FOpenDCAI.github.io\u002FDataFlow-Doc\u002F)\n\n\u003Ca name=\"dfwebui\">\u003C\u002Fa>\n\n### 🖥️ 7.5 DataFlow-WebUI\nDataFlow provides a **Web-based UI (WebUI)** for visual pipeline construction and execution.\n\u003Cdiv style=\"display: flex; gap: 12px;\">\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb4f172d6-7753-4121-b981-55046a7a9e43\" width=\"45%\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb2147987-3b1e-4f56-9818-3d5e7440fa58\" width=\"45%\" \u002F>\n\u003C\u002Fdiv>\n\n\nTo start `DataFlow-WebUI`, simply run the following command after installing the DataFlow main repository:\n```bash\ndataflow webui\n```\n\nThis will automatically download and launch the latest **DataFlow-WebUI** and open it in your browser (`http:\u002F\u002Flocalhost:8000\u002F` if it does not open automatically).\n\n#### 📚 7.5.1 WebUI Documentation\n\n* Chinese: 
[DataFlow-WebUI Documentation: https:\u002F\u002Fwcny4qa9krto.feishu.cn\u002Fwiki\u002FF4PDw76uDiOG42k76gGc6FaBnod](https:\u002F\u002Fwcny4qa9krto.feishu.cn\u002Fwiki\u002FF4PDw76uDiOG42k76gGc6FaBnod)\n* English: [DataFlow-WebUI Documentation: https:\u002F\u002Fwcny4qa9krto.feishu.cn\u002Fwiki\u002FSYELwZhh9ixcNwkNRnhcLGmWnEg](https:\u002F\u002Fwcny4qa9krto.feishu.cn\u002Fwiki\u002FSYELwZhh9ixcNwkNRnhcLGmWnEg)\n\n#### 🛠️ 7.5.2 Development Repository\n\n* [https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow-webui](https:\u002F\u002Fgithub.com\u002FOpenDCAI\u002FDataFlow-webui)\n\n\n## 🧪 8. Experimental Results\n\n### 8.1 DataFlow-Instruct-10k\n**DataFlow-Instruct-10K** is a unified multi-domain instruction dataset generated by the DataFlow framework. It is constructed through several automated data preparation pipelines spanning mathematical reasoning, code, and general text instructions. Each pipeline follows a generate–evaluate–filter–refine workflow to synthesize and curate high-quality instruction–response pairs. The resulting dataset contains approximately 10K samples and provides high-quality supervision for instruction tuning, enabling base models to approach the performance of fully trained instruct models with significantly fewer training examples. 
\n\nFor the detailed experimental settings, please see our [DataFlow Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.16676).\n\n\n| Model | Math-Avg | Code-Avg | Knowledge-Avg |\n|------|------|------|------|\n| **Qwen2-7B Series** ||||\n| Base | 20.1 | 66.3 | 76.2 |\n| + Infinity-Instruct-10K | 29.0 | 67.8 | 76.2 |\n| + Infinity-Instruct-1M | 27.9 | **68.2** | **76.2** |\n| + **DataFlow-Instruct-10K** | **32.4** | 66.2 | 76.1 |\n| **Qwen2.5-7B Series** ||||\n| Base | 37.1 | 76.5 | 76.0 |\n| + Infinity-Instruct-10K | 22.6 | 77.6 | 75.8 |\n| + Infinity-Instruct-1M | 33.3 | 78.0 | 75.8 |\n| + **DataFlow-Instruct-10K** | **46.7** | **78.6** | **76.2** |\n\n\n\u003Cdetails>\n\u003Csummary>\u003Ch3>🛠️ 8.2 Other Pipeline Results (Click to expand)\u003C\u002Fh3>\u003C\u002Fsummary>\n\n#### 8.2.1 Text Pipeline\n\n##### 8.2.1.1 Pre-training data filter pipeline\n\nFrom the SlimPajama-627B corpus, we extract a 100B-token subset and apply multiple DataFlow text-pretraining filters. We train a Qwen2.5-0.5B model from scratch for 30B tokens using the Megatron-DeepSpeed framework. The results are as follows:\n\n| Methods | ARC-C | ARC-E | MMLU | HellaSwag | WinoGrande | Gaokao-MathQA | Avg |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| **Random-30B** | 25.26 | 43.94 | 27.03 | 37.02 | 50.99 | 27.35 | 35.26 |\n| **Qurating-30B** | 25.00 | 43.14 | 27.50 | 37.03 | 50.67 | 26.78 | 35.02 |\n| **FineWeb-Edu-30B** | 26.45 | 45.41 | 27.41 | 38.06 | 50.43 | 25.64 | 35.57 |\n| **DataFlow-30B** | 25.51 | 45.58 | 27.42 | 37.58 | 50.67 | 27.35 | **35.69** |\n\n##### 8.2.1.2 SFT data filter and synthesis pipeline\n\nTo study small-scale SFT data quality, we fine-tune the Qwen2.5-7B base model using LLaMA-Factory on the WizardLM and Alpaca datasets. For each dataset, we compare a randomly sampled set of 5K instances against a set of 5K instances filtered by DataFlow's SFT pipeline. 
Additionally, we synthesize a 15k-size dataset, DataFlow-SFT-15K, using DataFlow’s Condor Generator and Condor Refiner pipeline, followed by DataFlow’s SFT filtering pipeline (excluding the Instagram filter). Benchmarks include comprehensive Math, Code, and Knowledge evaluation suites.\n\n#### 8.2.2 Math Benchmarks\n\n| Methods | math | gsm8k | aime24 | minerva | olympiad | Avg |\n| --- | --- | --- | --- | --- | --- | --- |\n| **Alpaca (random)** | 54.9 | 77.2 | 13.3 | 14.0 | 27.0 | 37.3 |\n| **Alpaca (filtered)** | 60.3 | 80.0 | 13.3 | 14.7 | 30.7 | 39.8 |\n| **WizardLM (random)** | 61.1 | 84.2 | 6.7 | 18.0 | 29.3 | 39.9 |\n| **WizardLM (filtered)** | 69.7 | 88.8 | 10.0 | 19.9 | 35.4 | 44.8 |\n| **DataFlow-SFT-15K (random)** | 72.6 | 89.6 | 13.3 | 37.9 | 32.9 | **49.3** |\n| **DataFlow-SFT-15K (filtered)** | 73.3 | 90.2 | 13.3 | 36.0 | 35.9 | **49.7** |\n\n#### 8.2.3 Code Benchmarks\n\n| Methods | HumanEval | MBPP | Avg |\n| --- | --- | --- | --- |\n| **Alpaca (random)** | 71.3 | 75.9 | 73.6 |\n| **Alpaca (filtered)** | 73.8 | 75.7 | 74.8 |\n| **WizardLM (random)** | 75.6 | 82.0 | **78.8** |\n| **WizardLM (filtered)** | 77.4 | 80.4 | **78.9** |\n| **DataFlow-SFT-15K (random)** | 79.9 | 75.9 | 77.9 |\n| **DataFlow-SFT-15K (filtered)** | 82.9 | 74.9 | **78.9** |\n\n#### 8.2.4 Knowledge Benchmarks\n\n| Methods | MMLU | C-EVAL | Avg |\n| --- | --- | --- | --- |\n| **Alpaca (random)** | 71.8 | 80.0 | 75.9 |\n| **Alpaca (filtered)** | 71.8 | 80.0 | 75.9 |\n| **WizardLM (random)** | 71.8 | 79.2 | 75.5 |\n| **WizardLM (filtered)** | 71.9 | 79.6 | 75.8 |\n| **DataFlow-SFT-15K (random)** | 72.1 | 80.0 | **76.1** |\n| **DataFlow-SFT-15K (filtered)** | 72.2 | 80.4 | **76.3** |\n\n#### 8.2.5 Conversation Synthesis Pipeline\n\nWe synthesize DataFlow-Chat-15K using DataFlow's conversation-generation pipeline and fine-tune Qwen2.5-7B-Base on it. Baselines include ShareGPT-15K, UltraChat-15K, and their full (non-truncated) versions. 
We evaluate on domain-specific tasks (TopDial, Light) and general benchmarks (MMLU, AlpacaEval, Arena-Hard).\n\n##### 8.2.5.1 Conversation Benchmarks\n\n| Model | TopDial | Light | Avg |\n| --- | --- | --- | --- |\n| **Qwen2.5-7B** | 7.71 | 7.79 | 7.75 |\n| **+ ShareGPT-15K** | 7.75 | 6.72 | 7.24 |\n| **+ UltraChat-15K** | 7.72 | 6.83 | 7.28 |\n| **+ DataFlow-Chat-15K** | **7.98** | **8.10** | **8.04** |\n\n##### 8.2.5.2 General Benchmarks\n\n| Model | MMLU | AlpacaEval | Arena-Hard | Avg |\n| --- | --- | --- | --- | --- |\n| **Qwen2.5-7B** | 71.45 | 7.05 | 0.60 | 26.36 |\n| **+ ShareGPT-15K** | 73.09 | 3.70 | 1.30 | 26.03 |\n| **+ UltraChat-15K** | 72.97 | 3.97 | 0.80 | 25.91 |\n| **+ DataFlow-Chat-15K** | 73.41 | **10.11** | 1.10 | **28.21** |\n\n#### 8.2.6 Reasoning Pipeline\n\nWe adopt the NuminaMath dataset as a high-quality seed dataset. We compare three training sources: (1) a random 10K subset from Open-R1, (2) a random 10K subset from Synthetic-1, and (3) our 10K synthesized DataFlow-Reasoning-10K dataset constructed using DataFlow.\n\n| Setting | Model | gsm8k | math | amc23 | olympiad | gaokao24_mix | minerva | AIME24@32 | AIME25@32 | Avg |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Baseline | **Qwen2.5-32B-Instruct** | 95.8 | 73.5 | 70.0 | 38.5 | 42.9 | 26.5 | 16.8 | 11.6 | 46.95 |\n| 1 Epoch | **+ SYNTHETIC-1-10k** | 92.9 | 71.8 | 52.5 | 38.4 | 23.1 | 24.3 | 35.6 | 34.0 | 46.6 |\n| 1 Epoch | **+ Open-R1-10k** | 91.5 | 72.3 | 65.0 | 38.4 | 20.9 | 24.6 | 43.0 | 33.5 | 48.7 |\n| 1 Epoch | **+ DataFlow-Reasoning-10K** | 93.9 | 72.3 | 72.5 | 38.7 | 38.5 | 26.5 | 35.9 | 34.5 | **51.6** |\n| 2 Epochs | **+ SYNTHETIC-1-10k** | 94.5 | 78.4 | 75.0 | 45.0 | 24.2 | 28.3 | 48.4 | 37.9 | 54.0 |\n| 2 Epochs | **+ Open-R1-10k** | 93.9 | 77.2 | 80.0 | 44.1 | 20.9 | 25.4 | 51.0 | 40.7 | 54.2 |\n| 2 Epochs | **+ DataFlow-Reasoning-10K** | 94.4 | 76.6 | 75.0 | 45.2 | 42.9 | 25.7 | 45.4 | 40.0 | **55.7** |\n\n#### 8.2.7 Code Pipeline\n\nWe 
randomly sample 20k instances from the Ling-Coder-SFT corpus and process them through the DataFlow Code Pipeline. This yields three curated code instruction datasets of different scales, DataFlow-Code-1K, DataFlow-Code-5K, and DataFlow-Code-10K, each designed to provide high-quality, pipeline-refined supervision signals for code generation tasks. We compare our synthesized datasets against Code-Alpaca-1k and Self-OSS-Instruct-SC2-Exec-Filter-1k.\n\n##### 8.2.7.1 Trained on Qwen2.5-7B-Instruct\n\n| Training Data | BigCodeBench | LiveCodeBench (v6) | CruxEval (I) | CruxEval (O) | HumanEval+ | Avg |\n| --- | --- | --- | --- | --- | --- | --- |\n| **Qwen2.5-7B-Instruct** | 35.3 | 23.4 | 44.8 | 43.9 | 72.6 | 44.0 |\n| **+ Code Alpaca-1K** | 33.3 | 18.7 | 45.6 | 46.4 | 66.5 | 42.1 |\n| **+ Self-OSS** | 31.9 | 21.4 | 46.9 | 45.9 | 70.1 | 43.2 |\n| **+ DataFlow-Code-1K** | 35.5 | 25.7 | 48.0 | 45.1 | 72.6 | 45.4 |\n| **+ DataFlow-Code-5K** | 36.2 | **26.4** | 48.6 | 45.0 | 73.2 | 45.9 |\n| **+ DataFlow-Code-10K** | **36.8** | 26.0 | **48.8** | **45.4** | **73.8** | **46.2** |\n\n##### 8.2.7.2 Trained on Qwen2.5-14B-Instruct\n\n| Training Data | BigCodeBench | LiveCodeBench (v6) | CruxEval (I) | CruxEval (O) | HumanEval+ | Avg |\n| --- | --- | --- | --- | --- | --- | --- |\n| **Qwen2.5-14B-Instruct** | 37.5 | 33.4 | 48.0 | 48.5 | 74.4 | 48.4 |\n| **+ Code Alpaca-1K** | 37.0 | 28.2 | 50.2 | 49.6 | 71.3 | 47.3 |\n| **+ Self-OSS** | 36.9 | 22.3 | 52.6 | 50.1 | 68.3 | 46.0 |\n| **+ DataFlow-Code-1K** | 41.4 | **33.7** | 51.0 | 50.9 | **77.3** | 50.9 |\n| **+ DataFlow-Code-5K** | 41.1 | 33.2 | 52.5 | 50.6 | 76.2 | 50.7 |\n| **+ DataFlow-Code-10K** | **41.9** | 33.2 | **52.9** | **51.0** | 76.2 | **51.0** |\n\n\u003C\u002Fdetails>\n\n## 📄 9. 
Publications\n\nOur team has published the following papers that form core components of the DataFlow system:\n\n| Paper Title                                                                                                             | DataFlow Component                                                                            | Venue | Year |\n| ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ----- | ---- |\n| [AgenticRAGTracer: A Clear and Stepwise-Process Benchmark for Agentic RAG](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.19127v1) | Agentic RAG Data Synthesis | ACL Findings  | 2026 |\n| [Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13903)  | Text2SQL Data Augmentation   | ICDE   | 2026 |\n| [Let's Verify Math Questions Step by Step](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13903)                                           | Math question quality evaluation                                                              | KDD   | 2026 |\n| [MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13383)           | Multimodal reasoning verification framework for data processing and evaluation                | ACL   | 2025 |\n| [Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.08102) | Multi-actor collaborative data selection mechanism for enhanced data filtering and processing | ACL   | 2025 |\n\n\n**Contributing Institutions**:\n\u003Cimg src=\".\u002Fstatic\u002Flogo\u002Fpku.png\" alt=\"PKU\" height=\"30\"\u002F>\n\u003Cimg src=\".\u002Fstatic\u002Flogo\u002Fhkust.png\" alt=\"HKUST\" height=\"30\"\u002F>\n\u003Cimg
src=\".\u002Fstatic\u002Flogo\u002FCAS.png\" alt=\"CAS\" height=\"30\"\u002F>\n\u003Cimg src=\".\u002Fstatic\u002Flogo\u002Fshanghai_ailab.png\" alt=\"Shanghai AI Lab\" height=\"30\"\u002F>\n\u003Cimg src=\".\u002Fstatic\u002Flogo\u002Fbaichuan.png\" alt=\"Baichuan\" height=\"30\"\u002F>\n\u003Cimg src=\".\u002Fstatic\u002Flogo\u002Fant_group.png\" alt=\"Ant Group\" height=\"30\"\u002F>\n\n## 🏆 10. Awards & Achievements\n\nWe are honored to have received **first-place awards** in two major international AI competitions, recognizing the excellence and robustness of DataFlow and its reasoning capabilities:\n\n| Competition                                                               | Track                                                       | Award                          | Organizer                                                 | Date            |\n| ------------------------------------------------------------------------- | ----------------------------------------------------------- | ------------------------------ | --------------------------------------------------------- | --------------- |\n| **ICML 2025 Challenges on Automated Math Reasoning and Extensions** | Track 2:*Physics Reasoning with Diagrams and Expressions* | 🥇**First Place Winner** | ICML AI for Math Workshop & AWS Codabench                 | July 18, 2025   |\n| **2025 Language and Intelligence Challenge (LIC)**                  | Track 2:*Beijing Academy of Artificial Intelligence*      | 🥇**First Prize**        | Beijing Academy of Artificial Intelligence (BAAI) & Baidu | August 10, 2025 |\n\n\u003Cdiv align=\"center\">\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"50%\">\n      \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F8f28e0fe-c883-42c0-b224-3693f6281a14\" alt=\"ICML 2025 Certificate\" width=\"95%\">\u003Cbr>\n      \u003Csub>\u003Cem>ICML 2025 Automated Math Reasoning Challenge — First Place 
Winner\u003C\u002Fem>\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"30%\">\n      \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F364618b6-4dfa-4c34-928f-e3da85cbd03a\" alt=\"LIC 2025 Certificate\" width=\"95%\">\u003Cbr>\n      \u003Csub>\u003Cem>BAAI Language & Intelligence Challenge 2025 — First Prize\u003C\u002Fem>\u003C\u002Fsub>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003C\u002Fdiv>\n\n\u003Ca id=\"awesome-dataflow\">\u003C\u002Fa>\n\n## 🌟 11. Awesome Work Using DataFlow & DataFlow Ecosystem\n\nThis section highlights **projects, research works, and applications** built on top of DataFlow or deeply integrated with the DataFlow ecosystem.\n\n**📌 Curated list of featured projects:**\n[[Awesome Work Using DataFlow](.\u002Fawesome_dataflow.md)]\n\nWe warmly welcome the community to contribute new entries via **Pull Requests**. 🙌 The [Detailed Guidance](https:\u002F\u002Fopendcai.github.io\u002FDataFlow-Doc\u002Fen\u002Fguide\u002Fdf_ecosystem\u002F) can help you create a DataFlow extension repository with DataFlow-CLI.\n\n## 💐 12. Acknowledgements\n\nWe sincerely thank [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU) for their outstanding work, whose powerful PDF\u002Fdocument text extraction capabilities provided essential support for our data loading process.\nWe also thank [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) for offering an efficient and user-friendly framework for large model fine-tuning, which greatly facilitated rapid iteration in our training and experimentation workflows.\nOur gratitude extends to all contributors in the open-source community, whose efforts collectively drive the development of DataFlow.\nWe thank Zhongguancun Academy for their API and GPU support.\n\n## 🤝 13.
Community & Support\n\nJoin the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!\n\n•\t📮 [GitHub Issues](..\u002F..\u002Fissues): Report bugs or suggest features\n\n•\t🔧 [GitHub Pull Requests](..\u002F..\u002Fpulls): Contribute code improvements\n\n•\t💬 Join our community groups to connect with us and other contributors!\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F52febf13-5288-4bcd-95e8-9126dffbc409\" width=\"60%\">\n\u003C\u002Fdiv>\n\n## 📜 14. Citation\n\nIf you use DataFlow in your research, please cite our paper:\n\n```bibtex\n@article{liang2025dataflow,\n  title={DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI},\n  author={Liang, Hao and Ma, Xiaochen and Liu, Zhou and Wong, Zhen Hao and Zhao, Zhengyang and Meng, Zimo and He, Runming and Shen, Chengyu and Cai, Qifeng and Han, Zhaoyang and others},\n  journal={arXiv preprint arXiv:2512.16676},\n  year={2025}\n}\n```\n\n\u003Cdiv align=\"center\">\n  \u003Csub>\n    Connect with the \n    \u003Ca href=\"https:\u002F\u002Fzwt233.github.io\u002F\" target=\"_blank\">\u003Cstrong>PKU-DCAI Research Team\u003C\u002Fstrong>\u003C\u002Fa> \n    on Xiaohongshu: \u003Cstrong>26133106768\u003C\u002Fstrong>\n  \u003C\u002Fsub>\n\u003C\u002Fdiv>\n","DataFlow is a data preparation tool built on operators and pipelines driven by the latest large language models (LLMs). It integrates data generation, cleaning, and preprocessing, and supports rapidly building efficient data pipelines. The project is written in Python, offers a concise, easy-to-use API with strong extensibility, and integrates easily into existing data science workflows. Its core features include using LLMs to raise the level of automation in data processing and providing a Gradio interface for convenient user interaction. It suits machine learning projects that need high-quality training data, especially those that depend on natural language processing tasks.",2,"2026-06-11 03:41:10","high_star"]