[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71110":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":41,"readmeContent":42,"aiSummary":43,"trendingCount":16,"starSnapshotCount":16,"syncStatus":44,"lastSyncTime":45,"discoverSource":46},71110,"data-juicer","datajuicer\u002Fdata-juicer","datajuicer","Data processing for and with foundation models!  🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷","https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002F",null,"Python",6515,377,20,36,0,22,56,122,66,113.73,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40],"data","data-analysis","data-pipeline","data-processing","data-science","data-visualization","foundation-models","instruction-tuning","large-language-models","llm","llms","multi-modal","pre-training","synthetic-data","2026-06-12 04:00:59","#  Data-Juicer: The Data Operating System for the Foundation Model Era\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fpy-data-juicer\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fpy-data-juicer?logo=pypi&color=026cad\" alt=\"PyPI\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fpy-data-juicer\">\u003Cimg src=\"https:\u002F\u002Fstatic.pepy.tech\u002Fpersonalized-badge\u002Fpy-data-juicer?period=total&units=INTERNATIONAL_SYSTEM&left_color=grey&right_color=green&left_text=downloads\" alt=\"Downloads\">\u003C\u002Fa>\n   \u003Ca 
href=\"https:\u002F\u002Fhub.docker.com\u002Fr\u002Fdatajuicer\u002Fdata-juicer\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fdocker\u002Fv\u002Fdatajuicer\u002Fdata-juicer?logo=docker&label=Docker&color=498bdf\" alt=\"Docker\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📖_Docs-Website-026cad\" alt=\"Docs\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FOperators.html\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🧩_Operators-200+-blue\" alt=\"Operators\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🍳_Recipes-50+-brightgreen\" alt=\"Recipes\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fzh_CN\u002Fmain\u002Findex_ZH.html\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🇨🇳_文档-主页-red\" alt=\"Chinese\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14755\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS'25_Spotlight-2.0-B31B1B?logo=arxiv\" alt=\"Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint?style=flat&url=https%3A%2F%2Fgist.githubusercontent.com%2FHYLcool%2Ff856b14416f08f73d05d32fd992a9c29%2Fraw%2Ftotal_cov.json&label=coverage&logo=codecov&color=4c1\" alt=\"Coverage\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>Multimodal | Cloud-Native | AI-Ready | Large-Scale \u003C\u002Fb>\n\u003C\u002Fp>\n\nData-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. 
It treats data processing as *composable infrastructure*—providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.\n\nWhether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters—no glue code required.\n\n> **Alibaba Cloud PAI** has deeply integrated Data-Juicer into its data processing products.  See **[Quickly submit a DataJuicer job](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fpai\u002Fuser-guide\u002Fquickly-submit-a-datajuicer-task)**.\n\n---\n\n## 🚀 Quick Start\n\n**Zero-install exploration**: \n- [JupyterLab Playground with Tutorials](http:\u002F\u002F8.138.149.181\u002F) \n- [Ask DJ Copilot](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs_index.html)\n\n**Install & run**:\n```bash\nuv pip install py-data-juicer\ndj-process --config demos\u002Fprocess_simple\u002Fprocess.yaml\n```\n\n**Or compose in Python**:\n```python\nfrom data_juicer.core.data import NestedDataset\nfrom data_juicer.ops.filter import TextLengthFilter\nfrom data_juicer.ops.mapper import WhitespaceNormalizationMapper\n\nds = NestedDataset.from_dict({\n    \"text\": [\"Short\", \"This passes the filter.\", \"Text   with   spaces\"]\n})\nres_ds = ds.process([\n    TextLengthFilter(min_len=10),\n    WhitespaceNormalizationMapper()\n])\n\nfor s in res_ds:\n    print(s)\n```\n\n\n---\n\n## ✨ Why Data-Juicer?\n\n### 1. Modular & Extensible Architecture\n- **200+ operators** spanning text, image, audio, video, and multimodal data\n- **Recipe-first**: Reproducible YAML pipelines you can version, share, and fork like code\n- **Composable**: Drop in a single operator, chain complex workflows, or orchestrate full pipelines\n- **Hot-reload**: Iterate on operators without pipeline restarts\n\n### 2. 
Full-Spectrum Data Intelligence\n- **Foundation Models**: Pre-training, fine-tuning, RL, and evaluation-grade curation\n- **Agent Systems**: Tool-trace cleaning, context structuring, de-identification, and quality gating\n- **RAG & Analytics**: Extraction, normalization, semantic chunking, deduplication, and data profiling\n\n\n### 3. Production-Ready Performance\n- **Scale**: Process 70B samples in 2h on 50 Ray nodes (6400 cores)\n- **Efficiency**: Deduplicate 5TB in 2.8h using 1280 cores\n- **Optimization**: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness\n- **Observability**: Built-in tracing for debugging, auditing, and iterative improvement\n\n> *⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo.* It helps more people discover the project and keeps you notified of new releases and features.\n\n---\n\n## 📰 News\n\n\u003Cdetails open>\n\u003Csummary>[2026-03-17] Release v1.5.1: \u003Cb>LaTeX OPs; Compressed Format Support; Operator Robustness Fixes\u003C\u002Fb>\u003C\u002Fsummary>\n\n* 📄 Two new LaTeX-focused mapper OPs shipped, extending Data-Juicer's document processing capabilities to handle `.tex` archives and figure contexts.\n* 🗜️ Compressed dataset format support: `json[l].gz` files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.\n* 📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.\n* 🤖 Major refactor and upgrade of data-juicer-agents completed: the project architecture and CLI\u002Fsession capabilities were comprehensively redesigned for better maintainability and extensibility. 
See [data-juicer-agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents) for more details.\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>[2026-02-12] Release v1.5.0: \u003Cb>Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs\u003C\u002Fb>\u003C\u002Fsummary>\n\n- 🚀 *Enhanced Distributed Execution Framework* -- Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.\n- 🤖 *Expanded Embodied AI Video Processing* -- Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.\n- 💪🏻 *System Performance & Developer Experience Optimizations* -- Enabled batch inference, memory\u002Flog reduction, core logic refactoring, and updated documentation\u002Ftemplates.\n- 🐳 *Critical Bug Fixes & Stability Improvements* -- Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2026-02-02] Release v1.4.6: \u003Cb>Copilot, Video Bytes I\u002FO & Ray Tracing \u003C\u002Fb>\u003C\u002Fsummary>\n\n- 🤖 *Q&A Copilot* -- Now live on our [Doc Site](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Findex.html) | [DingTalk](https:\u002F\u002Fqr.dingtalk.com\u002Faction\u002Fjoingroup?code=v1,k1,N78tgW54U447gJP5aMC95B6qgQhlkVQS4+dp7qQq6MpuRVJIwrSsXmL8oFqU5ajJ&_dt_no_comment=1&origin=11?) | [Discord](https:\u002F\u002Fdiscord.gg\u002FngQbB9hEVK). Feel free to ask anything related to the Data-Juicer ecosystem!  
\n    - Check 🤖 [Data-Juicer Agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain) | 📃 [Deploy-ready code](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain\u002Fqa-copilot) | 🎬 [More demos](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain\u002Fqa-copilot\u002FDEMO.md) for more details.\n- 🎬 *Video Bytes I\u002FO* -- Direct bytes processing for video pipelines  \n- 🫆 *Ray Mode Tracer* -- Track changed samples in distributed processing  \n- 🐳 *Enhancements & fixes* -- Refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug\u002Fdoc fixes.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2026-01-15] Release v1.4.5: \u003Cb>20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade\u003C\u002Fb> \u003C\u002Fsummary>\n\n- *Embodied-AI OPs*: added\u002Fenhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus *S3 upload\u002Fdownload*.\n- *New Pipeline OP*: compose multiple OPs into one pipeline; introduced *Ray + vLLM* pipelines for LLM\u002FVLM inference.\n- *Docs upgrade*: moved to a unified *Sphinx-based* documentation build\u002Fdeploy workflow with isolated theme\u002Farchitecture repo.\n- *Enhancements & fixes*: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes. 
\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2025-12-01] Release v1.4.4: \u003Cb>NeurIPS'25 Spotlight, 6 New Video\u002FMM OPs & S3 I\u002FO\u003C\u002Fb> \u003C\u002Fsummary>\n\n- NeurIPS'25 **Spotlight** for Data-Juicer 2.0\n- *Repo split*: sandbox\u002Frecipes\u002Fagents moved to standalone repos\n- *S3 I\u002FO* added to loader\u002Fexporter\n- *6 new video & multimodal OPs* (character detection, VGGT, whole-body pose, hand reconstruction) + docs\u002FRay\u002Fvideo I\u002FO improvements and bug fixes\n\u003C\u002Fdetails>\n\nView [All Releases](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Freleases) and [News Archive](docs\u002Fnews.md)\n\n---\n\n## 🔌 Users & Ecosystems\n> The list below focuses on *developer-facing integrations and usage*, in *alphabetical order*.  \n> Missing your project \u002F name? Feel free to [open a PR](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fpulls) or [reach out](#contributing--community).\n\nData-Juicer plugs into your existing stack and evolves with community contributions:\n\n### Extensions\n- **[data-juicer-agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents)** -- DJ Copilot and agentic workflows  \n- **[data-juicer-hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** -- Community recipes and best practices  \n- **[data-juicer-sandbox](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-sandbox)** -- Data-model co-development with feedback loops  \n\n\n### Frameworks & Platforms\n[AgentScope](https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002Fagentscope) · [Apache Arrow](https:\u002F\u002Fgithub.com\u002Fapache\u002Farrow) · [Apache HDFS](https:\u002F\u002Fhadoop.apache.org\u002Fdocs\u002Fstable\u002Fhadoop-project-dist\u002Fhadoop-hdfs\u002FHdfsUserGuide.html) · [Apache Hudi](https:\u002F\u002Fhudi.apache.org\u002F) · [Apache Iceberg](https:\u002F\u002Ficeberg.apache.org\u002F) · [Apache 
Paimon](https:\u002F\u002Fpaimon.apache.org\u002F) · [Alibaba PAI](https:\u002F\u002Fwww.alibabacloud.com\u002Fen\u002Fproduct\u002Fmachine-learning?_p_lc=1) · [Delta Lake](https:\u002F\u002Fdelta.io\u002F) · [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio) · [EasyAnimate](https:\u002F\u002Fgithub.com\u002Faigc-apps\u002FEasyAnimate) · [Eval-Scope](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fevalscope) · [Huawei Ascend](https:\u002F\u002Fwww.huawei.com\u002Fen\u002Fproducts\u002Fcloud-computing-dc\u002Fatlas\u002Fascend) · [Hugging Face](https:\u002F\u002Fhuggingface.co\u002F) · [LanceDB](https:\u002F\u002Flancedb.github.io\u002Flance\u002F) · [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) · [ModelScope](https:\u002F\u002Fmodelscope.cn\u002F) · [ModelScope Swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift) · [NVIDIA NeMo](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo) · [Ray](https:\u002F\u002Fdocs.ray.io\u002F) · [RM-Gallery](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FRM-Gallery) · [Trinity-RFT](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FTrinity-RFT) · [Volcano Engine](https:\u002F\u002Fwww.volcengine.com\u002F)\n\n### Industry\nAlibaba Group, Ant Group, BYD Auto, ByteDance, DTSTACK, JD.com, NVIDIA, OPPO, Xiaohongshu, Xiaomi, Ximalaya, and more.\n\n### Academia\nCAS, Nanjing University, Peking University, RUC, Tsinghua University, UCAS, Zhejiang University, and more.\n\n\n###  Contributing & Community\nWe believe in *building together*. 
Whether you're fixing a typo, crafting a new operator, or sharing a breakthrough recipe, every contribution shapes the future of data processing.\n\nWe welcome contributions at all levels: \n- **[Good First Issues](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Flabels\u002Fgood%20first%20issue)** — Add operators, improve docs, report issues, or fix bugs\n- **[Developer Guide](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FDeveloperGuide.html)** — Optimize engines, add features, or enhance core infrastructure\n- **[DJ-Hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — Share knowledge: recipes, papers, and best practices\n- **Connect**: [Slack](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fdata-juicer\u002Fshared_invite\u002Fzt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg) · [DingTalk](https:\u002F\u002Fqr.dingtalk.com\u002Faction\u002Fjoingroup?code=v1,k1,N78tgW54U447gJP5aMC95B6qgQhlkVQS4+dp7qQq6MpuRVJIwrSsXmL8oFqU5ajJ&_dt_no_comment=1&origin=11?) 
· [Discord](https:\u002F\u002Fdiscord.gg\u002FngQbB9hEVK)\n\n| Discord | DingTalk |\n|:---:|:---:|\n| \u003Cimg src=\"https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi1\u002FO1CN011Oj8CB1f8Bw5JpgJA_!!6000000003961-0-tps-762-769.jpg\" width=\"100\"> | \u003Cimg src=\"https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi3\u002FO1CN01bBPoaX1EwZsiYudtd_!!6000000000416-2-tps-656-660.png\" width=\"100\"> |\n\n\nData-Juicer is made possible by the users and community:\n- **Initiated by**: Alibaba Tongyi Lab  \n- **Co-developed with**: Alibaba Cloud PAI, Anyscale (Ray team), Sun Yat-sen University, NVIDIA (NeMo team), and [contributors worldwide](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fgraphs\u002Fcontributors)\n- **Inspired by**: Apache Arrow, Ray, Hugging Face Datasets, BLOOM, RedPajama-Data, ...\n\n---\n\n\n## Documentation\n\nFor detailed documentation, please see [here](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs_index.html).\n\n**Quick Links:**\n- **[operator zoo](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FOperators.html)** — Browse 200+ operators with examples\n- **[Agent interaction quality & bad-case](demos\u002Fagent\u002FREADME.md)** — In-repo recipe, JSONL pipeline, HTML report (`demos\u002Fagent\u002F`; operators such as `agent_bad_case_signal_mapper` are also listed in [docs\u002FOperators.md](docs\u002FOperators.md))\n- **[data-juicer-hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — Community-driven recipes and best practices\n- **[developer guide](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FDeveloperGuide.html)** — Build your own code and contribute to DJ \n- **[data-juicer-cookbook](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002Ftutorial\u002FDJ-Cookbook.html)** — resource archive\n- 
**[awesome_llm_data](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002Fawesome_llm_data)** —  “Awesome List” for data-model co-development\n\n\n---\n\n## 📄 License & Attribution\n\nData-Juicer is released under the [Apache License 2.0](LICENSE).  \nAttribution is appreciated: please use our [badge](https:\u002F\u002Fdail-wlcb.oss-cn-wulanchabu.aliyuncs.com\u002Fdata_juicer\u002Fassets\u002FDJ-Org-Logo.jpeg), or text as \"This project uses Data-Juicer: https:\u002F\u002Fgithub.com\u002Fdatajuicer\".\n\n---\n\n## 📖 Citation\nIf you find Data-Juicer useful in your work, please cite:\n\n```bibtex\n@inproceedings{djv1,\n  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},\n  author={Chen, Daoyuan and Huang, Yilun and Ma, Zhijian and Chen, Hesen and Pan, Xuchen and Ge, Ce and Gao, Dawei and Xie, Yuexiang and Liu, Zhaoyang and Gao, Jinyang and Li, Yaliang and Ding, Bolin and Zhou, Jingren},\n  booktitle={SIGMOD},\n  year={2024}\n}\n\n@article{djv2,\n  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models},\n  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Zhang, Yilei and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},\n  journal={NeurIPS},\n  year={2025}\n}\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>More Publications\u003C\u002Fb> (Click to expand)\u003C\u002Fsummary>\n\n- (ICML'25 Spotlight) [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11784)\n\n- (CVPR'25) [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04594)\n \n- (TPAMI'25) [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development 
Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.08583)\n\n- (NeurIPS'25) [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04380)\n\n- (NeurIPS'25) [MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09499)\n\n- (Benchmark Data) [HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17574)\n \n- (Benchmark Data) [DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.16915)\n\n- (Data Scaling) [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14908)\n\n\u003C\u002Fdetails>\n\n","Data-Juicer is a data processing system for the foundation model era that aims to turn raw data into AI-ready information. It provides a set of modular components for data cleaning, synthesis, and analysis, covering data processing needs across the entire AI lifecycle. Data-Juicer supports multimodal data processing, is cloud-native, and scales seamlessly to large clusters without additional glue code. It is particularly suited to scenarios such as processing large-scale pre-training corpora, managing agent interaction records, or preparing domain-specific retrieval-augmented generation (RAG) indices. The project is written in Python, is released under the Apache License 2.0, and has been deeply integrated by Alibaba Cloud PAI into its data processing products.",2,"2026-06-11 03:35:57","high_star"]