[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-3357":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":16,"starSnapshotCount":16,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},3357,"easy-dataset","ConardLi\u002Feasy-dataset","ConardLi","A powerful tool for creating datasets for LLM fine-tuning 、RAG and Eval","https:\u002F\u002Fdocs.easy-dataset.com",null,"JavaScript",14439,1471,65,109,0,7,39,230,31,113,"Other",false,"main",true,[27,28,29,30,31],"dataset","fine-tuning","javascript","llm","rag","2026-06-12 04:00:17","\u003Cdiv align=\"center\">\n\n![](.\u002Fpublic\u002F\u002Fimgs\u002Fbg2.png)\n\n\u003Cimg alt=\"GitHub Repo stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FConardLi\u002Feasy-dataset\">\n\u003Cimg alt=\"GitHub Downloads (all assets, all releases)\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdownloads\u002FConardLi\u002Feasy-dataset\u002Ftotal\">\n\u003Cimg alt=\"GitHub Release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FConardLi\u002Feasy-dataset\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-AGPL--3.0-green.svg\" alt=\"AGPL 3.0 License\"\u002F>\n\u003Cimg alt=\"GitHub contributors\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FConardLi\u002Feasy-dataset\">\n\u003Cimg alt=\"GitHub last commit\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002FConardLi\u002Feasy-dataset\">\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04009v1\" target=\"_blank\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2507.04009-b31b1b.svg\" alt=\"arXiv:2507.04009\">\n\u003C\u002Fa>\n\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F13944\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F13944\" alt=\"ConardLi%2Feasy-dataset | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\n**A powerful tool for creating fine-tuning datasets for Large Language Models**\n\n[简体中文](.\u002FREADME.zh-CN.md) | [English](.\u002FREADME.md) | [Türkçe](.\u002FREADME.tr.md)\n\n[Features](#features) • [Quick Start](#local-run) • [Documentation](https:\u002F\u002Fdocs.easy-dataset.com\u002Fed\u002Fen) • [Contributing](#contributing) • [License](#license)\n\nIf you like this project, please give it a Star⭐️, or buy the author a coffee => [Donate](.\u002Fpublic\u002Fimgs\u002Faw.jpg) ❤️!\n\n\u003C\u002Fdiv>\n\n## Overview\n\nEasy Dataset is an application specifically designed for building large language model (LLM) datasets. It features an intuitive interface, along with built-in powerful document parsing tools, intelligent segmentation algorithms, data cleaning and augmentation capabilities. The application can convert domain-specific documents in various formats into high-quality structured datasets, which are applicable to scenarios such as model fine-tuning, retrieval-augmented generation (RAG), and model performance evaluation.\n\n![](.\u002Fpublic\u002Fimgs\u002Farc3.png)\n\n## News\n\n🎉🎉 Easy Dataset Version 1.7.0 launches brand-new evaluation capabilities! 
You can effortlessly convert domain-specific documents into evaluation datasets (test sets) and automatically run multi-dimensional evaluation tasks. Additionally, it comes with a human blind test system, enabling you to easily meet needs such as vertical domain model evaluation, post-fine-tuning model performance assessment, and RAG recall rate evaluation. Tutorial: [https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1CRrVB7Eb4\u002F](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1CRrVB7Eb4\u002F)\n\n## Features\n\n### 📄 Document Processing & Data Generation\n\n- **Intelligent Document Processing**: Supports PDF, Markdown, DOCX, TXT, EPUB and more formats with intelligent recognition\n- **Intelligent Text Splitting**: Multiple splitting algorithms (Markdown structure, recursive separators, fixed length, code-aware chunking), with customizable visual segmentation\n- **Intelligent Question Generation**: Auto-extract relevant questions from text segments, with question templates and batch generation\n- **Domain Label Tree**: Intelligently builds global domain label trees based on document structure, with auto-tagging capabilities\n- **Answer Generation**: Uses LLM API to generate comprehensive answers and Chain of Thought (COT), with AI optimization\n- **Data Cleaning**: Intelligent text cleaning to remove noise and improve data quality\n\n### 🔄 Multiple Dataset Types\n\n- **Single-Turn QA Datasets**: Standard question-answer pairs for basic fine-tuning\n- **Multi-Turn Dialogue Datasets**: Customizable roles and scenarios for conversational format\n- **Image QA Datasets**: Generate visual QA data from images, with multiple import methods (directory, PDF, ZIP)\n- **Data Distillation**: Generate label trees and questions directly from domain topics without uploading documents\n\n### 📊 Model Evaluation System\n\n- **Evaluation Datasets**: Generate true\u002Ffalse, single-choice, multiple-choice, short-answer, and open-ended questions\n- **Automated Model 
Evaluation**: Use Judge Model to automatically evaluate model answer quality with customizable scoring rules\n- **Human Blind Test (Arena)**: Double-blind comparison of two models' answers for unbiased evaluation\n- **AI Quality Assessment**: Automatic quality scoring and filtering of generated datasets\n\n### 🛠️ Advanced Features\n\n- **Custom Prompts**: Project-level customization of all prompt templates (question generation, answer generation, data cleaning, etc.)\n- **GA Pair Generation**: Genre-Audience pair generation to enrich data diversity\n- **Task Management Center**: Background batch task processing with monitoring and interruption support\n- **Resource Monitoring Dashboard**: Token consumption statistics, API call tracking, model performance analysis\n- **Model Testing Playground**: Compare up to 3 models simultaneously\n\n### 📤 Export & Integration\n\n- **Multiple Export Formats**: Alpaca, ShareGPT, Multilingual-Thinking formats with JSON\u002FJSONL file types\n- **Balanced Export**: Configure export counts per tag for dataset balancing\n- **LLaMA Factory Integration**: One-click LLaMA Factory configuration file generation\n- **Hugging Face Upload**: Direct upload datasets to Hugging Face Hub\n\n### 🤖 Model Support\n\n- **Wide Model Compatibility**: Compatible with all LLM APIs that follow the OpenAI format\n- **Multi-Provider Support**: OpenAI, MiniMax, Ollama (local models), Zhipu AI, Alibaba Bailian, OpenRouter, and more\n- **Vision Models**: Support Gemini, Claude, etc. 
for PDF parsing and image QA\n\n### 🌐 User Experience\n\n- **User-Friendly Interface**: Modern, intuitive UI designed for both technical and non-technical users\n- **Multi-Language Support**: Complete Chinese, English, Turkish and Portuguese language support 🇹🇷\n- **Dataset Square**: Discover and explore public dataset resources\n- **Desktop Clients**: Available for Windows, macOS, and Linux\n\n## Quick Demo\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6ddb1225-3d1b-4695-90cd-aa4cb01376a8\n\n## Local Run\n\n### Download Client\n\n\u003Ctable style=\"width: 100%\">\n  \u003Ctr>\n    \u003Ctd width=\"20%\" align=\"center\">\n      \u003Cb>Windows\u003C\u002Fb>\n    \u003C\u002Ftd>\n    \u003Ctd width=\"30%\" align=\"center\" colspan=\"2\">\n      \u003Cb>MacOS\u003C\u002Fb>\n    \u003C\u002Ftd>\n    \u003Ctd width=\"20%\" align=\"center\">\n      \u003Cb>Linux\u003C\u002Fb>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr style=\"text-align: center\">\n    \u003Ctd align=\"center\" valign=\"middle\">\n      \u003Ca href='https:\u002F\u002Fgithub.com\u002FConardLi\u002Feasy-dataset\u002Freleases\u002Flatest'>\n        \u003Cimg src='.\u002Fpublic\u002Fimgs\u002Fwindows.png' style=\"height:24px; width: 24px\" \u002F>\n        \u003Cbr \u002F>\n        \u003Cb>Setup.exe\u003C\u002Fb>\n      \u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" valign=\"middle\">\n      \u003Ca href='https:\u002F\u002Fgithub.com\u002FConardLi\u002Feasy-dataset\u002Freleases\u002Flatest'>\n        \u003Cimg src='.\u002Fpublic\u002Fimgs\u002Fmac.png' style=\"height:24px; width: 24px\" \u002F>\n        \u003Cbr \u002F>\n        \u003Cb>Intel\u003C\u002Fb>\n      \u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" valign=\"middle\">\n      \u003Ca href='https:\u002F\u002Fgithub.com\u002FConardLi\u002Feasy-dataset\u002Freleases\u002Flatest'>\n        \u003Cimg src='.\u002Fpublic\u002Fimgs\u002Fmac.png' style=\"height:24px; width: 24px\" 
\u002F>\n        \u003Cbr \u002F>\n        \u003Cb>M\u003C\u002Fb>\n      \u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" valign=\"middle\">\n      \u003Ca href='https:\u002F\u002Fgithub.com\u002FConardLi\u002Feasy-dataset\u002Freleases\u002Flatest'>\n        \u003Cimg src='.\u002Fpublic\u002Fimgs\u002Flinux.png' style=\"height:24px; width: 24px\" \u002F>\n        \u003Cbr \u002F>\n        \u003Cb>AppImage\u003C\u002Fb>\n      \u003C\u002Fa>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### Install with NPM\n\n1. Clone the repository:\n\n```bash\n   git clone https:\u002F\u002Fgithub.com\u002FConardLi\u002Feasy-dataset.git\n   cd easy-dataset\n```\n\n2. Install dependencies:\n\n```bash\n   npm install\n```\n\n3. Start the development server:\n\n```bash\n   npm run build\n\n   npm run start\n```\n\n4. Open your browser and visit `http:\u002F\u002Flocalhost:1717`\n\n### Using the Official Docker Image\n\n1. Clone the repository:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FConardLi\u002Feasy-dataset.git\ncd easy-dataset\n```\n\n2. Modify the `docker-compose.yml` file:\n\n```yml\nservices:\n  easy-dataset:\n    image: ghcr.io\u002Fconardli\u002Feasy-dataset\n    container_name: easy-dataset\n    ports:\n      - '1717:1717'\n    volumes:\n      - .\u002Flocal-db:\u002Fapp\u002Flocal-db\n      - .\u002Fprisma:\u002Fapp\u002Fprisma\n    restart: unless-stopped\n```\n\n> **Note:** It is recommended to use the `local-db` and `prisma` folders in the current code repository directory as mount paths to maintain consistency with the database paths when starting via NPM.\n\n> **Note:** The database file will be automatically initialized on first startup, no need to manually run `npm run db:push`.\n\n3. Start with docker-compose:\n\n```bash\ndocker-compose up -d\n```\n\n4. 
Open a browser and visit `http:\u002F\u002Flocalhost:1717`\n\n### Building with a Local Dockerfile\n\nIf you want to build the image yourself, use the Dockerfile in the project root directory:\n\n1. Clone the repository:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FConardLi\u002Feasy-dataset.git\ncd easy-dataset\n```\n\n2. Build the Docker image:\n\n```bash\ndocker build -t easy-dataset .\n```\n\n3. Run the container:\n\n```bash\ndocker run -d \\\n  -p 1717:1717 \\\n  -v .\u002Flocal-db:\u002Fapp\u002Flocal-db \\\n  -v .\u002Fprisma:\u002Fapp\u002Fprisma \\\n  --name easy-dataset \\\n  easy-dataset\n```\n\n> **Note:** It is recommended to use the `local-db` and `prisma` folders in the current code repository directory as mount paths to maintain consistency with the database paths when starting via NPM.\n\n> **Note:** The database file will be automatically initialized on first startup, no need to manually run `npm run db:push`.\n\n4. Open a browser and visit `http:\u002F\u002Flocalhost:1717`\n\n## Documentation\n\n- View the demo video of this project: [Easy Dataset Demo Video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1y8QpYGE57\u002F)\n- For detailed documentation on all features and APIs, visit our [Documentation Site](https:\u002F\u002Fdocs.easy-dataset.com\u002Fed\u002Fen)\n- View the paper of this project: [Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04009v1)\n\n## Community Practice\n\n- [Complete test set generation and model evaluation with Easy Dataset](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1CRrVB7Eb4\u002F)\n- [Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge](https:\u002F\u002Fbuaa-act.feishu.cn\u002Fwiki\u002FGVzlwYcRFiR8OLkHbL6cQpYin7g)\n- [Easy Dataset Practical Guide: How to Build High-Quality 
Datasets?](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1MRMnz1EGW)\n- [Interpretation of Key Feature Updates in Easy Dataset](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1fyJhzHEb7\u002F)\n- [Foundation Models Fine-tuning Datasets: Basic Knowledge Popularization](https:\u002F\u002Fdocs.easy-dataset.com\u002Fzhi-shi-ke-pu)\n\n## Contributing\n\nWe welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:\n\n1. Fork the repository\n2. Create a new branch (`git checkout -b feature\u002Famazing-feature`)\n3. Make your changes\n4. Commit your changes (`git commit -m 'Add some amazing feature'`)\n5. Push to the branch (`git push origin feature\u002Famazing-feature`)\n6. Open a Pull Request (submit to the DEV branch)\n\nPlease ensure that tests are appropriately updated and adhere to the existing coding style.\n\n## Join Discussion Group & Contact the Author\n\nhttps:\u002F\u002Fdocs.easy-dataset.com\u002Fgeng-duo\u002Flian-xi-wo-men\n\n## License\n\nThis project is licensed under the AGPL 3.0 License - see the [LICENSE](LICENSE) file for details.\n\n## Citation\n\nIf this work is helpful, please kindly cite as:\n\n```bibtex\n@misc{miao2025easydataset,\n  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},\n  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},\n  year={2025},\n  eprint={2507.04009},\n  archivePrefix={arXiv},\n  primaryClass={cs.CL},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04009}\n}\n```\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=ConardLi\u002Feasy-dataset&type=Date)](https:\u002F\u002Fwww.star-history.com\u002F#ConardLi\u002Feasy-dataset&Date)\n\n\u003Cdiv align=\"center\">\n  \u003Csub>Built with ❤️ by \u003Ca 
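href=\"https:\u002F\u002Fgithub.com\u002FConardLi\">ConardLi\u003C\u002Fa> • Follow me: \u003Ca href=\".\u002Fpublic\u002Fimgs\u002Fweichat.jpg\">WeChat Official Account\u003C\u002Fa>｜\u003Ca href=\"https:\u002F\u002Fspace.bilibili.com\u002F474921808\">Bilibili\u003C\u002Fa>｜\u003Ca href=\"https:\u002F\u002Fjuejin.cn\u002Fuser\u002F3949101466785709\">Juejin\u003C\u002Fa>｜\u003Ca href=\"https:\u002F\u002Fwww.zhihu.com\u002Fpeople\u002Fwen-ti-chao-ji-duo-de-xiao-qi\">Zhihu\u003C\u002Fa>｜\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002F@garden-conard\">YouTube\u003C\u002Fa>\u003C\u002Fsub>\n\u003C\u002Fdiv>\n","Easy Dataset is a powerful tool designed for building large language model (LLM) datasets. It provides an intuitive user interface with built-in document parsing tools, intelligent segmentation algorithms, and data cleaning and augmentation features. The project supports multiple document formats such as PDF, Markdown, and DOCX, and can convert domain-specific documents into high-quality structured datasets suitable for model fine-tuning, retrieval-augmented generation (RAG), and model performance evaluation. Its latest version, 1.7.0, introduces new evaluation capabilities, including automated multi-dimensional evaluation tasks and a human blind test system, further strengthening support for vertical-domain model evaluation.",2,"2026-06-11 02:53:48","top_language"]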