[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80695":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":13,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":31,"discoverSource":32},80695,"data_io","sapientinc\u002Fdata_io","sapientinc","Data pipeline for HRM-Text pretraining","",null,"Python",55,7,1,0,2,9,10,2.71,"Apache License 2.0",false,"main",true,[25,26,27],"data","large-language-models","pretraining","2026-06-12 02:04:05","![](.\u002Fassets\u002Fbanner.png)\n\n# Data IO\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.20613\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red?logo=arxiv&logoColor=white\" alt=\"arXiv Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fsapientinc\u002FHRM-Text-1B\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-HuggingFace-yellow\" alt=\"Model\">\u003C\u002Fa>\n\u003C\u002Fp>\n\nThis is the data pipeline used in the pretraining process of HRM-Text. Unlike LLM pretraining pipelines that ingest web documents for language modeling, HRM-Text Data IO produces instruction-style question-answer pairs and builds sampled tokenized datasets for training.\n\n## Overview\n\nThe pipeline consists of four main stages:\n\n1. **Data Cleaning**: Convert raw datasets into standardized instruction\u002Fresponse format\n2. **Tokenizer Training**: Train BPE tokenizer\n3. 
**Tokenization**: Convert text to token IDs using a Rust-based high-performance tokenizer\n4. **Stratified Sampling**: Create balanced training datasets with configurable sampling strategies\n\n### Directory Structure\n\n```\ndata_io\u002F\n├── pipe\u002F                     # Data cleaning scripts (legacy, small datasets)\n├── pipe_clustered\u002F           # Data cleaning scripts (large clustered datasets)\n├── raw_data\u002F                 # Raw, source datasets\n├── data\u002F                     # Cleaned legacy, small datasets (JSONL format)\n├── data_clustered\u002F           # Cleaned large-scale datasets (Parquet format)\n├── tokenizer\u002F                # Rust tokenizer implementation\n├── trained_tokenizers\u002F       # Trained tokenizers\n├── data_tokenized_*\u002F         # Tokenized output (numpy arrays and metadata)\n├── prefix_config.yaml        # Stratified sampling configuration\n└── sample_tokenized.py       # Stratified sampling & epoch creation\n```\n\n## Guidelines\n\nBefore you start, make sure that you are in the project directory and have installed the pip requirements:\n\n```bash\ncd data_io\npip install -r requirements.txt\n```\n\nInstall Rust\u002FCargo before tokenizer training or tokenization.\n\n> 💡 The cleaning scripts require ~512 GiB of RAM. You can download the [cleaned data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsapientinc\u002FHRM-Text-data-io-cleaned-20260515) to skip Raw Data Preparation and Data Cleaning and go directly to Tokenization.\n\n### Raw Data Preparation\n\nMost cleaning scripts read source datasets from the Hugging Face Hub. 
Some scripts read local files from `raw_data\u002F`.\n\nThe following local raw datasets are required before running the full cleaning pipeline:\n\n```bash\n# FLAN\nhf download Open-Orca\u002FFLAN --repo-type dataset --local-dir .\u002Fraw_data\u002FFLAN\n# SYNTH\nhf download PleIAs\u002FSYNTH --repo-type dataset --local-dir .\u002Fraw_data\u002FSYNTH\n# Platypus\nmkdir -p .\u002Fraw_data\u002FPlatypus\nhf download imone\u002FARB --repo-type dataset --local-dir .\u002Fraw_data\u002FPlatypus\u002FARB\ngit clone https:\u002F\u002Fgithub.com\u002Fmandyyyyii\u002Fscibench.git .\u002Fraw_data\u002FPlatypus\u002Fscibench\n```\n\nDownload the following archives and extract them into `raw_data\u002F`:\n\n- [amps.tar.gz](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1hQsua3TkpEmcJD_UWQx8dmNdEZPyxw23\u002Fview?usp=sharing)\n- [mathematics_dataset-v1.0.tar.gz](https:\u002F\u002Fconsole.cloud.google.com\u002Fstorage\u002Fbrowser\u002Fmathematics-dataset)\n\n### Data Cleaning\n\nTransform raw datasets into the standardized format using the cleaning scripts. Run the scripts you need from `pipe` and `pipe_clustered`, for example:\n\n```bash\npython -m pipe.clean_platypus.clean_arb\npython -m pipe.clean_gsm8k_train\npython -m pipe.clean_math_train\n# ... other cleaners\n\npython -m pipe_clustered.clean_acereason\n# ... other clustered cleaners\n```\n\nCleaned data is written to `data\u002F` and `data_clustered\u002F`.\n\n**Output Format:**\n\nJSON:\n```jsonc\n{\n  \"condition\": \"cot,noisy\",  \u002F\u002F comma-separated tags attached to this item\n  \"instruction\": \"Question or prompt text\",\n  \"response\": \"Answer or completion text\"\n}\n```\nParquet: Same fields as above, in columnar format.\n\n### (OPTIONAL) Tokenizer Training\n\nTrained tokenizers are already in `trained_tokenizers\u002F`. 
**Optional:** If you want to train a new one, run the following:\n\n```bash\n(cd tokenizer && cargo run --release --bin train_tokenizer -- ..\u002Fdata ..\u002Fdata_clustered -o ..\u002Ftrained_tokenizers\u002Fbpe\u002Ftokenizer.json)\n```\n\n### Tokenization\n\nConvert text to token IDs using the high-performance Rust tokenizer:\n\n```bash\n(cd tokenizer && cargo run --release --bin tokenizer -- ..\u002Fdata ..\u002Fdata_clustered --tokenizer-path ..\u002Ftrained_tokenizers\u002Fbpe\u002Ftokenizer.json -o ..\u002Fdata_tokenized_bpe_65k)\n```\n\nThe tokenizer supports incremental processing: when source data changes, it removes orphaned outputs and re-tokenizes new or updated files.\n\n**Output:** For each source `.jsonl` or `.parquet` file, the tokenizer creates one output subdirectory containing:\n- `tokens.npy`: Concatenated token IDs\n- `inst_start.npy`, `inst_len.npy`: Instruction boundaries\n- `resp_start.npy`, `resp_len.npy`: Response boundaries\n- `metadata.json`: For caching only (source file modification time, size)\n\nThe output root also contains `tokenizer_info.json`.\n\n### (ON TRAINING NODES ONLY) Stratified Sampling\n\n**On each node that is about to launch training**, create balanced training datasets from the tokenized dataset with stratified sampling in memory (`\u002Fdev\u002Fshm`):\n\n```bash\npython sample_tokenized.py epochs=10 > show_analytics.md\n```\n\nOverride configuration values with `key=value` arguments ([OmegaConf CLI argument format](https:\u002F\u002Fomegaconf.readthedocs.io\u002Fen\u002F2.3_branch\u002Fusage.html#id15)).\n\n**Configuration Options:**\n```python\ntokenized_path: str = \"data_tokenized_bpe_65k\" # Input directory\noutput_path: str = \"\u002Fdev\u002Fshm\u002Fsampled\"          # Output directory (RAM disk)\nprefix_config_path: str = \"prefix_config.yaml\" # Stratified sampling configuration\n\nseed: int = 0                                  # Random seed\nepochs: int = 10                               # Number of training epochs\n\ncontext_size: int = 
4096 + 1                   # Max sequence length (including +1 AR shift)\nmin_resp_length: int = 2                       # Minimum response length. All responses shorter than this will be dropped. Default: at least one content token + an EOS = 2 tokens\n```\n\n**Stratified sampling configuration file (specified in `prefix_config_path`):**\n\nThe sampler matches file prefixes in order. Once a prefix matches, the following rules apply:\n\n```python\nmax_per_file: Optional[int] = None  # Maximum rows to sample from this file per epoch\nlong_context: Literal[\"drop\", \"truncate\"] = \"truncate\"  # What to do if the context exceeds the maximum\nrepeat: int = 1  # Repeat the dataset X times. Used for upsampling small datasets\n```\n\n**Output:**\n- `tokens.npy`: Concatenated token array (memory-mapped)\n- `epoch_N\u002F`: Per-epoch index arrays (inst_start, inst_len, resp_start, resp_len)\n- `metadata.json`: Dataset statistics (vocab size, max sequence length, total tokens)\n\n**Analytics:** The script writes Markdown statistics to stdout, which is usually redirected to a file.\n\nReports include:\n- Coverage statistics by category and task\n- Total unique rows and tokens sampled\n\n## Citation\n\nIf you find this project or our paper useful, please consider citing the paper:\n\n```\n@misc{wang2026hrmtextefficientpretrainingscaling,\n      title={HRM-Text: Efficient Pretraining Beyond Scaling}, \n      author={Guan Wang and Changling Liu and Chenyu Wang and Cai Zhou and Yuhao Sun and Yifei Wu and Shuai Zhen and Luca Scimeca and Yasin Abbasi Yadkori},\n      year={2026},\n      eprint={2605.20613},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.20613}, \n}\n```\n\n## Contributions\n\nWe welcome contributions to scale and improve this pretraining pipeline! Because pretraining data quality directly impacts model performance, we require validation for all changes. 
Please align your Pull Request with one of the following categories:\n\n### 1. Optimizations (No Result Change)\n\n*For code refactoring, speedups, or memory footprint reductions.*\n\n**Rule:** The final output must remain identical to the main branch.\n**Validation Required:**\n* Provide before\u002Fafter performance metrics (execution time, peak RAM usage).\n* Prove output equivalence by verifying the checksums (e.g., SHA256) of the generated `.npy` arrays.\n\n### 2. Major Changes (Behavior Modifying)\n\n*For modifying sampling strategies, updating the tokenizer, or adding\u002Faltering datasets.*\n\n**Rule:** Any change that alters the token distribution, vocabulary, or sequence boundaries must be treated as a breaking change.\n**Validation & Benchmarking:**\n* **Analytics:** Attach the complete Markdown output generated by `sample_tokenized.py` to your PR to show dataset coverage and sampled-token counts.\n* **Model Evaluation:** We strongly recommend conducting a pretraining run (at any scale) and providing downstream benchmark results that compare the baseline with your proposed changes.\n* **Pareto Efficiency & Merging:** We evaluate data modifications based on their position on the Pareto frontier of compute cost (training tokens) versus performance.\n  * **Main Branch:** We merge highly efficient changes directly into `main`. This includes strict improvements (fewer tokens yielding better or equal performance) and high-ROI additions (a slight increase in tokens yielding a large performance jump).\n  * **Alternative Branches:** Changes that extend the frontier toward lower compute with reduced performance, or toward better performance at a high compute cost, are valuable but will be merged into separate, dedicated branches rather than `main`.\n\n### Submitting Your PR\n\nTitle your PR with a clear prefix (e.g., `[Opt]` or `[Major]`) and include the required validation proofs in the description. 
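For other types of changes, please open an issue to discuss them.\n\n## License\n\nApache 2.0\n","This project is a data pipeline for HRM-Text pretraining. Unlike conventional large language model pretraining based on web documents, it produces instruction-style question-answer pairs and builds sampled tokenized datasets. Its core functionality includes data cleaning, BPE tokenizer training, conversion of text to token IDs, and creation of balanced training datasets under a stratified sampling strategy. Technically, it features a high-performance tokenizer implemented in Rust and flexible configuration options to accommodate different sampling needs. It suits language model pretraining scenarios that need high-quality, structured input data, especially models that aim to improve performance on specific tasks such as question answering.","2026-06-11 04:01:41","CREATED_QUERY"]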