[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77404":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},77404,"HRM-Text","sapientinc\u002FHRM-Text","sapientinc","HRM-Text is a 1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning.","",null,"Python",1193,105,23,8,0,44,156,891,183,19.08,"Apache License 2.0",false,"main",true,[27,28,29,30],"hierarchical-reasoning-model","hrm","large-language-models","pretraining","2026-06-12 02:03:43","![](.\u002Fassets\u002Fbanner.png)\n\n# HRM-Text: Efficient Pretraining Beyond Scaling\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fsapientinc.github.io\u002FHRM-Text\u002Fassets\u002FHRM_Text.pdf\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-PDF-red\" alt=\"Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fsapientinc\u002FHRM-Text-1B\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-HuggingFace-yellow\" alt=\"Model\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cstrong>🌟 Pretrain a foundation model from scratch with ~$1000. 🌠\u003C\u002Fstrong>\u003C\u002Fp>\n\nHRM-Text is a 1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning. 
It offers a full pretraining framework, making foundation model pretraining accessible with 130-600x less compute and 150-900x less data. It is built upon a hierarchical recurrent architecture, PrefixLM sequence packing, FlashAttention 3 kernels, PyTorch FSDP2 training, evaluation, and checkpoint conversion tooling.\n\n![](.\u002Fassets\u002Fbenchmark_scatter.png)\n\n## Launch the Pretraining 🚀\n\n### Required Resources\n\nChoose a target size and prepare the corresponding GPU nodes.\n\n- **L, 0.6B parameters:** 8 H100s, single node, about 50 hours (~$800).\n- **XL, 1B parameters:** 16 H100s, two nodes, about 46 hours (~$1472).\n\n*Price estimation based on $2\u002FH100 hour.*\n\nThe following are benchmark results from the reference runs.\n\n| Size | GPUs | Time | GSM8k | MATH | DROP | MMLU | ARC-C | HellaSwag | Winogrande | BoolQ |\n| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |\n| **L (0.6B)** | 8 | 50 hrs | 77.6% | 51.2% | 78.6% | 56.6% | 75.9% | 52.7% | 67.6% | 85.0% |\n| **XL (1B)** | 16 | 46 hrs | 84.7% | 56.5% | 82.3% | 60.7% | 81.9% | 63.4% | 72.4% | 86.2% |\n\n> Hopper-class GPUs are the expected training target because the attention path depends on FlashAttention 3.\n\n### 1. Prepare Data\n\nHRM-Text trains from sampled, tokenized data produced by the companion `data_io` pipeline. Use `data_io` to clean, tokenize, and stratified-sample the pretraining corpus, then point HRM-Text at the sampled output.\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsapientinc\u002Fdata_io\">\u003Cimg alt=\"data_io\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-sapientinc%2Fdata__io-181717?logo=github&logoColor=white\">\u003C\u002Fa>\n\u003C\u002Fp>\n\nRecommended setups:\n\n1. **Single node:** run the data pipeline and pretraining on the same node. After tokenization, stratified-sample into that node's shared memory at `\u002Fdev\u002Fshm\u002Fsampled`.\n2. 
**Multi-node:** keep `data_io` and the tokenized data on shared storage. Mount or expose that directory on every pretraining node, then run stratified sampling independently on each node. Sampling is fast and deterministic, so every node produces the same in-memory training data.\n\nFirst set up `data_io`, then run its pipeline. After tokenization, run stratified sampling on each training node.\n\n```bash\ncd \u003CDATA_IO_PATH>\npython sample_tokenized.py epochs=4 output_path=\u002Fdev\u002Fshm\u002Fsampled > show_analytics.md\n```\n\nHRM-Text uses 4 training epochs by default. If you change `epochs` in the training config, change the sampling command to match.\n\n### 2. Start the Environment\n\nSet up the same environment on every pretraining node.\n\n#### Recommended: Docker\n\nWe recommend running inside the published Docker image, which contains the full environment. Make sure Docker can see your GPUs, for example through the NVIDIA Container Toolkit.\n\nFrom the repo's directory:\n\n```bash\ndocker run --gpus all --ipc=host --network=host -it \\\n  -v \"$PWD\":\u002Fworkspace \\\n  sapientai\u002Fhrm-text:latest\n```\n\nFor multi-node runs, mount the same shared workspace on every node. Keeping the code, tokenized data, and checkpoint directory at identical paths avoids version drift between ranks and makes FSDP2 checkpointing straightforward. A common layout is:\n\n```text\n\u002Fshared\u002F\n|-- HRM-Text\u002F\n   |--- checkpoints\u002F\n|-- data_io\u002F\n```\n\n#### Alternative: Install from Source\n\nIf you are not using Docker, first install PyTorch, CUDA, and FlashAttention 3. The tested versions are documented in [`docker\u002FDockerfile`](docker\u002FDockerfile).\n\nThen install the Python dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n#### Check Distributed Communication\n\nFor multi-node runs, verify NCCL before starting a long job. At minimum, confirm that `torchrun` can initialize across the intended nodes. 
If your cluster provides `nccl-tests`, run both intra-node and inter-node bandwidth checks.\n\n#### Set Up W&B Tracking\n\nHRM-Text logs training metrics to [Weights & Biases](https:\u002F\u002Fwandb.ai\u002F). Log in before launching training:\n\n```bash\nwandb login\n```\n\nFor headless runs, get an API key from \u003Chttps:\u002F\u002Fwandb.ai\u002Fauthorize> and run:\n\n```bash\nwandb login \u003CAPI_KEY>\n```\n\n### 3. Launch Pretraining\n\nFor the **L**-size reference run on one 8xH100 node:\n\n```bash\nOMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \\\ntorchrun --nproc_per_node=8 pretrain.py arch\u002Fsize@arch=L lr=2.5e-4 global_batch_size=172032\n```\n\nFor the **XL**-size reference run on two 8xH100 nodes, run this on each node:\n\n```bash\nOMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \\\ntorchrun \\\n  --nproc_per_node=8 \\\n  --nnodes=2 \\\n  --node_rank=\u003CNODE_RANK> \\\n  --master_addr=\u003CMASTER_ADDR> \\\n  --master_port=\u003CMASTER_PORT> \\\n  pretrain.py\n```\n\nCheckpoints are saved every epoch under `checkpoints\u002F`. In multi-node runs, each node saves only its own shard, so we recommend writing checkpoints to shared storage.\n\n### 4. Evaluate\n\nEvaluation loads the latest checkpoint epoch automatically when `ckpt_epoch` is not provided:\n\n```bash\npython -m evaluation.main ckpt_path=\"checkpoints\u002F...\"\n```\n\nTo run a specific set of benchmarks, append `run_only=[MATH,DROP,ARC,MMLU]` to the command.\n\nEvaluation typically needs one 80 GB GPU. If evaluation runs out of memory, lower the batch size by adding `generation_config.batch_size=16`.\n\nThe evaluation scripts use Hugging Face `datasets`, so benchmark data is downloaded on demand.\n\n### 5. 
Export to Transformers Format\n\n```bash\npython -m conversion.convert_to_hf \\\n  --ckpt_path \"checkpoints\u002F...\" \\\n  --out_dir \"\u003COUTPUT_PATH>\"\n```\n\nFor evaluation and export, EMA weights are used by default when EMA is present in the checkpoint.\n\n## Status\n\n- Training, checkpointing, and evaluation are implemented in this repository.\n- Transformers-format export is implemented in [`conversion\u002Fconvert_to_hf.py`](conversion\u002Fconvert_to_hf.py).\n- Native Transformers model support is merged and scheduled for the next release.\n- Native vLLM support for HRM-Text checkpoints is in progress.\n\n## Training Overrides\n\nThe default pretraining config is [`config\u002Fcfg_pretrain.yaml`](config\u002Fcfg_pretrain.yaml):\n\nIf `project_name`, `run_name`, or `checkpoint_path` are omitted, rank 0 derives them from the dataset path, architecture name, and a generated slug.\n\nHydra overrides can be passed directly on the command line:\n\n```bash\n# Train a vanilla Transformer architecture, size L\ntorchrun --nproc_per_node=8 pretrain.py \\\n  arch\u002Fnet@arch=transformer \\\n  arch\u002Fsize@arch=L\n```\n\n## Model Configurations\n\nArchitectures live under [`config\u002Farch\u002Fnet`](config\u002Farch\u002Fnet):\n\n| Config | Model |\n| --- | --- |\n| `hrm` | HRM-Text |\n| `transformer` | Standard Transformer wrapper |\n| `trm` | Tiny Recursive Model baseline |\n| `trm_match_recurrence` | TRM configured to match HRM recurrence with half parameters |\n| `rins` | Recursive Inference Scaling (RINS) baseline |\n| `ut` | Universal Transformer baseline |\n\nSizes live under [`config\u002Farch\u002Fsize`](config\u002Farch\u002Fsize):\n\n| Config | Layers | Hidden | Heads |\n| --- | ---: | ---: | ---: |\n| `B` | 12 | 1024 | 8 |\n| `L` | 24 | 1280 | 10 |\n| `XL` | 32 | 1536 | 12 |\n| `XXL` | 72 | 1792 | 14 |\n| `XXL_wide` | 32 | 2560 | 20 |\n\nFor HRM and RINS, `half_layers: true` splits the configured layer count evenly between the H and L 
modules.\n\n## Repository Layout\n\n```text\nHRM-Text\u002F\n|-- config\u002F                       # Hydra configs for model, data, and training\n|-- conversion\u002Fconvert_to_hf.py    # FSDP2 checkpoint -> HF-style export\n|-- evaluation\u002F                    # Evaluation engines, benchmark wrappers, configs\n|-- models\u002F                        # HRM, recurrent baselines, Transformer blocks, LM head\n|-- docker\u002F                        # Tested CUDA\u002FPyTorch\u002FFlashAttention environment\n|-- dataset_new.py                 # PrefixLM packed dataset loader\n|-- multipack_sampler.py           # Distributed multipack batch sampler\n|-- pretrain.py                    # FSDP2 pretraining entrypoint\n|-- simple_inference_engine.py     # Checkpoint loader and compiled generation engine\n`-- requirements.txt\n```\n\n## Technical Notes\n\n- [`dataset_new.py`](dataset_new.py) loads sampled `tokens.npy` and per-epoch index arrays, builds PrefixLM batches, masks instruction tokens by default, and emits FlashAttention sequence metadata.\n- [`multipack_sampler.py`](multipack_sampler.py) implements distributed multipack batching with LPT allocation to improve token-slot utilization and balance quadratic attention work.\n- [`models\u002Fflash_attention_prefixlm_v2.py`](models\u002Fflash_attention_prefixlm_v2.py) implements the two-pass PrefixLM attention path: one bidirectional pass over the prefix region and one causal pass over the response region.\n- [`models\u002Flayers.py`](models\u002Flayers.py) contains RoPE, gated multi-head attention, SwiGLU MLPs, static KV cache helpers, and initialization utilities.\n- [`models\u002Fbaselines\u002Fhrm_nocarry_bp_warmup.py`](models\u002Fbaselines\u002Fhrm_nocarry_bp_warmup.py) contains the main HRM-Text architecture.\n- [`models\u002Flm_head.py`](models\u002Flm_head.py) attaches scaled embeddings, the output head, cross-entropy loss, token accuracy, and sequence exact accuracy.\n- [`pretrain.py`](pretrain.py) handles 
FSDP2 wrapping, optimizer creation, LR schedule, W&B logging, code\u002Fconfig snapshots, and distributed checkpointing.\n\n## Contributions\n\nWe welcome contributions that make HRM-Text faster, stronger, or easier to use.\n\nPlease send data-pipeline changes to the companion `data_io` project. Send model, training, inference, evaluation, conversion, infrastructure, and documentation changes here.\n\nRecommended PR categories:\n\n- **Docs and tutorials:** clarify setup, data prep, launch recipes, evaluation, or checkpoint conversion.\n- **Evaluation and inference:** add benchmark wrappers, improve generation throughput, reduce VRAM, or improve result reporting.\n- **Training infrastructure:** improve FSDP2 stability, efficiency, checkpointing, launch ergonomics, logging, or cluster portability.\n- **Model and optimizer changes:** improve the architecture, recurrence schedule, initialization, attention path, optimizer, or training hyperparameters.\n\nFor changes that alter pretraining behavior, we strongly recommend running pretraining at an appropriate scale and including downstream benchmark comparisons against the reference.\n\nFor infrastructure changes intended to be behavior-preserving, include before\u002Fafter speed, memory, or stability measurements and show that benchmark quality does not regress.\n\nFor model-quality changes, we evaluate whether the change improves the Pareto frontier of training compute versus performance. 
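Strict improvements and high-ROI changes are good candidates for defaults; valuable tradeoffs with higher cost or lower performance may belong in separate configs.\n\n## Paper\n\nThe full paper is available here:\n\n[📄 View PDF](https:\u002F\u002Fsapientinc.github.io\u002FHRM-Text\u002Fassets\u002FHRM_Text.pdf)\n\n## Citation\n\nCitation information will be added with the accompanying paper.\n\n## License\n\nApache License 2.0\n","HRM-Text is a 1-billion-parameter text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning. The project provides a complete pretraining framework that makes foundation model pretraining possible with 130-600x less compute and 150-900x less data. It uses a hierarchical recurrent architecture, PrefixLM sequence packing, FlashAttention 3 kernels, and PyTorch FSDP2 training tooling, among other techniques. HRM-Text suits scenarios that call for training a large language model from scratch efficiently and at low cost, and is especially suited to researchers or developers who want to build a high-quality text generation model on a limited budget (about $1,000).",2,"2026-06-11 03:55:25","CREATED_QUERY"]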