[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79502":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":16,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":8,"pushedAt":8,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":26,"discoverSource":27},79502,"ForgeTrain","OpenBMB\u002FForgeTrain","OpenBMB",null,"Python",228,21,4,2,0,3,11,200,62.53,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:25","\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Flogo.png\" alt=\"ForgeTrain\" width=\"90%\">\n\u003C\u002Fp>\n\n# ForgeTrain\n\n### An LLM Pretraining Framework Built End-to-End by an Autonomous Agent Loop + a matching Harness scaffolding *(coming soon)*\n\n**🤖 100% AI-Authored · 🚀 44.13% MFU on H100 · 📈 +10% over Megatron-LM · ✅ Production-Validated**\n\n[English](.\u002FREADME.md) | [中文](.\u002FREADME_zh.md)\n\n---\n\n> **An LLM pretraining framework written end-to-end by an AI Agent Loop with zero human edits — plus the Harness that produced the pretraining framework *(coming soon)*.**\n>\n> **Current release: v0.1.0** (NVIDIA H100 · MiniCPM4-0.5B \u002F MiniCPM4-8B training frameworks; matching Harness *coming soon*)\n\n---\n\n## ✨ Highlights\n\n- 🤖 **100% Agent-Loop Authored** — the entire framework produced by an AI Agent running in auto-loop mode, with zero manual edits\n- 🔄 **Self-Diagnosing Agent Loop** — read reference → implement → launch job → parse logs → root-cause → patch → pass gate → commit, fully autonomous\n- 🚀 **44.13% MFU on H100** — ~10% above the Megatron-LM 
baseline (~40%), validated on 64× H100 with BF16, DP-only\n- ✅ **Production-Validated** — MiniCPM4-0.5B fully pretrained, real model weights produced (not a demo)\n- 🛠️ **GEMM + Attention kernels authored by the agent loop** — **per-op MFU up to 90%**; FlashAttention written from scratch, outperforms Transformer Engine \u002F FA3, on par with FA4\n\n---\n\n## 🗺️ Roadmap\n\n- **Reproduction live demo**\n- **Huawei MiniCPM5-1B training framework**\n- **Training framework self-generates the Harness scaffolding**\n\n---\n\n## Feature Comparison\n\n| Feature | **ForgeTrain** | Megatron-LM |\n|---------|:-:|:-:|\n| MFU on H100 (MiniCPM4-0.5B, BF16, DP) | **44.13%** | ~40% |\n| 100% AI-Authored Code | ✅ | ❌ |\n| CuTeDSL custom GEMMs (AOT C-export) | ✅ (5 GEMMs) | ❌ |\n| Custom FlashAttention (on par with FA4) | ✅ (self-built CuTeDSL impl) | ❌ (uses upstream TE \u002F FA) |\n| Checkpoint → HuggingFace export | ✅ (one script) | Manual |\n\n\u003Csub>Also supports CUDA Graph, Triton fused kernels, and comm-compute overlap out of the box.\u003C\u002Fsub>\n\n> Comparison based on Megatron-LM v0.15 on the same hardware (H100, SM90). ForgeTrain v1 is scoped to MiniCPM4-0.5B (DP-only) and MiniCPM4-8B (TP=2) × BF16; Megatron-LM supports broader model families and parallelism strategies.\n\n---\n\n## 📢 News\n\n- 📌 **[2026-05] ForgeTrain v0.1.0 released** — first public release of the training engine; the Harness that produced it is *coming soon*. 
MiniCPM4-0.5B pretrained on 64× H100, achieving **44.13% MFU**.\n\n---\n\n## Table of Contents\n\n- [Highlights](#-highlights)\n- [Roadmap](#-roadmap)\n- [Feature Comparison](#feature-comparison)\n- [News](#-news)\n- [Agent-Friendly Quick Deploy](#-agent-friendly-quick-deploy)\n- [Repository Layout](#repository-layout)\n- [Quick Start](#quick-start)\n- [Core Technology](#core-technology)\n- [Performance](#performance)\n- [Contributing](#contributing)\n- [License](#license)\n- [Acknowledgments](#acknowledgments)\n- [Citation](#citation)\n\n---\n\n## 🤖 Agent-Friendly Quick Deploy\n\n> This repo was produced by an AI Agent and is friendliest to AI Agents. **Paste the prompt below into Cursor \u002F Claude Code \u002F Codex \u002F Cline** — it will read the README, install dependencies, run the smoke test and report the MFU, without you typing commands one at a time.\n\n**🟢 5-step minimal pretraining demo (paste into your Coding Agent)**\n\n```text\nFollowing this project's exports\u002Ftrain_engine_0.5B\u002FREADME.md,\nrun a 5-step minimal pretraining demo on the current node:\n\n1. Check the environment (Python ≥ 3.11, CUDA ≥ 12.x, H100, PyTorch ≥ 2.4)\n   and install anything missing;\n2. Install the repo: pip install -e . and HF deps: pip install datasets transformers;\n3. Import smoke test:\n   PYTHONPATH=src python -c \"from training_engine_tensor import config; print('OK')\"\n4. Run 5 steps on HF GSM8K:\n   torchrun --standalone --nproc-per-node=1 \\\n     -m training_engine_tensor pretrain \\\n     --num-steps 5 --global-batch-size 1 --micro-batch-size 1 \\\n     --seq-length 4096 \\\n     --hf-dataset openai\u002Fgsm8k --hf-dataset-config main \\\n     --hf-text-template \"Question: {question}\\nAnswer: {answer}\" \\\n     --tokenizer-path openbmb\u002FMiniCPM4-0.5B \\\n     --save-dir .\u002Fcheckpoints\u002Fdemo\n5. 
Print the final loss, step time, and MFU.\n\nIf anything fails, dig into the source on your own — do not ask me.\n```\n\n> Full single-node 8× H100 and multi-node commands are in the [Quick Start](#quick-start) section below.\n\n---\n\n## Repository Layout\n\nThis repo bundles a family of subprojects in a strict **producer \u002F product** relationship:\n\n| Subdirectory | Role |\n|---|---|\n| `harness\u002F` *(coming soon)* | **Harness** — the scaffolding that drives an Agent Loop to autonomously build a training framework |\n| `exports\u002Ftrain_engine_0.5B\u002F` | **TrainingEngine (0.5B)** — produced end-to-end by `harness\u002F` *(coming soon)*; targets MiniCPM4-0.5B at 44.13% MFU on 8× H100 |\n| `exports\u002Ftrain_engine_8b\u002F`   | **TrainingEngine (8B)** — also produced by `harness\u002F` *(coming soon)*; targets MiniCPM4-8B with TP=2 \u002F DP=4 at 50.9% MFU on a single 8× H100 host |\n\n```\n harness\u002F  ──(bash agent-loop.sh, zero human input)──▶  exports\u002Ftrain_engine_0.5B\u002F\n  Harness  (coming soon)                               exports\u002Ftrain_engine_8b\u002F\n  producer (gates + prompts + control plane)           product (a runnable training framework)\n```\n\nEach subdirectory has its own README with full CLI docs, config reference, layout, performance baselines, and limitations.\n\n---\n\n## Quick Start\n\n> **Environment**: Python ≥ 3.11 · CUDA 12.x · **PyTorch ≥ 2.4** · NVIDIA H100 80GB (SM90). Full pretraining requires 8× H100; early alignment stages run on a single GPU.\n\nUse the training framework directly → `exports\u002Ftrain_engine_0.5B\u002F`\n\n---\n\n### Use the training framework\n\n**Goal**: take the ready-made framework and run pretraining on your H100s.\n\n#### 1. Install\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FForgeTrain.git\ncd ForgeTrain\u002Fexports\u002Ftrain_engine_0.5B\npip install -e .\npip install datasets transformers   # HuggingFace data path (required)\n```\n\n#### 2. 
Verify install\n\n```bash\nPYTHONPATH=src python -c \"from training_engine_tensor import config; print('OK')\"\n```\n\nExpected output: `OK`\n\n#### 3. Precompile operators (first run only; subsequent runs reuse the cache)\n\n```bash\nPYTHONPATH=src CUSTOM_GEMM=1 OP_ATTENTION=v1 \\\n    python scripts\u002Fprecompile_ops.py\n```\n\nWarms up AOT export + `cpp_extension` builds for the 5 CuTeDSL GEMMs, persisting under `${ENGINE_ROOT}\u002F.persist_cache\u002F`. Subsequent jobs reuse the cache; only a few seconds of `dlopen` cost remains.\n\n#### 4. Single-node 8× H100, bring your own HF dataset\n\n```bash\ntorchrun --standalone --nproc-per-node=8 \\\n    -m training_engine_tensor pretrain \\\n    --num-steps 200 \\\n    --global-batch-size 1280 --micro-batch-size 10 \\\n    --seq-length 4096 \\\n    --hf-dataset openai\u002Fgsm8k \\\n    --hf-dataset-config main \\\n    --hf-text-template \"Question: {question}\\nAnswer: {answer}\" \\\n    --tokenizer-path openbmb\u002FMiniCPM4-0.5B \\\n    --save-dir .\u002Fcheckpoints\u002Frun1\n```\n\n`--hf-dataset` accepts a HuggingFace Hub name (e.g. 
`openai\u002Fgsm8k`) or a local Parquet \u002F Arrow \u002F JSON directory.\n\n**Expected output (200 steps on 8× H100)**\n\nEach step logs a line like:\n\n```\n[STEP 200] loss=X.XXX | step_time=XXXms | mfu=44.XX%\n```\n\nOn 8× H100 with BF16, expect **MFU ~44%** and step time ~XXXms for GBS=1280 \u002F MBS=10 \u002F seq=4096.\n\n> Full documentation including multi-node training, checkpoint resume, and HuggingFace export → [`exports\u002Ftrain_engine_0.5B\u002FREADME.md`](.\u002Fexports\u002Ftrain_engine_0.5B\u002FREADME.md) · [`exports\u002Ftrain_engine_8b\u002FREADME.md`](.\u002Fexports\u002Ftrain_engine_8b\u002FREADME.md)\n\n---\n\n## Core Technology\n\n### `harness\u002F` — the scaffolding that lets an Agent produce a training framework *(coming soon)*\n\n- **Zero-touch Agent Loop** — `bash agent-loop.sh` runs a Coding Agent in a loop against the prompt files, with no human in the loop\n- **Two-stage gate-driven convergence**:\n  - **Stage 1 (M1-M6)** — bitwise forward \u002F backward alignment → DP=8 multi-step training → long-train statistical gate (loss rel diff \u003C 1%, MFU(standard) ≥ 36%)\n  - **Stage 2** — per-operator CUDA kernel optimization, 30 rounds per op with best-MFU election, with a DP=8 long-train integration gate on every merge\n- **Portable to any reference training stack** — Megatron-LM v0.15 \u002F torch in this reproduction, but any working stack (DeepSpeed, custom, …) honoring the same env contract drops in unchanged\n\n### `exports\u002Ftrain_engine_0.5B\u002F` — the training framework the Agent wrote (0.5B)\n\n- **CUDA Graph × 5 capture granularities** — forward \u002F step \u002F step_full \u002F step_optimizer \u002F step_nccl_opt, freely composable with BucketedGradReducer \u002F sharded optimizer \u002F wgrad-overlap\n- **Triton fused kernels** — one kernel each for CE fwd+bwd \u002F SwiGLU \u002F RMSNorm+residual \u002F RoPE \u002F fused Adam+param sync\n- **Self-explored optimization space** — the Agent enumerated and 
benchmarked CuTeDSL \u002F cuBLAS \u002F Triton \u002F TransformerEngine operator variants and dozens of comm + CUDA-Graph capture combinations in real distributed jobs, scoring both MFU and loss alignment; production defaults are the optimum the Agent picked\n\n### `exports\u002Ftrain_engine_8b\u002F` — the training framework the Agent wrote (8B)\n\n- **MiniCPM4-8B on a single 8× H100 host** — `tensor_model_parallel_size=2`, **50.9% MFU**, ~8% above the Megatron-LM baseline\n\n---\n\n## Performance\n\n| Metric | ForgeTrain | Megatron-LM (baseline) |\n|---|---|---|\n| **MFU** | **44.13%** | ~40% |\n| **MFU improvement** | **+10%** | — |\n\n> Test conditions: MiniCPM4-0.5B · 64× H100 · BF16 · DP-only.\n\n> Full performance guide → [`exports\u002Ftrain_engine_0.5B\u002FREADME.md`](.\u002Fexports\u002Ftrain_engine_0.5B\u002FREADME.md) · [`exports\u002Ftrain_engine_8b\u002FREADME.md`](.\u002Fexports\u002Ftrain_engine_8b\u002FREADME.md)\n\n---\n\n## Contributing\n\nContributions are welcome! 
Here are some ways to help:\n\n- 🐛 **Bug reports** — [open an issue](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FForgeTrain\u002Fissues)\n- 💡 **Feature requests** — [open an issue](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FForgeTrain\u002Fissues) with `[feature]` in the title\n- 📝 **Reproducibility reports** — share your experience reproducing the Agent Loop on different setups\n- 🔧 **Pull requests** — code improvements, documentation fixes, and new model support\n\n---\n\n## License\n\nLicensed under the [Apache License 2.0](.\u002FLICENSE).\n\nThe vendored `exports\u002Ftrain_engine_0.5B\u002Fsrc\u002Fquack\u002F` and `exports\u002Ftrain_engine_8b\u002Fsrc\u002Fquack\u002F` snapshots retain their upstream copyright headers; see [`exports\u002Ftrain_engine_0.5B\u002Fsrc\u002Fquack\u002FNOTICE.md`](.\u002Fexports\u002Ftrain_engine_0.5B\u002Fsrc\u002Fquack\u002FNOTICE.md) and [`exports\u002Ftrain_engine_8b\u002Fsrc\u002Fquack\u002FNOTICE.md`](.\u002Fexports\u002Ftrain_engine_8b\u002Fsrc\u002Fquack\u002FNOTICE.md) for provenance. 
Built on a reference training stack (Megatron-LM v0.15 in this reproduction) and the Cursor Coding Agent; data and tokenizers follow MiniCPM4-0.5B \u002F MiniCPM4-8B upstream.\n\n---\n\n## Acknowledgments\n\nForgeTrain builds on the work of several outstanding open-source projects:\n\n- [CUTLASS](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass) \u002F [CuTeDSL](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass) — CuTeDSL GEMMs and helper utilities\n- [FlashAttention-4](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) (Dao-AILab) — FA4 CuTeDSL SM90 attention kernels\n- [TransformerEngine](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTransformerEngine) (NVIDIA) — reference operator implementations\n- [Megatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) (NVIDIA) — reference training stack for gate verification\n- [Cursor](https:\u002F\u002Fcursor.com) — the Coding Agent that authored the engine\n- [MiniCPM4](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM) (OpenBMB) — target model architecture and tokenizer\n\n---\n\n## Citation\n\nIf you find this project useful, please consider citing:\n\n```bibtex\n@software{forgetrain_2026,\n  title   = {ForgeTrain: An LLM Pretraining Framework Built End-to-End by an Autonomous Agent Loop},\n  year    = {2026},\n  url     = {https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FForgeTrain}\n}\n```\n\n---\n\n## Hardware \u002F Software Baseline\n\n| Item | Requirement |\n|---|---|\n| GPU | NVIDIA H100 80GB (SM90, Hopper) |\n| GPU count | 8× H100 for full gates \u002F pretraining; 1× H100 for early alignment |\n| CUDA | 12.x |\n| PyTorch | ≥ 2.4 |\n| Python | ≥ 3.11 |\n| Validated scope | MiniCPM4-0.5B (DP-only) \u002F MiniCPM4-8B (TP=2) × H100 × BF16 |\n","ForgeTrain is an LLM pretraining framework built by an autonomous AI agent loop. Its core features include code generated entirely by the AI agent with no human edits, and 44.13% MFU on NVIDIA 
H100 (about 10% above the Megatron-LM baseline). The project uses a self-diagnosing loop to automate the full workflow from reading the reference to committing code, and supports efficient implementations of GEMM and attention kernels. It suits large-scale language model pretraining scenarios that need high performance and efficiency, especially where maximum hardware utilization and production-validated reliability are the goal.","2026-06-11 03:58:07","CREATED_QUERY"]