[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1860":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},1860,"LoongForge","baidu-baige\u002FLoongForge","baidu-baige","A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.","https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Findex.html",null,"Python",273,27,3,9,0,31,90,119,93,94.34,"Apache License 2.0",false,"master",true,[27,28,29,30,31,32,33,34],"ai","diffusion","infra","llm","training","vla","vlm","wan","2026-06-12 04:00:11","\u003Cdiv align=\"center\">\n\n**English** | [中文](.\u002FREADME_zh.md)\n\n\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\"  srcset=\".\u002Fdocs\u002Fassets\u002Fimages\u002Flogo\u002Fbanner-dark.svg\">\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\".\u002Fdocs\u002Fassets\u002Fimages\u002Flogo\u002Fbanner.svg\">\n    \u003Cimg alt=\"LoongForge\" src=\".\u002Fdocs\u002Fassets\u002Fimages\u002Flogo\u002Fbanner.svg\" width=\"520\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch4>A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.\u003C\u002Fh4>\n\n\u003Cp 
align=\"center\">\n\n[![Docs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocs-Latest-blue?style=for-the-badge&logo=readthedocs)](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Findex.html)\n[![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Visit-FF6B35.svg?style=for-the-badge)](https:\u002F\u002Fbaidu-baige.github.io\u002FLoongForge\u002Fblog\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fbaidu-baige\u002FLoongForge.svg?style=for-the-badge&logo=github)](https:\u002F\u002Fgithub.com\u002Fbaidu-baige\u002FLoongForge\u002Fblob\u002Fmaster\u002FLICENSE)\n[![Slack](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSlack-Join-4A154B.svg?style=for-the-badge&logo=slack)](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fbaiduloongforge\u002Fshared_invite\u002Fzt-3ys3kaq2p-cmdw0nDoaHGOcKibgys5Yw)\n[![WeChat](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-Join-07C160.svg?style=for-the-badge&logo=wechat)](#contact)\n\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>🚀 Up to \u003Ca href=\"#benchmark\">5.04× training speedup\u003C\u002Fa>\u003C\u002Fb> &nbsp;·&nbsp;\n  \u003Cb>🌐 Native NVIDIA GPU & Kunlun XPU support\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n## 💡 Why LoongForge?\n\n> 🐉 LoongForge is part of Baidu Baige's **Loong** open-source series — named after the traditional Chinese **loong boat (龙舟)**, a symbol of coordinated power and forward momentum.\n\n**LoongForge** is a unified training framework for **LLMs, VLMs, VLAs, and diffusion models**, covering **pre-training**, **continued pre-training**, and **SFT**. Built upon Megatron-LM with deep systemic enhancements across **model coverage**, **training performance**, and **hardware support**, it delivers **significant speedups over mainstream open-source baselines**.\n\nBefore going open-source, LoongForge was developed as **AIAK-Training-LLM**, Baidu Baige's training acceleration stack. 
It has supported production training for enterprise customers across **Education**, **Computer Vision**, and **Embodied AI**, typically delivering **30%~50% speedup over customer baselines**, with the largest production runs reaching **5,000+ XPUs**.\n\n## 🔥 Latest News\n\n- **[2026\u002F05]** ⚡ Accelerated **Wan 2.2** training by **116%**, and added CP and data packing support.\n- **[2026\u002F05]** ✨ Added training support for **Kimi K2.5 \u002F K2.6**, and introduced **INT4 \u002F NVFP4** PTQ.\n- **[2026\u002F05]** 🎉 **v0.1.0** — first official tagged release of LoongForge.\n- **[2026\u002F05]** 🌟 Powered the training and public release of **LLaVA-OneVision-2.0**.\n- **[2026\u002F05]** 🤖 Expanded VLA coverage with **GR00T N1.6**; **60%+ speedup** on Pi0.5 and GR00T training.\n- **[2026\u002F04]** 🧩 Added training support for **MiniMax-M2.7** on both NVIDIA GPU and Kunlun XPU.\n- **[2026\u002F04]** 🚀 LoongForge source code publicly available on GitHub. [[blog]](https:\u002F\u002Fbaidu-baige.github.io\u002FLoongForge\u002Fblog\u002F2026-04-announcing-loongforge.html)\n- **[2025\u002F10]** 🌟 Powered the training and public release of **LLaVA-OneVision-1.5** under **AIAK-Training-LLM**, the predecessor of LoongForge. [[blog]](https:\u002F\u002Fbaidu-baige.github.io\u002FLoongForge\u002Fblog\u002F2025-10-llava-onevision-case-study.html)\n\n## ⚡ Quick Start\n\nSee the full documentation for installation, tutorials, and advanced usage — [English](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Findex.html) · [中文](https:\u002F\u002Floongforge.readthedocs.io\u002Fzh-cn\u002Flatest\u002Findex.html).\n\n**1. 
Install** — via [**Docker**](.\u002Fdocker) (*prebuilt images coming soon*) or **source build**:\n- **NVIDIA GPU**: [Installation Guide](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Finstallation.html)\n- **Kunlun XPU**: [Installation Guide](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fkunlun_tutorial\u002Finstall_p800.html)\n\n**2. Launch your first training run** — follow a tutorial for your target hardware and modality:\n- **NVIDIA GPU**: [LLM](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fllm_tutorial\u002Fquick_start_llm_pretrain.html) · [VLM](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fvlm_tutorial\u002Fquick_start_vlm_pretrain.html) · [VLA](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fvla_tutorial\u002Fquick_start_pi05_training.html) · [Diffusion (WAN)](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fwan_tutorial\u002Fquick_start_wan_training.html)\n- **Kunlun XPU**: [Kunlun XPU Tutorials](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fkunlun_tutorial\u002FREADME.html)\n\n**3. Explore** — browse [`configs\u002Fmodels\u002F`](.\u002Fconfigs\u002Fmodels) and [`examples\u002F`](.\u002Fexamples) \u002F [`examples_xpu\u002F`](.\u002Fexamples_xpu) for ready-to-run scripts.\n\n## ✨ Key Features\n\n* **🧩 Flexible Multi-Modal Composition** — Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.\n* **⚡ Heterogeneous Parallelism** — Independent TP \u002F DP \u002F recompute per model component (e.g., ViT vs. LLM) for optimal throughput and memory. 
[[blog](https:\u002F\u002Fbaidu-baige.github.io\u002FLoongForge\u002Fblog\u002F2026-05-loongforge-heterogeneous-parallel-training.html)]\n* **🔀 Decoupled Encoder-Decoder Training** — Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.\n* **⚖️ DP Load Balancing** — Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency. [[blog](https:\u002F\u002Fbaidu-baige.github.io\u002FLoongForge\u002Fblog\u002F2026-05-loongforge-dp-load-balancing.html)]\n* **🚀 MoE-Native Optimization** — Overlapped All2All \u002F activation offload \u002F compute, with **further memory reduction** beyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.\n* **🔬 Adaptive FP8 Training** — End-to-end FP8 for LLMs and VLMs with standard **blockwise FP8**; optional **adaptive** mode picks per-operator precision by GEMM shape and efficiency.\n* **🔧 Custom Fused Operators** — Fused kernels like **FusedDSA** for DSA-style models — TileLang version open-sourced, high-performance CUDA version available on Baidu Baige platform.\n* **🔁 Flexible Checkpointing** — Offline bidirectional **Megatron ↔ HuggingFace** conversion plus native online HF load\u002Fsave — no format barriers across your workflow.\n* **🧰 Versatile Pipelines & Data Tools** — Out-of-the-box **Pretrain \u002F MidTrain \u002F SFT \u002F LoRA**, with built-in dataset format conversion and sequence packing.\n* **🌐 Heterogeneous Hardware** — Native support for **NVIDIA GPUs** and **Kunlun XPUs** via a minimally-intrusive plugin design.\n\n> 📖 Deep-dive: [LLM features](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fllm_tutorial\u002Ffeatures_index.html) · [VLM features](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fvlm_tutorial\u002Ffeatures_index.html)\n\n\u003Ca id=\"benchmark\">\u003C\u002Fa>\n## 📊 Benchmark\n\nMeasured on **v0.1.1** across LLM, VLM, VLA and DIT workloads against mainstream open-source 
training baselines:\n\n| Model | Type | Baseline | Configuration | Speedup |\n|---|---|---|---|---|\n| Qwen3-30B-A3B | MoE | Megatron-LM\u003Csup>†\u003C\u002Fsup> | 32 × A800\u003Csup>‡\u003C\u002Fsup> · GBS 1024 · 32K | **1.16×** |\n| DeepSeek-V3.2 Lite \u003Csup>§\u003C\u002Fsup> | MoE + DSA | Megatron-LM\u003Csup>†\u003C\u002Fsup> | Reduced-layer · GBS 128 · 8K | **5.04×** |\n| Qwen3-VL-30B-A3B | VLM | VeOmni\u003Csup>†\u003C\u002Fsup> | 32 × A800\u003Csup>‡\u003C\u002Fsup> · GBS 128 · 32K | **1.45×** |\n| GR00T N1.6 | VLA | LeRobot\u003Csup>†\u003C\u002Fsup> | 8 × A800\u003Csup>‡\u003C\u002Fsup> · GBS 128 · 224×224 | **2.31×** |\n| Pi0.5 | VLA | OpenPI\u003Csup>†\u003C\u002Fsup> | 8 × A800\u003Csup>‡\u003C\u002Fsup> · GBS 112 · 224×224 | **1.65×** |\n| Wan2.2 | DIT | DiffSynth\u003Csup>†\u003C\u002Fsup> | 8 × A800\u003Csup>‡\u003C\u002Fsup> · 480×832x49 | **2.16×** |\n\n> \u003Csup>§\u003C\u002Fsup> Due to test-bed scale limits, **DeepSeek-V3.2** was validated separately on a reduced-layer configuration — LoongForge's **DSA CUDA kernel optimizations** still deliver **~5× speedup** over Megatron-LM and reach **64K sequence** (baseline OOMs beyond 8K).\u003Cbr>\n> \u003Csup>†\u003C\u002Fsup> Numbers reflect baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.\u003Cbr>\n> \u003Csup>‡\u003C\u002Fsup> Validation on additional hardware is rolling out in upcoming releases.\u003Cbr>\n\n\n## 🌟 Powered by LoongForge\n\n- [LLaVA-OneVision-2.0](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLLaVA-OneVision-2) — Next-generation multimodal model, with new VideoCaption and Spatial datasets.\n- [LLaVA-OneVision-1.5](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.23661) — Fully open framework for democratized multimodal training.\n- [Qianfan-VL](https:\u002F\u002Fgithub.com\u002Fbaidubce\u002FQianfan-VL) — Domain-Enhanced Vision-Language Models for Enterprise, 3B to 70B parameters.\n\n## 🏛️ Supported 
Models\n\nLoongForge supports a broad range of [state-of-the-art models](https:\u002F\u002Floongforge.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fsupport_model.html) across LLM, VLM, diffusion, and VLA.\n\n| **Modality** | **Architectures** | **Models** |\n|---------------|------------------|------------|\n| **LLM** | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |\n| | DeepSeek-V3 | deepseek-v3, deepseek-v32 |\n| | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |\n| | LLaMA3 | llama3-8b, llama3-70b |\n| | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |\n| | Qwen | qwen-1.8b → qwen-72b |\n| | Qwen1.5 | qwen1.5-0.5b → qwen1.5-72b |\n| | Qwen2 | qwen2-0.5b → qwen2-72b |\n| | Qwen2.5 | qwen2.5-0.5b → qwen2.5-72b |\n| | Qwen3 | qwen3-0.6b → qwen3-480b-a35b, qwen3-coder-30b-a3b |\n| | Qwen3-Next | qwen3-next-80b-a3b |\n| | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |\n| | MIMO | mimo-7b |\n| | GLM | glm5 |\n| **VLM** | Qwen2.5-VL | qwen2.5-vl-3b → qwen2.5-vl-72b |\n| | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |\n| | Qwen3.5 | qwen3.5-0.8b → qwen3.5-397b-a17b |\n| | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |\n| | Kimi-K2.5 | kimi-k2.5, kimi-k2.6 |\n| | ERNIE4.5-VL | ernie4.5vl-28b-a3b |\n| | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |\n| | InternVL2.5 | internvl2.5-8b → internvl2.5-78b |\n| | InternVL3.5 | internvl3.5-8b → internvl3.5-241b-a28b |\n| | CustomCombinedModel | Flexible ViT + LLM backbone configuration ([example](https:\u002F\u002Fgithub.com\u002Fbaidu-baige\u002FLoongForge\u002Fblob\u002Fmaster\u002Fconfigs\u002Fmodels\u002Fcustom\u002Fqwen_vit_llama3_8b.yaml)) |\n| **Diffusion** | WAN2.2 | wan2.2_i2v_a14b |\n| **VLA** | Pi | pi0.5 |\n| | GR00T | groot-n1.6 |\n\n\n## 🚀 Roadmap\n\n**Model Support**\n- LLM \u002F VLM: ongoing validation and release of new models (e.g., DeepSeek-V4)\n- Embodied AI: expanded WAM coverage (e.g., DreamZero, LingBot VA)\n\n**Performance & Scaling**\n- Adopt next-generation techniques introduced with 
DeepSeek-V4\n- Advanced MoE load-balancing strategies\n- Long-sequence training with ChunkPipe scheduling and Context Parallelism\n- Further diffusion-model acceleration (e.g., WAN)\n- INT4 quantization-aware training\n- MTP (Multi-Token Prediction) scaling for speculative decoding\n\n## 🏗️ Repository Layout\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>📁 Directory tree\u003C\u002Fb>\u003C\u002Fsummary>\n\n```\nLoongForge\u002F\n├── loongforge\u002F                   # Core training framework\n│   ├── train\u002F                    # Training entry points & trainers\n│   │   ├── pretrain\u002F             #   Pretrain (LLM, VLM)\n│   │   ├── sft\u002F                  #   SFT (LLM, VLM, InternVL, ERNIE)\n│   │   ├── diffusion\u002F            #   Diffusion (WAN)\n│   │   └── embodied\u002F             #   Embodied AI (Pi0.5, GR00T)\n│   ├── models\u002F                   # Unified model abstractions\n│   │   ├── foundation\u002F           #   LLM backbones (LLaMA, Qwen, DeepSeek, ...)\n│   │   ├── encoder\u002F              #   Vision encoders (ViT, Qwen-VL, InternVL, ...)\n│   │   ├── omni_models\u002F          #   Multi-modal composition\n│   │   ├── diffusion\u002F            #   Diffusion models (WAN)\n│   │   ├── embodied\u002F             #   Embodied models (Pi0.5, GR00T)\n│   │   └── common\u002F               #   Shared layers and utilities\n│   ├── data\u002F                     # Data pipelines (multi-modal, video, DP balance)\n│   ├── tokenizer\u002F                # Tokenizers\n│   └── utils\u002F                    # Config map, constants, etc.\n├── third_party\u002FLoong-Megatron\u002F   # Patched Megatron-LM (git submodule)\n├── configs\u002F                      # Hydra YAML configs (models, data)\n├── examples\u002F                     # GPU launch scripts\n├── examples_xpu\u002F                 # Kunlun XPU launch scripts\n├── tools\u002F                        # Checkpoint conversion, data preprocessing\n├── ops\u002F                          # 
Custom fused operators (incl. open-sourced TileLang)\n├── patches\u002F                      # TransformerEngine patches\n├── docker\u002F                       # Dockerfiles (GPU & XPU)\n├── tests\u002F                        # E2E test suite (YAML-driven)\n└── docs\u002F                         # Documentation\n```\n\n\u003C\u002Fdetails>\n\n## 🤝 Contributing\n\nWe warmly welcome community contributions — bug reports, feature proposals, and PRs alike. Please read our [Contributing Guidelines](https:\u002F\u002Fgithub.com\u002Fbaidu-baige\u002FLoongForge\u002Fblob\u002Fmaster\u002FCONTRIBUTING.md) before submitting.\n\n## 📄 License\n\nLoongForge is released under the [Apache License 2.0](https:\u002F\u002Fgithub.com\u002Fbaidu-baige\u002FLoongForge\u002Fblob\u002Fmaster\u002FLICENSE). Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.\n\n## 📝 Citation\n\n```bibtex\n@software{LoongForge2026,\n  title  = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},\n  author = {{The LoongForge Authors}},\n  year   = {2026},\n  url    = {https:\u002F\u002Fgithub.com\u002Fbaidu-baige\u002FLoongForge}\n}\n```\n\n## 🙏 Acknowledgments\n\nLoongForge is built upon NVIDIA's Megatron-LM. We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.\n\n## 💬 Contact\n\u003Ca id=\"contact\">\u003C\u002Fa>\n\nOpen a GitHub issue for questions, feedback, or feature requests. 
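You can also [join our Slack community](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fbaiduloongforge\u002Fshared_invite\u002Fzt-3ys3kaq2p-cmdw0nDoaHGOcKibgys5Yw) or scan the WeChat QR code below to join our developer community.\n\n\u003Cimg width=\"377\" alt=\"LoongForge WeChat Community\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4c69c950-f2e7-4b5e-bc9a-ffe0ebf09760\" \u002F>\n\n\n\n\n","LoongForge is a modular, scalable, high-performance training framework designed for large language models (LLMs), vision-language models (VLMs), diffusion models, and embodied-AI models. Its core capabilities include support for pre-training, continued pre-training, and fine-tuning, with deep optimizations in model coverage, training performance, and hardware support, notably native support for NVIDIA GPUs and Kunlun XPUs. It reportedly achieves up to a 5.04× training speedup over mainstream open-source baselines. LoongForge is suited to efficient training in production environments across fields such as education, computer vision, and embodied AI.",2,"2026-06-11 02:46:28","CREATED_QUERY"]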