[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80696":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":5,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":14,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":11,"lastSyncTime":24,"discoverSource":25},80696,"OSP-Next","PKU-YuanGroup\u002FOSP-Next","PKU-YuanGroup",null,"Python",60,2,46,0,1,14,43.33,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:29","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Flogo.png\" alt=\"OSP-Next\" width=\"220\">\n\n### Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning\n\n**Open-Sora Plan · Next Generation**\n\nA scalable **sparse** text-to-video diffusion model, introducing **Skiparse-2D Attention**,\n**Sparse Sequence Parallelism (SSP)**, **HiF8 quantization**, and\n**Mix-GRPO + LoRA** RL post-training.\n\n\u003C\u002Fdiv>\n\n\u003Ch5 
align=\"center\">\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-OSP--Next-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.28691)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-Open--Sora%20Plan-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00131)\n[![HuggingFace](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-HuggingFace-FFD21E.svg)](https:\u002F\u002Fhuggingface.co\u002Fyunyangge\u002FOSP-Next)\n[![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-OSP--Next-624AFF.svg?logo=alibabacloud)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fbeihai123\u002FOSP-Next)\n[![GitHub repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FOSP-Next?style=flat&logo=github&logoColor=whitesmoke&label=Stars)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOSP-Next\u002Fstargazers)\n\n\u003C\u002Fh5>\n\n---\n\n## 📣 News\n\n- **[2026.05.22]** 🎉🎉🎉 We have open-sourced the **complete training & inference code** for OSP-Next together with the **model weights**. Give it a try!\n\n---\n\n## ✨ Highlights\n\nOSP-Next is a **sparse video diffusion** framework with four tightly co-designed\ncontributions — see the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.28691) for the full technical report.\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"50%\" valign=\"top\">\n\u003Ch3>🧩 &nbsp;Skiparse-2D Attention\u003C\u002Fh3>\n\nA **fixed-rule sparse attention pattern** purpose-built for image \u002F video\nmodalities, applied independently along height and width. 
It aligns better with\nspatial locality than Skiparse-1D, **approaches the quality of 3D Full\nAttention**, and stays **natively compatible with FlashAttention kernels**, with\nno custom Triton or CUDA kernels needed.\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\" valign=\"top\">\n\u003Ch3>🔗 &nbsp;Sparse Sequence Parallelism (SSP)\u003C\u002Fh3>\n\nA **parallel strategy natively co-designed with Skiparse-2D Attention**.\nCompared to Ulysses SP, SSP cuts **inter-rank communication volume by 75%** and\ndrops the per-block communication steps **from 4 down to 1** — removing the SP\nbottleneck for long-video, long-context training.\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd width=\"50%\" valign=\"top\">\n\u003Ch3>🪶 &nbsp;HiF8 Quantization &nbsp;\u003Csub>(NPU only)\u003C\u002Fsub>\u003C\u002Fh3>\n\nA **dynamic-precision HiF8** scheme (per-tensor exponent \u002F mantissa allocation)\napplied on top of the sparse model. The **first work to show that 8-bit\nquantization and sparse-model fine-tuning can be done jointly**: the VBench gap stays within ~0.5% of the baseline, and inference reaches up to **2.27× speed-up** on a single Ascend 950PR.\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\" valign=\"top\">\n\u003Ch3>🎯 &nbsp;Mix-GRPO RL on Sparse Models\u003C\u002Fh3>\n\nThe **first attempt to apply reinforcement learning to a sparse video\ngeneration model**. Our **Mix-GRPO + LoRA** pipeline shows that RL keeps\npushing the quality \u002F preference frontier of sparse models — and the entire\nsparse-model training pipeline is open-sourced for the community.\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### 📊 Performance at a glance\n\nEnd-to-end speed-ups vs. the **Wan2.1** full-attention baseline, measured on\n**5-second · 81-frame** videos at two resolution settings (Tab. 
2 in the paper):\n\n\u003Ctable>\n\u003Ctr>\n\u003Cth width=\"33%\" align=\"center\">⚡ NVIDIA H200\u003Cbr\u002F>\u003Csub>OSP-Next · BF16 · FA3 + torch.compile\u003C\u002Fsub>\u003C\u002Fth>\n\u003Cth width=\"33%\" align=\"center\">🟣 Ascend 950PR\u003Cbr\u002F>\u003Csub>OSP-Next · BF16 · SDPA\u003C\u002Fsub>\u003C\u002Fth>\n\u003Cth width=\"33%\" align=\"center\">🪶 Ascend 950PR\u003Cbr\u002F>\u003Csub>OSP-Next-HiF8 · 8-bit · SDPA\u003C\u002Fsub>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd valign=\"top\">\n\n- 720P (padded)\n  - **1.53×** single-GPU\n  - **1.42×** on 8× GPU\n- 768P (native)\n  - **1.64×** single-GPU\n  - **1.52×** on 8× GPU\n\n\u003C\u002Ftd>\n\u003Ctd valign=\"top\">\n\n- 720P (padded)\n  - **1.27×** single-NPU\n- 768P (native)\n  - **1.76×** single-NPU\n\n\u003C\u002Ftd>\n\u003Ctd valign=\"top\">\n\n- 720P (padded)\n  - **1.69×** single-NPU\n- 768P (native)\n  - **2.27×** single-NPU\n- Quality cost\n  - **only −0.4 pt** VBench vs BF16\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n> 🏆 &nbsp;OSP-Next hits a **VBench total of 83.73%** (Wan2.1 baseline: 83.69%);\n> OSP-Next-HiF8 keeps **83.29%** with only a 0.4-pt drop. Full benchmark tables,\n> ablations and qualitative comparisons are in the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.28691).\n\n> ℹ️ &nbsp;Multi-NPU 950PR numbers are not reported yet — Ascend 950PR\n> resources are currently in limited supply, so the results for this hardware\n> are restricted to a single NPU.\n\n> 🟦 &nbsp;**Bonus** — one codebase, two backends: the same training & inference\n> scripts run on **NVIDIA CUDA** *and* **Ascend NPU** — just swap\n> `pip install -e .` for `pip install -e .[npu]`.\n\n---\n\n## 🚀 Quick Start\n\nGenerate your first OSP-Next video in four commands (GPU example):\n\n```bash\n# 1. 
Clone & install\ngit clone https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOSP-Next.git && cd OSP-Next\nconda create -n ospnext python=3.10 -y && conda activate ospnext\npip install -e .\n\n# 2a. Download the OSP-Next 14B diffusion weights from our repo.\nhuggingface-cli download yunyangge\u002FOSP-Next --local-dir .\u002Fcheckpoints\u002Fosp_next_14b\n\n# 2b. OSP-Next reuses Wan 2.1's T5 text encoder and VAE — we do NOT re-host\n#     them. Grab them from the upstream Wan-AI repo (HuggingFace or ModelScope):\nhuggingface-cli download Wan-AI\u002FWan2.1-T2V-14B \\\n    models_t5_umt5-xxl-enc-bf16.pth \\\n    Wan2.1_VAE.pth \\\n    --include \"google\u002Fumt5-xxl\u002F*\" \\\n    --local-dir .\u002Fcheckpoints\u002FWan2.1-T2V-14B\n\n# 3. Edit one config file — point `pretrained_model_dir_or_checkpoint`,\n#    `vae_path`, `checkpoint_path` and `text_tokenizer_path` to the directories\n#    you just downloaded:\n$EDITOR configs\u002Finfer\u002Fgpu\u002Fosp_14b.yaml\n\n# 4. Generate!\nbash scripts\u002Finfer\u002Fgpu\u002Finfer_osp_14b.sh\n```\n\n> ⏱️  First run takes a few minutes to warm up FSDP2 + compile the kernels;\n> subsequent prompts in the same process are much faster.\n\n> 🟣 **On Ascend NPU?** Skip step 1 and follow the\n> [🟣 NPU (Ascend)](#-npu-ascend) setup first (CANN 8.5.0, `pip install -e .[npu]`,\n> source-build `decord`), then come back to steps 2–4 and swap the GPU script\n> in step 4 for its NPU equivalent under `scripts\u002Finfer\u002Fnpu\u002F`.\n\n---\n\n## 🎞️ Demo Gallery\n\nA side-by-side comparison of the same prompt across three models. 
Hit ▶ on any\ncell to play the video right inside the page.\n\n\u003C!--\n  TODO: replace each \u003CDEMO_*> placeholder with a GitHub-hosted mp4 URL, e.g.\n        https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F\u003Cuuid>\n  GitHub renders \u003Cvideo src=\"...\"> inline as long as the URL lives on a\n  github.com \u002F user-attachments \u002F githubusercontent.com domain.\n-->\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth align=\"center\">Prompt\u003C\u002Fth>\n      \u003Cth align=\"center\">Wan 2.1\u003C\u002Fth>\n      \u003Cth align=\"center\">OSP-Next\u003C\u002Fth>\n      \u003Cth align=\"center\">OSP-Next-HiF8\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd>\u003Csub>\u003Ci>\"A handheld 35mm camera holds an extreme close-up on a gray-haired, bearded man in his sixties...\"\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd4b53b9d-bbe3-4f38-9e44-332662a53318\"     width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4220f39e-95dc-443e-92fb-74c214fa56a9\"       width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ff7343d6b-94a2-45dd-9d84-c5da748371d4\"  width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Csub>\u003Ci>\"A cream and sable corgi, sporting sleek jet-black sunglasses, trots confidently along a pristine tropical beach...\"\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd 
align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd974d8e6-2f0d-4f91-9a53-d69adea9868f\"     width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F898b43b6-f17b-491e-8880-99370b1b88de\"       width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fc813726b-1815-4eba-a4af-ac4b77b44393\"  width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Csub>\u003Ci>\"A lone 30-year-old space man strides across an endless salt desert under a vast, electric-blue sky...\"\u003C\u002Fi>\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fc1b435a0-b688-4d41-ab9c-5d226c683311\"     width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F38647f5c-975b-4d17-a5e0-a9410f4133bf\"       width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F11ea6d17-534b-40da-85fe-964c7602bad8\"  width=\"240\" controls muted loop playsinline preload=\"metadata\">\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n> 💡 **HiF8 takeaway** — On a single **Ascend 950PR**, OSP-Next-HiF8 reaches\n> **1.69× \u002F 2.27× speed-up** over the BF16 baseline under the 5s 720P 
\u002F 5s 768P\n> settings, with only a **0.4 point** drop on the VBench total score.\n\n---\n\n## 📦 Model Downloads\n\n### 🧠 OSP-Next diffusion weights (hosted by us)\n\n| Model             | Params | 🤗 HuggingFace                       | \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fmodelscope.png?size=48\" height=\"14\" alt=\"ModelScope\" valign=\"middle\"> ModelScope |\n|-------------------|-------:|--------------------------------------|----------------------------------------------------------------------|\n| OSP-Next 14B      |  14B   | [`yunyangge\u002FOSP-Next`](https:\u002F\u002Fhuggingface.co\u002Fyunyangge\u002FOSP-Next) | [`beihai123\u002FOSP-Next`](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fbeihai123\u002FOSP-Next) |\n| OSP-Next-HiF8 14B |  14B   | [`yunyangge\u002FOSP-Next`](https:\u002F\u002Fhuggingface.co\u002Fyunyangge\u002FOSP-Next)  | [`beihai123\u002FOSP-Next`](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fbeihai123\u002FOSP-Next)  |\n\n> ℹ️  The `*_1_3b.yaml` configs and `*_1_3b.sh` launch scripts are kept in the\n> repository as ready-to-use templates if you want to train your own 1.3B\n> variant, but **no official 1.3B checkpoint is released** at this time.\n\n### 🔡 T5 text encoder & 🎞️ WAN VAE (hosted by Wan-AI)\n\nOSP-Next reuses the **T5 text encoder** and **WAN VAE** released with Wan 2.1\nverbatim — we do **not** re-host these weights. 
Please grab them from the\nofficial Wan-AI repository:\n\n| Component                 | File                                       | 🤗 HuggingFace                                                                                                                              | \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fmodelscope.png?size=48\" height=\"14\" alt=\"ModelScope\" valign=\"middle\"> ModelScope                                                     |\n|---------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| T5 (UMT5-XXL) weights     | `models_t5_umt5-xxl-enc-bf16.pth`          | [`Wan-AI\u002FWan2.1-T2V-14B`](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B\u002Fblob\u002Fmain\u002Fmodels_t5_umt5-xxl-enc-bf16.pth)                           | [`Wan-AI\u002FWan2.1-T2V-14B`](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.1-T2V-14B\u002Ffile\u002Fview\u002Fmaster\u002Fmodels_t5_umt5-xxl-enc-bf16.pth)                                    |\n| T5 tokenizer              | `google\u002Fumt5-xxl\u002F`                         | [`Wan-AI\u002FWan2.1-T2V-14B`](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B\u002Ftree\u002Fmain\u002Fgoogle\u002Fumt5-xxl)                                           | [`Wan-AI\u002FWan2.1-T2V-14B`](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.1-T2V-14B\u002Ffiles)                                                                               |\n| WAN VAE                   | `Wan2.1_VAE.pth`                           | 
[`Wan-AI\u002FWan2.1-T2V-14B`](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B\u002Fblob\u002Fmain\u002FWan2.1_VAE.pth)                                            | [`Wan-AI\u002FWan2.1-T2V-14B`](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.1-T2V-14B\u002Ffile\u002Fview\u002Fmaster\u002FWan2.1_VAE.pth)                                                     |\n\n> 💡 All three components live inside the same `Wan-AI\u002FWan2.1-T2V-14B` repo, so\n> a one-shot download is enough — see the snippet in the [🚀 Quick Start](#-quick-start)\n> for an example. The same files also work for the 1.3B configs (they share the\n> identical T5 \u002F VAE backbone with the 14B model).\n\n### 📌 After downloading\n\nUpdate the corresponding paths in your config (both inference and training):\n\n```yaml\nmodel_config:\n  pretrained_model_dir_or_checkpoint: \"\u002Fpath\u002Fto\u002Fosp_next_14b\"                       # ← OSP-Next ckpt\nvae_config:\n  vae_path: \"\u002Fpath\u002Fto\u002FWan2.1-T2V-14B\u002FWan2.1_VAE.pth\"                                # ← WAN VAE\ntext_encoder_config:\n  checkpoint_path: \"\u002Fpath\u002Fto\u002FWan2.1-T2V-14B\u002Fmodels_t5_umt5-xxl-enc-bf16.pth\"        # ← T5 weights\n  text_tokenizer_path: \"\u002Fpath\u002Fto\u002FWan2.1-T2V-14B\u002Fgoogle\u002Fumt5-xxl\u002F\"                   # ← T5 tokenizer\n```\n\n---\n\n## 📑 Table of Contents\n\n- [📣 News](#-news)\n- [✨ Highlights](#-highlights)\n  - [📊 Performance at a glance](#-performance-at-a-glance)\n- [🚀 Quick Start](#-quick-start)\n- [🎞️ Demo Gallery](#️-demo-gallery)\n- [📦 Model Downloads](#-model-downloads)\n- [🧱 Project Layout](#-project-layout)\n- [⚙️ Environment Setup](#️-environment-setup)\n  - [🟢 GPU (NVIDIA CUDA)](#-gpu-nvidia-cuda)\n    - [1. Install OSP-Next](#1-install-osp-next)\n    - [2. *Optional* · Build Flash-Attention](#2-optional--build-flash-attention-hopper--ampere)\n  - [🟣 NPU (Ascend)](#-npu-ascend)\n    - [1. 
Install Ascend CANN 8.5.0](#1-install-ascend-cann-850-one-time-before-any-conda-step)\n    - [2. Set up the conda env and install OSP-Next](#2-set-up-the-conda-env-and-install-osp-next)\n    - [3. Build `decord` from source](#3-build-decord-from-source-aarch64-only)\n    - [4. *Optional* · Rebuild the HiF8 NPU Quant Kernel](#4-optional--rebuild-the-hif8-npu-quant-kernel-only-on-import-error)\n- [🎥 Inference Pipeline](#-inference-pipeline)\n- [🏋️ Training Pipeline](#️-training-pipeline)\n  - [📚 Data Preparation](#-data-preparation)\n  - [🎓 Supervised Fine-Tuning (SFT)](#-supervised-fine-tuning-sft)\n  - [🎯 Reinforcement Learning (Mix-GRPO + LoRA)](#-reinforcement-learning-mix-grpo--lora)\n- [🛠️ Tips & Troubleshooting](#️-tips--troubleshooting)\n- [🙏 Acknowledgements](#-acknowledgements)\n- [📝 Citation](#-citation)\n- [📄 License](#-license)\n- [⭐ Star History](#-star-history)\n\n---\n\n## 🧱 Project Layout\n\n```\nOSP-Next\u002F\n├── configs\u002F                     # All YAML configs\n│   ├── infer\u002F{gpu,npu}\u002F         # Inference configs (per backend)\n│   ├── train\u002F{gpu,npu}\u002F         # Training configs (per backend)\n│   ├── filter_config.yaml       # Data-filter \u002F LMDB-build settings\n│   └── all_videos.txt           # Example ann_txt index for filter_data.py\n├── scripts\u002F                     # Launch scripts (torchrun)\n│   ├── infer\u002F{gpu,npu}\u002F         # Inference launchers\n│   ├── train\u002F{gpu,npu}\u002F         # Training launchers\n│   └── filter_data.sh           # Wrapper: runs filter_data.py\n├── ospnext\u002F                     # Core library\n│   ├── modules\u002F                 # Diffusion \u002F VAE \u002F T5 \u002F attention \u002F HiF8\n│   ├── distributed\u002F             # FSDP2 + sequence-parallel state & comm\n│   ├── data\u002F                    # Datasets, samplers, collators\n│   ├── pipelines\u002F               # End-to-end inference pipelines\n│   ├── rewards\u002F                 # VideoAlign 
reward (for RL)\n│   ├── schedulers\u002F              # Flow matching scheduler\n│   ├── utils\u002F                   # Logging, EMA, checkpointing, encoder cache\n│   └── quant_cy_npu\u002F            # HiF8 quant op (NPU custom kernel)\n├── train\u002F\n│   ├── train_osp.py             # Entry: SFT training\n│   └── train_osp_RL.py          # Entry: Mix-GRPO + LoRA RL training\n├── infer\u002F\n│   └── infer_osp.py             # Entry: text-to-video inference\n├── merge_lora_weights.py        # Merge RL LoRA into base for deployment\n├── filter_data.py               # Entry: build LMDB from annotated video corpus\n├── assets\u002F\n│   ├── logo.png                 # README logo\n│   └── t2v\u002F                     # Sample prompt files\n├── requirements.txt             # GPU pip requirements\n├── requirements_npu.txt         # NPU pip requirements\n├── pyproject.toml               # Editable install metadata\n└── LICENSE.txt                  # Project license\n```\n\n---\n\n## ⚙️ Environment Setup\n\nWe strongly recommend using **conda + editable install** so that every entry\npoint (`train\u002F`, `infer\u002F`, custom scripts) sees the `ospnext` package\nautomatically.\n\n> 📦 The setup is split by backend — **pick one** and follow it top-to-bottom.\n> GPU users do not need anything from the NPU section and vice versa.\n\n---\n\n### 🟢 GPU (NVIDIA CUDA)\n\n#### 1. Install OSP-Next\n\n```bash\n# 1a. Create the conda env\nconda create -n ospnext python=3.10 -y\nconda activate ospnext\n\n# 1b. 
Install all dependencies in editable mode\ncd \u002Fpath\u002Fto\u002FOSP-Next\npip install -e .\n```\n\nWhat this installs:\n\n- `torch==2.8.0`, `torchvision==0.23.0` (CUDA build, picked by pip wheel)\n- `diffusers>=0.31`, `transformers>=4.55`, `accelerate>=1.4`, `peft>=0.10`, `trl>=0.11`\n- All data \u002F IO \u002F logging utilities listed in `pyproject.toml`\n\nEquivalent `pip -r` form:\n\n```bash\npip install -r requirements.txt\n```\n\n> ⚠️  `flash_attn` is **not** in `pyproject.toml` because building it via plain\n> `pip` is fragile. Build it manually only if you want FA2 \u002F FA3 acceleration —\n> see [step 2](#2-optional--build-flash-attention-hopper--ampere) right below.\n> Without it, the code falls back to PyTorch SDPA automatically.\n\n#### 2. *Optional* · Build Flash-Attention (Hopper \u002F Ampere)\n\nThe attention layer in `ospnext\u002Fmodules\u002Fattention.py` tries to import\n`flash_attn_interface` (FA3) first, falls back to `flash_attn` (FA2), and\nfinally to PyTorch SDPA — so this step is **strictly optional**.\n\n```bash\n# Flash-Attention v2 (CUDA 11.8+, Ampere \u002F Hopper)\npip install ninja packaging\npip install flash-attn --no-build-isolation\n\n# OR Flash-Attention v3 (Hopper-only, faster)\ngit clone https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\ncd flash-attention\u002Fhopper\npython setup.py install\n```\n\n> ⏳  **Heads up — this build is slow.** Compiling Flash-Attention from source\n> typically takes **30 min – 2 h** depending on CPU \u002F RAM (each CUDA kernel is\n> instantiated for many head-dim × dtype × causal combinations). Run it inside\n> `tmux` \u002F `screen` so an SSH disconnect doesn't kill it, and don't be alarmed\n> if `pip` looks \"stuck\" — it's just `nvcc` working its way through hundreds of\n> translation units. 
The wheel is cached afterwards, so subsequent reinstalls\n> in the same environment are instant.\n\n> 💡  If the build keeps OOM-ing the host, lower the parallel job count:\n> `MAX_JOBS=4 pip install flash-attn --no-build-isolation`. The same flag\n> applies to the FA3 `setup.py` build.\n\n---\n\n### 🟣 NPU (Ascend)\n\n#### 1. Install Ascend CANN 8.5.0 (one-time, before any conda step)\n\nOSP-Next is pinned to **CANN 8.5.0** — older toolkits (e.g. 8.0.x) are missing\nseveral operators we rely on, and newer pre-release branches have not been\nvalidated. Grab the matching installer for your hardware (Atlas 800T A2 \u002F\nAscend 950PR \u002F …) from the **official Ascend portal**:\n\n> 🔗 **Download:** \u003Chttps:\u002F\u002Fwww.hiascend.com\u002Fcann\u002Fdownload>\n>\n> Pick the **`8.5.0`** release that matches your OS and architecture\n> (e.g. *Ubuntu 22.04 aarch64* or *openEuler 22.03 aarch64*), then follow the\n> on-page installation guide. After install, the toolkit normally lives at\n> `\u002Fusr\u002Flocal\u002FAscend\u002Fascend-toolkit\u002F`, and step 2 below assumes that path —\n> adjust accordingly if you installed elsewhere.\n\n#### 2. Set up the conda env and install OSP-Next\n\n```bash\n# 2a. Source the Ascend toolkit (do this in EVERY new shell)\nsource \u002Fusr\u002Flocal\u002FAscend\u002Fascend-toolkit\u002Fset_env.sh\n\n# 2b. Create the conda env\nconda create -n ospnext-npu python=3.10 -y\nconda activate ospnext-npu\n\n# 2c. Install with the NPU extra\ncd \u002Fpath\u002Fto\u002FOSP-Next\npip install -e .[npu]\n```\n\nThe `[npu]` extra adds `torch_npu==2.8.0.post2` on top of the pinned\n`torch==2.8.0 + torchvision==0.23.0` from the core dependencies. It does **not**\ninstall `flash_attn` — on NPU we use SDPA \u002F custom kernels.\n\nEquivalent `pip -r` form:\n\n```bash\npip install -r requirements_npu.txt\n```\n\n#### 3. Build `decord` from source (aarch64 only)\n\n`decord` is used by the data pipeline to read training \u002F reward videos. 
It does\n**not** publish pre-built wheels for `aarch64 + Python 3.10`, which is exactly\nthe configuration most Ascend hosts (Kunpeng \u002F HiSilicon ARM CPUs) run on — a\nplain `pip install decord` will therefore fail or pull in an incompatible\nbinary. Build it once from source:\n\n```bash\n# 3a. System deps (Ubuntu \u002F openEuler \u002F OpenAnolis — pick your package manager)\nsudo apt-get install -y build-essential cmake ffmpeg \\\n                        libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev\n# (openEuler \u002F CentOS users: dnf install -y gcc-c++ cmake ffmpeg-devel)\n\n# 3b. Clone and build (CPU-only — NPU has no CUDA decode path)\ngit clone --recursive https:\u002F\u002Fgithub.com\u002Fdmlc\u002Fdecord\ncd decord\nmkdir -p build && cd build\ncmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release\nmake -j$(nproc)\n\n# 3c. Install into the current conda env\ncd ..\u002Fpython\npython setup.py install\n```\n\n> ⚠️  If `python -c \"import decord\"` still raises `ImportError: libdecord.so:\n> cannot open shared object file` after install, the compiled native lib is not\n> on the loader path. Add the `build\u002F` directory once and persist it in your\n> conda env activate hook:\n>\n> ```bash\n> export LD_LIBRARY_PATH=$(pwd)\u002F..\u002Fbuild:$LD_LIBRARY_PATH\n> ```\n\n#### 4. *Optional* · Rebuild the HiF8 NPU Quant Kernel (only on import error)\n\nUsed **only** for the `osp_hif8_*` configs in `configs\u002Finfer\u002Fnpu\u002F` \u002F\n`configs\u002Ftrain\u002Fnpu\u002F`. 
**You normally do not need to do anything here** —\n`ospnext\u002Fquant_cy_npu\u002F` is shipped with a pre-compiled CANN 8.5.0 kernel\n(`libnpu_quant_op.so` + `npu_quant.cpython-3??-aarch64-linux-gnu.so`), so a\nplain `python -c \"from ospnext.quant_cy_npu import *\"` should already work.\n\nYou only have to rebuild when the import fails — typical symptoms are:\n\n- `ImportError: undefined symbol: ...` (mismatch between our shipped `.so` and your local CANN \u002F Python ABI)\n- The shipped `.so` is tagged for a different Python version (e.g. our wheel is built for `cpython-311`, but you installed `python=3.10` per [step 2](#2-set-up-the-conda-env-and-install-osp-next))\n- You're running on a CANN release we haven't validated against\n\nThe fix is to **re-build the kernel from the upstream HiFloat8 repository**\n([global-computing-consortium\u002FHiFloat8](https:\u002F\u002Fgithub.com\u002Fglobal-computing-consortium\u002FHiFloat8))\nand swap the resulting package into our tree:\n\n```bash\n# 4a. Clone upstream HiFloat8 anywhere outside this repo.\ngit clone https:\u002F\u002Fgithub.com\u002Fglobal-computing-consortium\u002FHiFloat8.git\ncd HiFloat8\u002Fhif8_npu\n\n# 4b. Re-build against your local CANN + Python (re-source set_env.sh first!).\nsource \u002Fusr\u002Flocal\u002FAscend\u002Fascend-toolkit\u002Fset_env.sh\nbash build_npu_ops.sh\n\n# 4c. Sanity-check the rebuild inside the upstream tree.\npython hif8_bf16.py     # expected: \"ABS diff max (zero values): 0\"\n\n# 4d. Replace OSP-Next's bundled package with the freshly-built one.\ncd \u002Fpath\u002Fto\u002FOSP-Next\nrm -rf ospnext\u002Fquant_cy_npu\ncp -r \u002Fpath\u002Fto\u002FHiFloat8\u002Fhif8_npu\u002Fquant_cy_npu ospnext\u002F\n\n# 4e. Final check inside the OSP-Next env.\npython -c \"from ospnext.quant_cy_npu import *; print('HiF8 kernel OK')\"\n```\n\n> ✅  After the swap, `python -c \"from ospnext.quant_cy_npu import *\"` should\n> succeed. 
If it still doesn't, re-source `set_env.sh` and re-run `bash\n> build_npu_ops.sh` inside the **same** shell — the build is sensitive to\n> environment leaks between sessions.\n\n---\n\n## 🎥 Inference Pipeline\n\nInference is launched through `infer\u002Finfer_osp.py` with a YAML config. The\ntypical flow is:\n\n1. Copy \u002F edit a config under `configs\u002Finfer\u002F{gpu,npu}\u002F`.\n2. Update the `\u002Fpath\u002Fto\u002F...` placeholders to point to your local weights \u002F\n   prompts \u002F output dir.\n3. Run the matching shell script under `scripts\u002Finfer\u002F{gpu,npu}\u002F`.\n\n### 📝 Step 1 — Edit the config\n\nExample: `configs\u002Finfer\u002Fgpu\u002Fosp_14b.yaml`. The fields you almost always need\nto change are highlighted below:\n\n```yaml\nmodel_name: \"osp_next\"\npipeline_name: \"t2v\"\nseed: 1024\n\nprompt_txt: \"assets\u002Ft2v\u002Fsimple_prompts.txt\"   # 🔧 one prompt per line\noutput_dir: \"\u002Fpath\u002Fto\u002Foutput\"                  # 🔧 where to save *.mp4\n\nnum_frames: 81                                 # video length\nheight: 720                                    # spatial resolution\nwidth: 1280\nsave_fps: 16                                   # output mp4 fps\nbatch_size: 1                                  # per-rank batch size\n\nfsdp_size: 8                                   # FSDP world size\nsp_size: 4                                     # Ulysses SP size\nskiparse_sp_size: 4                            # Skiparse SP size\nuse_sequence_parallel: False                   # toggle Ulysses SP\nuse_skiparse_sequence_parallel: True           # toggle Skiparse SP\nreshard_after_forward: Null                    # FSDP2 setting, leave Null\nexplicit_prefetching_num_blocks: 2             \nweight_dtype: \"bf16\"                           # bf16 \u002F fp16 \u002F fp32\nsave_with_dcp_api: False                       # MUST match the flag used when the checkpoint was saved\n\nmodel_config:\n  dim: 5120                             
       # 14B = 5120; 1.3B = 1536\n  ffn_dim: 13824\n  num_heads: 40\n  num_layers: 40\n  skiparse_model_type: \"dual_end\"              # 'full' disables skiparse\n  sparse_ratio: 2\n  num_full_blocks: 8\n  pretrained_model_dir_or_checkpoint: \"\u002Fpath\u002Fto\u002Fmodel\"   # 🔧 your weights\n\nscheduler_config:\n  scheduler_name: \"flow_matching\"\n  num_inference_steps: 50                      # quality vs speed\n  shift: 7.0                                   # flow-matching shift\n  guidance_scale: 5.0                          # CFG guidance scale\n\nvae_config:\n  vae_path: \"\u002Fpath\u002Fto\u002Fvae\"                     # 🔧 VAE checkpoint\n  dtype: \"fp32\"\n\ntext_encoder_config:\n  text_len: 512\n  checkpoint_path: \"\u002Fpath\u002Fto\u002Ftext_encoder\"     # 🔧 T5 checkpoint\n  text_tokenizer_path: \"\u002Fpath\u002Fto\u002Ftext_tokenizer\"  # 🔧 T5 tokenizer\n  use_fsdp: True                               # FSDP-shard the T5 encoder\n```\n\n🔧 marked fields **must** be filled in for the run to succeed. 
Everything else\nhas reasonable defaults inside the code.\n\n### 🚀 Step 2 — Pick a launch script\n\n| Backend | Model            | Script                                          | Config                                       |\n|---------|------------------|-------------------------------------------------|----------------------------------------------|\n| GPU     | OSP-Next 14B     | `scripts\u002Finfer\u002Fgpu\u002Finfer_osp_14b.sh`            | `configs\u002Finfer\u002Fgpu\u002Fosp_14b.yaml`            |\n| GPU     | OSP-Next 1.3B †  | `scripts\u002Finfer\u002Fgpu\u002Finfer_osp_1_3b.sh`           | `configs\u002Finfer\u002Fgpu\u002Fosp_1_3b.yaml`           |\n| NPU     | OSP-Next 14B     | `scripts\u002Finfer\u002Fnpu\u002Finfer_osp_14b.sh`            | `configs\u002Finfer\u002Fnpu\u002Fosp_14b.yaml`            |\n| NPU     | OSP-Next 1.3B †  | `scripts\u002Finfer\u002Fnpu\u002Finfer_osp_1_3b.sh`           | `configs\u002Finfer\u002Fnpu\u002Fosp_1_3b.yaml`           |\n| NPU     | HiF8 14B ‡       | `scripts\u002Finfer\u002Fnpu\u002Finfer_osp_hif8_14b.sh`       | `configs\u002Finfer\u002Fnpu\u002Fosp_hif8_14b.yaml`       |\n| NPU     | HiF8 1.3B † ‡    | `scripts\u002Finfer\u002Fnpu\u002Finfer_osp_hif8_1_3b.sh`      | `configs\u002Finfer\u002Fnpu\u002Fosp_hif8_1_3b.yaml`      |\n\n> † &nbsp;No official **1.3B** checkpoint is released — the 1.3B scripts and\n> configs are kept as ready-to-go templates if you choose to train your own\n> 1.3B variant from scratch.\u003Cbr\u002F>\n> ‡ &nbsp;The HiF8 scripts require the **HiF8 NPU quant kernel** to import\n> successfully — verify with `python -c \"from ospnext.quant_cy_npu import *\"`,\n> and re-build via [NPU setup step 4](#4-optional--rebuild-the-hif8-npu-quant-kernel-only-on-import-error) if needed.\n\n### ▶️ Step 3 — Run\n\n```bash\n# Single node — defaults to NPRC_PER_NODE=8.\n# For an 8×NPU node, just run as-is. 
For a 16×NPU node, override:\n#   NPRC_PER_NODE=16 bash scripts\u002Finfer\u002Fnpu\u002Finfer_osp_14b.sh\nbash scripts\u002Finfer\u002Fgpu\u002Finfer_osp_14b.sh\n\n# Multi-node — override env vars (inference uses NNODES; see Tips & Troubleshooting).\nNNODES=4 MASTER_ADDR=10.0.0.1 MASTER_PORT=29500 \\\n    bash scripts\u002Finfer\u002Fgpu\u002Finfer_osp_14b.sh\n```\n\nOutputs:\n\n```\n${output_dir}\u002F\n├── config.yaml            # a snapshot of the launch config (rank 0)\n├── video_0.mp4            # one mp4 per prompt, named after its prompt index\n├── video_1.mp4\n├── ...\n└── video_grid.mp4         # NxN tiled preview of all generated clips\n```\n\n---\n\n## 🏋️ Training Pipeline\n\nTwo entry points are provided:\n\n| Entry                   | Purpose                                | Optimizer target            |\n|-------------------------|----------------------------------------|-----------------------------|\n| `train\u002Ftrain_osp.py`    | Supervised fine-tuning (SFT)           | Full-parameter \u002F FSDP2      |\n| `train\u002Ftrain_osp_RL.py` | Mix-GRPO RL post-training w\u002F LoRA      | LoRA adapters only          |\n\n### 📚 Data Preparation\n\n> ℹ️  **SFT only.** The RL pipeline (`train\u002Ftrain_osp_RL.py`) consumes a plain\n> text prompt file (one prompt per line) and **does not need this step**. See\n> the [RL section](#-reinforcement-learning-mix-grpo--lora) for that format.\n\nSFT reads training videos from an **LMDB-backed meta store**. 
Building it is a\nthree-step pipeline:\n\n#### Step 1 — Write a meta JSON for every video corpus\n\nFor each batch of training videos, produce a JSON file describing each clip:\n\n```jsonc\n[\n  {\n    \"path\": \"path\u002Fto\u002Fa\u002Fvideo.mp4\",             \u002F\u002F 🔧 required — video file path\n    \"cap\":  \"A stylish woman walks down ...\",  \u002F\u002F 🔧 required — caption\n    \"resolution\": {\"height\": 1080, \"width\": 1920},  \u002F\u002F optional, auto-probed if absent\n    \"fps\": 24,                                  \u002F\u002F optional, auto-probed if absent\n    \"num_frames\": 81,                           \u002F\u002F optional, auto-probed if absent\n    \"cut\": [0, 81]                              \u002F\u002F optional — [start, end) frame\n                                                \u002F\u002F indices when the JSON points\n                                                \u002F\u002F to a sub-clip of a long video\n  },\n  {\n    \"path\": \"...\",\n    \"cap\":  \"...\"\n  }\n]\n```\n\nYou can have many such JSON files — one per corpus \u002F per source.\n\n#### Step 2 — Write the annotation index (`ann_txt`)\n\nCreate a `.txt` index that tells the filter where each meta JSON lives and what\nits videos' root directory is. 
**One line per JSON, format:**\n\n```text\n\u003Cvideos_root_dir>,\u003Cabsolute_path_to_meta_json>\n```\n\nFor example, `all_videos.txt`:\n\n```text\n\u002Fdata\u002Fvideo_corpus_A,\u002Fdata\u002Fvideo_corpus_A\u002Fmeta.json\n\u002Fdata\u002Fvideo_corpus_B,\u002Fdata\u002Fvideo_corpus_B\u002Fmeta.json\n```\n\nThe filter will prepend `\u003Cvideos_root_dir>` to each relative `path` field\ninside the corresponding JSON.\n\n#### Step 3 — Configure the filter and build the LMDB\n\nEdit `configs\u002Ffilter_config.yaml`:\n\n```yaml\nann_txt_path: \"all_videos.txt\"           # 🔧 the index from Step 2\nsave_path:    \"\u002Fpath\u002Fto\u002Ftrain\u002Fdataset\"   # 🔧 destination LMDB folder\nsample_height:     720                   # videos will be filtered to fit this\nsample_width:      1280\nsample_num_frames: 81\ntrain_fps:         16                    # target training fps\nmin_hxw:           921600                # min H×W; reject anything smaller\n                                         #   1080×1920 → 2_073_600 (use 2_000_000)\n                                         #    864×1536 → 1_327_104\n                                         #    720×1280 →   921_600\n                                         #    576×1024 →   589_824\n                                         #    480×832  →   399_360\nmax_h_div_w_ratio: 1.2                   # reject overly portrait videos\nmin_h_div_w_ratio: 0.4                   # reject overly landscape videos\nmax_motion_value:  0.02                  # reject overly static \u002F overly shaky\n```\n\nThen run:\n\n```bash\nbash scripts\u002Ffilter_data.sh\n# equivalent to:\n#   python filter_data.py --filter_config configs\u002Ffilter_config.yaml\n```\n\nThis produces an LMDB at `save_path`, which becomes the actual dataset\nconsumed by `train\u002Ftrain_osp.py`. 
**Two things** must be flipped in your\ntraining config to use it (the shipped configs default to the random-tensor\ndebug dataset — see the callout below):\n\n```yaml\ndata_config:\n  dataset_name: \"wan_t2v\"                              # ← real LMDB dataset\n  dataset_config:\n    metafile_or_dir_path: \"\u002Fpath\u002Fto\u002Ftrain\u002Fdataset\"    # ← Step-3 save_path\n    ...\n```\n\n> 💡  **Why LMDB?** LMDB keeps memory usage flat during training and avoids the\n> memory leaks that pile up when `decord` opens \u002F closes thousands of video\n> readers across DataLoader workers.\n\n> 🧪 **`t2v_random` vs `wan_t2v` — pick the right one**\n>\n> All shipped `configs\u002Ftrain\u002F**.yaml` files set `dataset_name: \"t2v_random\"`,\n> which is a **synthetic random-tensor dataset** (`T2VRandomDataset`) used to\n> smoke-test the training loop *without any real data on disk* — convenient\n> for verifying that FSDP2 \u002F SP \u002F the optimizer step are all wired up\n> correctly. For an actual training run you **must**:\n>\n> 1. Build the LMDB through this Data Preparation pipeline (Steps 1-3 above).\n> 2. Set `data_config.dataset_name: \"wan_t2v\"` (this picks `WanT2VDataset`).\n> 3. 
Set `data_config.dataset_config.metafile_or_dir_path` to the LMDB folder.\n>\n> Forgetting any of these silently trains the model on random noise — loss\n> will look \"fine\" but the model learns nothing.\n\n### 🎓 Supervised Fine-Tuning (SFT)\n\n#### 📝 Step 1 — Edit the training config\n\nExample: `configs\u002Ftrain\u002Fgpu\u002Fosp_14b.yaml`:\n\n```yaml\nmodel_name: \"osp_next\"\nseed: 1024\n\noutput_dir: \"\u002Fpath\u002Fto\u002Foutput\"            # 🔧 checkpoint root\ntraining_iteration: 1000000              # total steps\nfsdp_size: 8                             # FSDP world size\nsp_size: 4\nskiparse_sp_size: 4\nuse_sequence_parallel: False             # Ulysses SP\nuse_skiparse_sequence_parallel: True     # Skiparse SP (recommended)\ngradient_checkpointing: True             # memory ↘️ , compute ↗️\ngradient_accumulation_steps: 1\ninit_max_grad_norm: 1.0\nlog_interval: 1\nsave_interval: 1000                      # save every N steps\nweight_dtype: \"bf16\"\nema_decay: 0.9999                        # GPU default; NPU 14B recipe uses 0.999, NPU 1.3B uses 0.9993\nema_update_interval: 1\nsave_with_dcp_api: True\n\nwandb_config:\n  project_name: \"osp_next\"               # 🔧 your wandb project\n  exp_name:     \"osp_next\"               # 🔧 run name\n\nmodel_config:\n  dim: 5120\n  ffn_dim: 13824\n  num_heads: 40\n  num_layers: 40\n  skiparse_model_type: \"dual_end\"\n  sparse_ratio: 2\n  num_full_blocks: 8\n  pretrained_model_dir_or_checkpoint: \"\u002Fpath\u002Fto\u002Fmodel\"   # 🔧 init weights\n\nscheduler_config:\n  scheduler_name: \"flow_matching\"\n  use_dynamic_shifting: True\n  use_logitnorm_time_sampling: True\n\nvae_config:\n  vae_path: \"\u002Fpath\u002Fto\u002Fvae\"               # 🔧 frozen VAE\n  dtype: \"fp32\"\n\ntext_encoder_config:\n  text_len: 512\n  checkpoint_path: \"\u002Fpath\u002Fto\u002Ftext_encoder\"  # 🔧 frozen T5\n  use_fsdp: True\n\ndata_config:\n  batch_size: 1                          # per-rank batch size\n  num_workers: 16\n  
shuffle: True\n  # ⚠️  \"t2v_random\" is a synthetic random-tensor dataset — only use it to\n  #    smoke-test the loop. For real training, switch to \"wan_t2v\" and set\n  #    metafile_or_dir_path to the LMDB built in Data Preparation Step 3.\n  dataset_name: \"t2v_random\"             # 🔧 change to \"wan_t2v\" for real training\n  dataset_config:\n    text_tokenizer_path: \"\u002Fpath\u002Fto\u002Ftext_tokenizer\"   # 🔧\n    # metafile_or_dir_path: \"\u002Fpath\u002Fto\u002Ftrain\u002Fdataset\" # 🔧 REQUIRED for wan_t2v\n    text_drop_ratio: 0.1\n    sample_height: 720\n    sample_width: 1280\n    sample_num_frames: 81\n    tokenizer_max_length: 512\n    return_prompt_mask: True\n  sampler_name: \"stateful_distributed\"\n  collator_name: \"wan_t2v\"               # collator stays \"wan_t2v\" for both modes\n\noptimizer_config:\n  lr: 0.00002\n  weight_decay: 0\n```\n\n#### 🚀 Step 2 — Launch\n\n| Backend | Model           | Script                                       | Config                                  |\n|---------|-----------------|----------------------------------------------|-----------------------------------------|\n| GPU     | 14B             | `scripts\u002Ftrain\u002Fgpu\u002Ftrain_osp_14b.sh`         | `configs\u002Ftrain\u002Fgpu\u002Fosp_14b.yaml`        |\n| GPU     | 1.3B            | `scripts\u002Ftrain\u002Fgpu\u002Ftrain_osp_1_3b.sh`        | `configs\u002Ftrain\u002Fgpu\u002Fosp_1_3b.yaml`       |\n| NPU     | 14B             | `scripts\u002Ftrain\u002Fnpu\u002Ftrain_osp_14b.sh`         | `configs\u002Ftrain\u002Fnpu\u002Fosp_14b.yaml`        |\n| NPU     | 1.3B            | `scripts\u002Ftrain\u002Fnpu\u002Ftrain_osp_1_3b.sh`        | `configs\u002Ftrain\u002Fnpu\u002Fosp_1_3b.yaml`       |\n| NPU     | HiF8 14B ‡      | *(copy `train_osp_14b.sh` ↓)*                | `configs\u002Ftrain\u002Fnpu\u002Fosp_hif8_14b.yaml`   |\n| NPU     | HiF8 1.3B ‡     | *(copy `train_osp_1_3b.sh` ↓)*               | 
`configs\u002Ftrain\u002Fnpu\u002Fosp_hif8_1_3b.yaml`  |\n\n> ‡ &nbsp;HiF8 SFT does **not** ship its own launch script — copy\n> `scripts\u002Ftrain\u002Fnpu\u002Ftrain_osp_14b.sh` to `train_osp_hif8_14b.sh` and only\n> change the `--config` flag to the HiF8 yaml (e.g.\n> `--config configs\u002Ftrain\u002Fnpu\u002Fosp_hif8_14b.yaml`). All other env vars \u002F FSDP\n> settings carry over. Make sure the HiF8 NPU kernel imports cleanly first\n> (see [NPU setup step 4](#4-optional--rebuild-the-hif8-npu-quant-kernel-only-on-import-error)).\n\n```bash\n# Single node (default NPRC_PER_NODE=8)\nbash scripts\u002Ftrain\u002Fgpu\u002Ftrain_osp_14b.sh\n\n# Multi-node — training scripts read PET_NNODES + RANK (NOT NNODES \u002F NODE_RANK)\nPET_NNODES=4 RANK=0 MASTER_ADDR=10.0.0.1 MASTER_PORT=29501 \\\n    bash scripts\u002Ftrain\u002Fgpu\u002Ftrain_osp_14b.sh\n# … on every other node, bump RANK accordingly: RANK=1, RANK=2, RANK=3\n```\n\nResuming is automatic — `Checkpointer.last_training_iteration` picks the most\nrecent checkpoint folder under `output_dir`.\n\n### 🎯 Reinforcement Learning (Mix-GRPO + LoRA)\n\n> 🥇 &nbsp;**First RL pipeline for sparse video diffusion.** To the best of our\n> knowledge, OSP-Next is the first project to apply RL post-training directly to\n> a *sparse* video diffusion model — see the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.28691) for the\n> design rationale.\n\nThe RL post-training uses the same FSDP2 backbone but trains a **LoRA** adapter\non top of frozen base weights, sampled with **SDE → ODE hybrid** denoising,\noptimized with **Mix-GRPO** against a **VideoAlign reward**.\n\n#### 📝 Step 1 — Edit the RL config\n\nExample: `configs\u002Ftrain\u002Fnpu\u002Fosp_14b_RL.yaml`. 
The RL-specific blocks\n(`lora_config`, `rl_config`) are what you tune most:\n\n```yaml\nmodel_name: \"osp_next\"\nseed: 42\noutput_dir: \"\u002Fpath\u002Fto\u002Foutput\"            # 🔧 RL checkpoint root\n\nnum_epochs: 1000                         # RL epochs (not SFT steps)\nfsdp_size: 16\nsp_size: 4\nskiparse_sp_size: 4\nuse_sequence_parallel: False\nuse_skiparse_sequence_parallel: True\nreshard_after_forward: Null\nexplicit_prefetching_num_blocks: 0\ngradient_checkpointing: True\ngradient_accumulation_steps: 1\ninit_max_grad_norm: 1.0\nlog_interval: 1\nsave_interval: 500\nweight_dtype: \"bf16\"\nema_decay: 0.999                         # use 0.999 for 14B\nema_update_interval: 1\nsave_with_dcp_api: True\nmodel_cpu_offload: False\nencoder_cpu_offload: False\nprofiling: False\n\nwandb_config:\n  project_name: \"osp_next_RL\"\n  exp_name: \"osp_next_RL\"\n\nmodel_config:\n  # ↓ keep dim \u002F num_heads \u002F num_layers \u002F skiparse_* identical to your SFT\n  #   config — only the LoRA adapter is being trained, the base must match.\n  dim: 5120\n  ffn_dim: 13824\n  freq_dim: 256\n  in_dim: 16\n  num_heads: 40\n  num_layers: 40\n  out_dim: 16\n  text_len: 512\n  skiparse_model_type: \"dual_end\"\n  sparse_ratio: 2\n  num_full_blocks: 8\n  pretrained_model_dir_or_checkpoint: \"\u002Fpath\u002Fto\u002Fmodel\"   # 🔧 base ckpt\n\nscheduler_config:\n  scheduler_name: \"flow_matching\"\n  use_dynamic_shifting: True\n  use_logitnorm_time_sampling: True\n\nvae_config:\n  vae_path: \"\u002Fpath\u002Fto\u002Fvae\"               # 🔧\n  dtype: \"fp16\"                          # RL uses fp16 (rollout-only VAE saves VRAM); SFT\u002Finfer use fp32\n\ntext_encoder_config:\n  text_len: 512\n  checkpoint_path: \"\u002Fpath\u002Fto\u002Ftext_encoder\"           # 🔧\n  text_tokenizer_path: \"\u002Fpath\u002Fto\u002Ftokenizer\"          # 🔧\n  use_fsdp: True\n\n# RL training uses a text-only prompt dataset; only the tokenizer is needed.\ndata_config:\n  dataset_config:\n    
text_tokenizer_path: \"\u002Fpath\u002Fto\u002Ftokenizer\"        # 🔧\n    tokenizer_max_length: 512\n\noptimizer_config:\n  lr: 0.00002                            # 2e-5 for the LoRA optimizer\n  weight_decay: 0.001\n\nlora_config:\n  rank: 32                               # LoRA rank\n  alpha: 64                              # LoRA alpha\n  target_modules:                        # which projections get LoRA\n    - \"self_attn.q\"\n    - \"self_attn.k\"\n    - \"self_attn.v\"\n    - \"self_attn.o\"\n    - \"cross_attn.q\"\n    - \"cross_attn.k\"\n    - \"cross_attn.v\"\n    - \"cross_attn.o\"\n  # lora_path: \"\u002Fpath\u002Fto\u002Fexisting\u002Flora\"  # uncomment to resume from a LoRA ckpt\n\nrl_config:\n  prompt_file:      \"\u002Fpath\u002Fto\u002Fprompt_file\"           # 🔧 train prompts (txt)\n  eval_prompt_file: \"\u002Fpath\u002Fto\u002Feval_prompt_file\"      # 🔧 eval prompts (txt)\n  height: 720\n  width:  1280\n  num_frames: 81\n  sde_steps: 10                          # # of steps trained with SDE noise\n  num_inference_steps: 25                # total denoising steps in rollout sampling\n  guidance_scale: 5.0                    # CFG scale used during rollout sampling\n  kl_beta: 0.004                         # KL penalty weight (set 0 to disable)\n  num_batches_per_epoch: 4               # batches per RL epoch\n  num_image_per_prompt: 4                # k repeats (Mix-GRPO group size)\n  sample_time_per_prompt: 1              # how many times each prompt is rolled out per epoch\n  sample_batch_size: 2                   # batch size during rollout\n  train_batch_size:  2                   # batch size during policy update\n  eval_num_steps: 50                     # denoising steps used in the eval pass\n  eval_freq: 20                          # run eval every N RL epochs\n  use_cfg_in_train: True                 # apply CFG in the policy update too\n  adv_clip_max: 5.0                      # max abs value for advantage clipping\n  clip_range: 1e-4  
                     # PPO-style ratio clip range\n  reward_fn:\n    videoalign: 1.0                      # 🔧 VideoAlign weight (1.0 → only reward used);\n                                         #    set the actual checkpoint path through the\n                                         #    `load_from_pretrained` kwarg in\n                                         #    ospnext\u002Frewards\u002Frewards.py :: multi_score(),\n                                         #    otherwise scorer init will fail at startup.\n```\n\n#### 🚀 Step 2 — Launch\n\n| Backend | Model | Script                                    | Config                                |\n|---------|-------|-------------------------------------------|----------------------------------------|\n| GPU     | 14B   | `scripts\u002Ftrain\u002Fgpu\u002Ftrain_osp_14b_RL.sh`   | `configs\u002Ftrain\u002Fgpu\u002Fosp_14b_RL.yaml`   |\n| NPU     | 14B   | `scripts\u002Ftrain\u002Fnpu\u002Ftrain_osp_14b_RL.sh`   | `configs\u002Ftrain\u002Fnpu\u002Fosp_14b_RL.yaml`   |\n\n```bash\nbash scripts\u002Ftrain\u002Fnpu\u002Ftrain_osp_14b_RL.sh\n```\n\n#### 📦 What gets saved during RL\n\nThe RL trainer **only persists the LoRA adapter** — the base model is frozen,\nso we deliberately skip saving its weights to keep checkpoints small and\ndeployable. 
Each save (`save_interval` epochs + a final save) produces:\n\n```\n${output_dir}\u002F\n├── lora-checkpoint-10\u002F                   # 🎯 current LoRA (deployable)\n│   ├── adapter_model.bin                 # LoRA matrices only\n│   ├── adapter_config.json               # PEFT LoRA config\n│   ├── adaptive_grad_clipper.pt          # grad-clipper EMA state (resume helper)\n│   └── rl_training_state.json            # epoch \u002F global_step bookkeeping\n└── lora-checkpoint-10-ema\u002F               # 🎯 EMA-averaged LoRA (recommended for inference)\n    ├── adapter_model.bin\n    └── adapter_config.json\n```\n\n> 💡 Use `lora-checkpoint-{step}-ema\u002F` for inference (matches what's used during\n> in-training eval). Use the plain `lora-checkpoint-{step}\u002F` if you want to\n> resume RL training — point `lora_config.lora_path` at it to pick up the LoRA\n> weights and the sidecar `rl_training_state.json` \u002F grad-clipper state.\n\n#### 🔗 Step 3 — Merge LoRA back into the base model\n\nOSP-Next inference (`infer\u002Finfer_osp.py`) loads a **plain (merged) base model**,\nnot a `PeftModel`. 
After RL training finishes, run `merge_lora_weights.py` to\nfold the LoRA delta into the frozen base weights and save a single\ndeployment-ready checkpoint:\n\n```python\n# merge_lora_weights.py — edit the four paths at the bottom and run once.\nfrom ospnext.modules.osp_next import OSPNextModel\nfrom merge_lora_weights import load_lora_and_merge\n\nmodel_path = \"\u002Fpath\u002Fto\u002Fosp_next_base\"                         # 🔧 same base used during RL\nlora_path  = \"\u002Fpath\u002Fto\u002Foutput_dir\u002Flora-checkpoint-1000-ema\u002Fadapter_model.bin\"  # 🔧 prefer the -ema variant\nsave_path  = \"\u002Fpath\u002Fto\u002Fmerged_osp_next_rl\"                    # 🔧 destination\n\nmodel = OSPNextModel.from_pretrained(model_path)\nmodel = load_lora_and_merge(\n    model=model,\n    lora_path=lora_path,\n    lora_rank=32,                  # must match lora_config.rank used in RL\n    lora_alpha=64,                 # must match lora_config.alpha used in RL\n    lora_target_modules=[\n        \"self_attn.q\", \"self_attn.k\", \"self_attn.v\", \"self_attn.o\",\n        \"cross_attn.q\", \"cross_attn.k\", \"cross_attn.v\", \"cross_attn.o\",\n    ],\n)\nmodel.save_pretrained(save_path)\n```\n\nOr just edit the four paths at the bottom of `merge_lora_weights.py` directly\nand run:\n\n```bash\npython merge_lora_weights.py\n```\n\nThe script will (1) wrap the base with the same PEFT `LoraConfig`, (2) load\nthe trained LoRA weights, (3) call PEFT's `merge_and_unload()` on the wrapped\nmodel to fold LoRA into the base, and (4) `save_pretrained()` the merged model.\n\n#### 🎬 Step 4 — Run inference with the merged model\n\nPoint your inference config's `pretrained_model_dir_or_checkpoint` at the\nmerged directory and launch as usual:\n\n```yaml\n# configs\u002Finfer\u002F{gpu,npu}\u002Fosp_14b.yaml\nmodel_config:\n  pretrained_model_dir_or_checkpoint: \"\u002Fpath\u002Fto\u002Fmerged_osp_next_rl\"   # 🔧\n```\n\n```bash\nbash scripts\u002Finfer\u002Fgpu\u002Finfer_osp_14b.sh           # or the NPU 
variant\n```\n\n---\n\n## 🛠️ Tips & Troubleshooting\n\n### 🔧 Sequence-parallel sizing\n\nInside any config, `fsdp_size × ddp_size = world_size`. The two SP groups\nmultiply inside the FSDP group:\n\n```\nsp_size  × skiparse_sp_size  ≤  fsdp_size\n```\n\nFor the 14B model with `sparse_ratio=2`, valid pairs (per-rank shard count\nmust evenly divide `sparse_ratio² = 4`) are:\n\n| `sp_size` | `skiparse_sp_size` | total SP factor |\n|----------:|-------------------:|----------------:|\n| 1         | 4                  | 4               |\n| 2         | 2                  | 4               |\n| 4         | 1                  | 4               |\n| 1         | 1                  | 1 (no SP)       |\n\n\n\n### 🐛 Common failures\n\n| Symptom                                             | Fix                                                                                                  |\n|-----------------------------------------------------|------------------------------------------------------------------------------------------------------|\n| `ImportError: cannot import name 'flash_attn'`      | Either install flash-attn manually, or ignore — code already falls back to SDPA.                     |\n| `RuntimeError: NPU error ... aclrtSetDevice`        | Forgot `source \u002Fusr\u002Flocal\u002FAscend\u002Fascend-toolkit\u002Fset_env.sh` before activating the conda env.         |\n| NPU op missing \u002F `aclnnXxx not found` at runtime    | Your CANN toolkit is older than the required 8.5.0 — re-install from \u003Chttps:\u002F\u002Fwww.hiascend.com\u002Fcann\u002Fdownload>. |\n| `pip install decord` fails on Ascend \u002F aarch64       | No prebuilt wheel exists for aarch64 + Py3.10 — build from source (see [step 3 of the NPU setup](#3-build-decord-from-source-aarch64-only)). |\n| `ImportError: libdecord.so: cannot open shared object` | Add the decord `build\u002F` directory to `LD_LIBRARY_PATH` (see the callout right under the decord build steps). 
|\n| `wandb` prompts for login                            | Either run `wandb login` once, or set `WANDB_MODE=offline` (every training script already does this). |\n| `from ospnext.quant_cy_npu import ...` fails        | Bundled kernel ABI mismatch — rebuild from upstream HiFloat8 and swap the package in (see [step 4 of the NPU setup](#4-optional--rebuild-the-hif8-npu-quant-kernel-only-on-import-error)). |\n| RL reward init crashes on startup                   | `multi_score` in `ospnext\u002Frewards\u002Frewards.py` currently calls `videoalign_score(device)` without a checkpoint path — wire the actual path through `load_from_pretrained=` and re-run.       |\n\n### 📚 Environment variables worth knowing\n\nAll shipped launch scripts (`scripts\u002F{infer,train}\u002F{gpu,npu}\u002F*.sh`) read these\nvariables via `${VAR:-default}`, so you can override them inline on the command\nline without touching the scripts.\n\n| Variable                      | Used by              | Default                       | Purpose                                                            |\n|-------------------------------|----------------------|-------------------------------|--------------------------------------------------------------------|\n| `MASTER_ADDR`                 | infer + train        | `127.0.0.1`                   | torchrun rendezvous host                                            |\n| `MASTER_PORT`                 | infer + train        | `29505` (infer) \u002F `29501` (train) | torchrun rendezvous port                                        |\n| `NPRC_PER_NODE` ⚠️             | infer + train        | `8`                           | Processes per node (see typo note below)                            |\n| `NNODES`                      | **infer only**       | `1`                           | Total nodes — used by the inference scripts                         |\n| `PET_NNODES`                  | **train only**       | `1`                           | Total nodes — 
used by the training scripts (legacy `pet`-style name) |\n| `RANK`                        | **train only**       | `0`                           | This node's rank (`0` for single-node, `0..N-1` for multi-node)     |\n| `WANDB_MODE`                  | train                | `offline` (preset in scripts) | Set to `online` to enable WandB upload, or keep `offline`           |\n| `PYTORCH_NPU_ALLOC_CONF`      | NPU train            | `expandable_segments:True`    | NPU memory allocator (preset in NPU scripts)                        |\n\n> ⚠️  **`NPRC_PER_NODE` is a typo**, but it's the variable the launch scripts\n> actually look for. If you set `NPROC_PER_NODE` (the conventional spelling)\n> it will be **silently ignored** and the script will fall back to `8`. We\n> kept the typo to preserve backward-compat with existing run histories — a\n> proper rename is tracked as a follow-up. **Always override with\n> `NPRC_PER_NODE=...`** until that is fixed.\n\n> 🔀 **Inference vs training use different multi-node vars.** Inference scripts\n> read `NNODES`, training scripts read `PET_NNODES` + `RANK`. See the\n> \"single \u002F multi-node\" snippets in the\n> [Inference Pipeline](#-inference-pipeline) and\n> [Training Pipeline](#-training-pipeline) sections for copy-paste examples.\n\n---\n\n## 🙏 Acknowledgements\n\nOSP-Next stands on the shoulders of giants. 
We gratefully build on:\n\n- 🌊 [**Wan**](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1) — the WAN-VAE and T5 backbone\n  components that power our text-to-video stack.\n- 🎬 [**Open-Sora-Plan**](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan) — the open-source\n  video diffusion ecosystem this project directly extends.\n- 🏅 [**VideoAlign**](https:\u002F\u002Fgithub.com\u002FKwaiVGI\u002FVideoAlign) — the multi-axis video-quality\n  reward model used during RL post-training.\n- 🎯 [**Mix-GRPO**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21802) — the mixed ODE-SDE flow-matching RL\n  algorithm at the heart of our sparse-model post-training pipeline.\n\nWe also welcome contributions of every size — bug reports, feature requests, and\nPRs all go a long way! Please file an [issue](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOSP-Next\u002Fissues)\nor open a pull request.\n\n---\n\n## 📝 Citation\n\nIf you find OSP-Next useful in your research, please consider citing:\n\n```bibtex\n@misc{ge2026ospnextefficienthighqualityvideo,\n      title={OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning}, \n      author={Yunyang Ge and Xianyi He and Zezhong Zhang and Bin Lin and Bin Zhu and Xinhua Cheng and Li Yuan},\n      year={2026},\n      eprint={2605.28691},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.28691}, \n}\n```\n\nRelated work this project builds on:\n\n```bibtex\n@article{wan2025wan,\n  title={Wan: Open and advanced large-scale video generative models},\n  author={Wan, Team and Wang, Ang and Ai, Baole and Wen, Bin and Mao, Chaojie and Xie, Chen-Wei and Chen, Di and Yu, Feiwu and Zhao, Haiming and Yang, Jianxiao and others},\n  journal={arXiv preprint arXiv:2503.20314},\n  year={2025}\n}\n\n@article{lin2024open,\n  title={Open-sora plan: Open-source large video generation 
model},\n  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},\n  journal={arXiv preprint arXiv:2412.00131},\n  year={2024}\n}\n\n@article{li2025mixgrpo,\n  title={Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde},\n  author={Li, Junzhe and Cui, Yutao and Huang, Tao and Ma, Yinping and Fan, Chun and Cheng, Yiming and Yang, Miles and Zhong, Zhao and Bo, Liefeng},\n  journal={arXiv preprint arXiv:2507.21802},\n  year={2025}\n}\n```\n\n---\n\n## 📄 License\n\nSee [`LICENSE.txt`](LICENSE.txt).\n\n---\n\n## ⭐ Star History\n\n\u003Cdiv align=\"center\">\n\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#PKU-YuanGroup\u002FOSP-Next&Date\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\"  srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=PKU-YuanGroup\u002FOSP-Next&type=Date&theme=dark\">\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=PKU-YuanGroup\u002FOSP-Next&type=Date\">\n    \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=PKU-YuanGroup\u002FOSP-Next&type=Date\" width=\"720\">\n  \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n\u003Cbr\u002F>\n\n\u003Csub>If this project helped you, a ⭐ goes a long way 🙌\u003C\u002Fsub>\n\n\u003C\u002Fdiv>\n\n","OSP-Next is an efficient, high-quality video generation framework that improves performance through sparse sequence parallelism, HiF8 quantization, and reinforcement learning. Its core components include Skiparse-2D Attention, a fixed-rule sparse attention pattern designed for images and videos that approaches the quality of full 3D attention and is compatible with FlashAttention kernels; Sparse Sequence Parallelism (SSP), a parallelism strategy co-designed with Skiparse-2D Attention that significantly reduces cross-node communication volume and eases the communication bottleneck in long-video training; and an HiF8 quantization scheme that achieves up to 2.27× inference speedup (on NPU) while preserving model accuracy. The Mix-GRPO + LoRA mechanism is the first attempt to apply reinforcement learning to sparse video generation models, further improving generation quality. The framework suits applications that need efficient, high-quality video generation, such as content creation and virtual reality.","2026-06-11 04:01:41","CREATED_QUERY"]