[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-85155":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":13,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":16,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":19,"topics":20,"createdAt":10,"pushedAt":10,"updatedAt":21,"readmeContent":22,"aiSummary":10,"trendingCount":13,"starSnapshotCount":13,"syncStatus":15,"lastSyncTime":23,"discoverSource":24},85155,"next-forcing","gangweix\u002Fnext-forcing","gangweix","Next Forcing: World Action Modeling with Multi-Chunk Prediction (MCP)","https:\u002F\u002Fgangweix.github.io\u002Fnext-forcing\u002F",null,"JavaScript",51,0,1,2,33,false,"main",true,[],"2026-06-15 10:04:58","\u003Ch1 align=\"center\">Next Forcing:\u003Cbr>Causal World Modeling with Multi-Chunk Prediction\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu\u003C\u002Fstrong>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgangweix.github.io\u002Fnext-forcing\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue\" alt=\"Project page\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2606.11187\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-b31b1b\" alt=\"Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgangweix\u002Fnext-forcing\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-coming_soon-lightgrey\" alt=\"Code\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Overview\n\nNext Forcing tackles the myopic supervision problem in autoregressive video world models: next-chunk denoising often learns local appearance shortcuts instead of long-range dynamics, especially at high frame rates.\n\nBy training lightweight Multi-Chunk Prediction (MCP) modules to predict multiple future chunks, Next Forcing provides denser temporal supervision, achieves faster and more stable convergence across frame rates, sets new state-of-the-art results on RoboTwin, and enables `2x` inference acceleration via parallel chunk generation.\n\n## Highlights\n\n- **Multi-Chunk Prediction (MCP):** auxiliary modules predict `next^1`, `next^2`, and `next^3` chunks to provide long-range temporal supervision beyond the current chunk.\n- **Faster and stable training:** Next Forcing converges faster and reaches higher success rates across frame rates, with the strongest gains at high FPS where appearance shortcuts are most severe.\n- **LLM-style inference acceleration:** the MCP module can be retained at inference to predict the next chunk in parallel with the current chunk, similar in spirit to parallel\u002Fspeculative decoding in LLMs.\n\n## Method\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Fnext-forcing-method-architecture.png\" alt=\"Next Forcing method architecture\" width=\"95%\">\n\u003C\u002Fp>\n\nDuring training, the main model denoises the current chunk, while lightweight MCP modules predict multiple future chunks through a causal chain. These future prediction losses provide dense temporal supervision to the backbone and encourage the model to learn long-range dynamics instead of local appearance shortcuts.\n\nThe same trained checkpoint supports two inference modes:\n\n- **Zero-overhead mode:** remove MCP modules and run the main model exactly like the baseline.\n- **MCP-accelerated mode:** keep the first MCP module so one autoregressive step produces both the current chunk and the next chunk.\n\n## Results\n\n### Training Convergence\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Frobotwin-convergence-results.png\" alt=\"RoboTwin convergence comparison\" width=\"95%\">\n\u003C\u002Fp>\n\nNext Forcing converges faster than LingBot-VA across frame rates. The gain is most pronounced at `50 fps`: on the Random setting, Next Forcing reaches LingBot-VA's `45k`-step accuracy at only `20k` steps, corresponding to `2.3x` faster convergence.\n\n### Final RoboTwin Accuracy\n\nNext Forcing achieves the best average success rate on the RoboTwin benchmark across 50 bimanual manipulation tasks.\n\n| Setting | X-VLA | pi_0 | pi_0.5 | Motus | Being-H0.7 | Fast-WAM | LingBot-VA | **Next Forcing** |\n| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |\n| Clean | 72.9 | 65.9 | 82.7 | 88.7 | 90.2 | 91.9 | 92.9 | **94.1** |\n| Random | 72.8 | 58.4 | 76.8 | 87.0 | 89.6 | 91.8 | 91.5 | **93.5** |\n\n### Inference Acceleration\n\nMCP-accelerated inference predicts the next video chunk in parallel with the current chunk, reducing sequential video denoising cost while preserving comparable accuracy.\n\n| Inference Mode | 12 fps Clean | 12 fps Random | 25 fps Clean | 25 fps Random | 50 fps Clean | 50 fps Random |\n| --- | ---: | ---: | ---: | ---: | ---: | ---: |\n| Standard | 94.1 | 93.5 | 92.6 | 91.4 | 91.8 | 90.5 |\n| MCP-accelerated (`2x`) | 93.5 | 90.6 | 91.0 | 89.8 | 92.2 | 91.3 |\n\n### PhyWorld\n\nOn PhyWorld, Next Forcing improves both video quality and physical consistency over LingBot-VA.\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth rowspan=\"2\">Method\u003C\u002Fth>\n      \u003Cth colspan=\"2\">FVD (&darr;)\u003C\u002Fth>\n      \u003Cth colspan=\"2\">Abnormal Ratio (&darr;)\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>OOT\u003C\u002Fth>\n      \u003Cth>IT\u003C\u002Fth>\n      \u003Cth>OOT\u003C\u002Fth>\n      \u003Cth>IT\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd>LingBot-VA\u003C\u002Ftd>\n      \u003Ctd align=\"right\">5.3\u003C\u002Ftd>\n      \u003Ctd align=\"right\">3.5\u003C\u002Ftd>\n      \u003Ctd align=\"right\">12%\u003C\u002Ftd>\n      \u003Ctd align=\"right\">3%\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Cstrong>Next Forcing\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd align=\"right\">\u003Cstrong>4.7\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd align=\"right\">\u003Cstrong>3.2\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd align=\"right\">\u003Cstrong>8%\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd align=\"right\">\u003Cstrong>2%\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n### General Video Pretraining\n\nOn 3.5M in-house general video clips, Next Forcing also improves pure video generation after removing the action stream.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Fgeneral-video-fvd-curves.png\" alt=\"General video pretraining FVD curves\" width=\"80%\">\n\u003C\u002Fp>\n\nAt `50k` training steps, Next Forcing reduces FVD by `58%` on Test Set 1 (`94` vs. `225`) and by `52%` on Test Set 2 (`97` vs. `204`). It also surpasses LingBot-VA's `50k`-step FVD with only `10k` training steps.\n\n## Project Status\n\n- [x] Project page and demos\n- [x] Paper\n- [ ] Training and inference code\n- [ ] Model checkpoints\n\n## Citation\n\n```bibtex\n@article{nextforcing,\n  title={Next Forcing: Causal World Modeling with Multi-Chunk Prediction},\n  author={Gangwei Xu and Qihang Zhang and Jiaming Zhou and Xing Zhu and Yujun Shen and Xin Yang and Yinghao Xu},\n  journal={},\n  year={2026}\n}\n```\n","2026-06-15 02:30:10","CREATED_QUERY"]