[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83018":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":14,"stars30d":15,"stars90d":12,"forks30d":12,"starsTrendScore":13,"compositeScore":12,"rankGlobal":9,"rankLanguage":9,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":17,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":12,"starSnapshotCount":12,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},83018,"ForgeWM","asdfo123\u002FForgeWM","asdfo123","Train a real-time, playable Minecraft video world model on 8 GPUs, with keyboard & mouse control, fully open and reproducible.",null,"Python",61,0,1,3,5,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:04:30","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fbanner.png\" width=\"700\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>Train a real-time, playable Minecraft world model on 8 GPUs — keyboard & mouse control, fully open and reproducible.\u003C\u002Fb>\n\u003C\u002Fp>\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-green\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGPUs-8×H20-orange\">\n  \u003Ca href=\"https:\u002F\u002Fasdfo123.github.io\u002FForgeWM\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐%20Project-Page-blue\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fasdfo123\u002FForgeWM\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-Models-yellow\">\u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fasdfo123\u002FForgeWM-data\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-Data-yellow\">\u003C\u002Fa>\n  \u003Ca href=\"你的arXiv链接\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-coming%20soon-red\">\u003C\u002Fa>\n  \u003Ca href=\"assets\u002Fwechat.JPG\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-Group-07C160?logo=wechat&logoColor=white\">\u003C\u002Fa>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fasdfo123.github.io\u002FForgeWM\u002F\">Project Page\u003C\u002Fa> •\n  \u003Ca href=\"#results\">Results\u003C\u002Fa> •\n  \u003Ca href=\"#quick-start\">Quick Start\u003C\u002Fa> •\n  \u003Ca href=\"#training-pipeline\">Training\u003C\u002Fa> •\n  \u003Ca href=\"#acknowledgements\">Acknowledgements\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n## About\n\nForgeWM is an open-source framework for training interactive world models that respond to keyboard and mouse inputs. We integrate [Matrix-Game 2](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FMatrix-Game)'s game-native I2V backbone, [GameFactory](https:\u002F\u002Fgithub.com\u002FKlingAIResearch\u002FGameFactory)'s open Minecraft data, and the [Causal Forcing](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FCausal-Forcing) distillation pipeline into an end-to-end system reproducible on 8 GPUs.\n\n\n### Why does this exist?\n\nMatrix-Game 2 open-sourced the weights — but not the training data or the training code.\n\nCausal Forcing (and Causal Forcing++) gave the community a strong distillation paradigm, and [minWM](https:\u002F\u002Fgithub.com\u002Fshengshu-ai\u002FminWM) provides an excellent open reference for it on camera-controlled video. 
But that line targets continuous camera trajectories on general T2V\u002FTI2V backbones — not the discrete keyboard-and-mouse control that interactive games actually use.\n\nForgeWM fills the remaining gap: a fully open, end-to-end pipeline that brings Causal Forcing to discrete-action, game-native world models — built on the MG2 lineage, trained on open GameFactory data, reproducible on 8 GPUs.\n\n---\n\n## Results\n\n### ForgeWM (4-step DMD) vs Matrix-Game 2 (Self-Forcing Distillation)\n\nSame reference frame, same action. Left: MG2 official distilled model. Right: ForgeWM Stage 3.\n\n| Scene | Matrix-Game 2 | ForgeWM |\n|-------|--------------|---------|\n| Forest (turn right) | \u003Cimg src=\"assets\u002Fresults\u002Fmg2_forest_turn_right.gif\" width=\"320\"> | \u003Cimg src=\"assets\u002Fresults\u002Fforge_forest_turn_right.gif\" width=\"320\"> |\n| Plains (forward) | \u003Cimg src=\"assets\u002Fresults\u002Fmg2_plains_forward.gif\" width=\"320\"> | \u003Cimg src=\"assets\u002Fresults\u002Fforge_plains_forward.gif\" width=\"320\"> |\n| Underwater Cave (forward) | \u003Cimg src=\"assets\u002Fresults\u002Fmg2_cave_forward.gif\" width=\"320\"> | \u003Cimg src=\"assets\u002Fresults\u002Fforge_cave_forward.gif\" width=\"320\"> |\n| Desert (back) | \u003Cimg src=\"assets\u002Fresults\u002Fmg2_desert_back.gif\" width=\"320\"> | \u003Cimg src=\"assets\u002Fresults\u002Fforge_desert_back.gif\" width=\"320\"> |\n| Rainy sunset (random) | \u003Cimg src=\"assets\u002Fresults\u002Fmg2_night_random.gif\" width=\"320\"> | \u003Cimg src=\"assets\u002Fresults\u002Fforge_night_random.gif\" width=\"320\"> |\n| Rainy night (forward) | \u003Cimg src=\"assets\u002Fresults\u002Fmg2_sunset_forward.gif\" width=\"320\"> | \u003Cimg src=\"assets\u002Fresults\u002Fforge_sunset_forward.gif\" width=\"320\"> |\n\n**Observations:**\n\n- **Overall quality**: ForgeWM largely reproduces MG2's generation quality at 4-step inference. 
Temporal smoothness is slightly better; fine-grained texture detail is slightly weaker (likely due to smaller training data: GameFactory ~70h vs MG2's proprietary dataset).\n- **\"Underwater\" artifact fixed**: MG2's original model tends to drift into underwater\u002Focean textures when encountering rain, blue sky, or dark scenes (rows 4–6) — likely caused by an over-representation of ocean footage in its proprietary training data. ForgeWM, trained on GameFactory's balanced action distribution, does not exhibit this failure mode.\n- **Action controllability**: Both models respond correctly to keyboard\u002Fmouse inputs. ForgeWM's Causal Forcing distillation preserves action fidelity through all 4 stages.\n- At the official 360p inference setting, we observed that MG2's HUD elements (e.g. the hotbar) gradually shrink over a rollout — a possible train\u002Finference resolution mismatch. ForgeWM does not show this under our setting.\n\n\n> Both models use 4-step inference at 352×640. MG2 uses the official Self-Forcing distilled checkpoint; ForgeWM trains from scratch on open GameFactory data with Causal Forcing.\n\n---\n\n## Comparison\n\n| Project | Base Model | Control | Paradigm | I2V | Data Open | Train Code |\n|---------|-----------|---------|----------|-----|-----------|------------|\n| **ForgeWM** | Wan2.1-1.3B | Keyboard + Mouse | Causal Forcing | ✅ | ✅ GameFactory | ✅ |\n| MG2 (Skywork) | Wan2.1-1.3B | Keyboard + Mouse | Self Forcing | ✅ | ❌ | ❌ (inference only) |\n| minWM | HY1.5 \u002F Wan2.1 | Camera pose | Causal Forcing | HY only | ✅ (camera data) | ✅ |\n\n> minWM's HY15 line supports TI2V (text+image→video); the Wan2.1 line is T2V+camera only. 
Their open data is camera-trajectory based, not game-specific keyboard\u002Fmouse actions.\n\n---\n\n## Quick Start\n\n### Prerequisites\n\n```bash\npip install -r requirements.txt\npip install flash-attn --no-build-isolation\n```\n\n### Download Models & Data\n\n```bash\n# MG2 base model (~9 GB)\nbash scripts\u002Fdownload_models.sh\n\n# ForgeWM checkpoints (all 4 stages: stage0 \u002F stage1 \u002F stage2 \u002F stage3)\nhuggingface-cli download asdfo123\u002FForgeWM --local-dir .\u002Fckpts --repo-type model\n\n# Training data (pre-encoded 360p LMDB, ~89 GB)\nhuggingface-cli download asdfo123\u002FForgeWM-data --local-dir .\u002Fdata\u002Faction_lmdb --repo-type dataset\n```\n\n### Inference (Single GPU)\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python inference.py \\\n    --checkpoint_path ckpts\u002Fstage3\u002Fmodel.pt \\\n    --image_path demo_images\u002Fforest.png \\\n    --action_type forward \\\n    --num_frames 21 \\\n    --output_path output\u002Fdemo.mp4\n```\n\nSupported actions: `forward`, `back`, `turn_right`, `turn_left`, `look_up`, `look_down`, `left`, `right`, `random`, `no_action`\n\n---\n\n## Training Pipeline\n\nTraining is a 4-stage progressive distillation in which each stage builds on the previous one:\n\n| Stage | Method | Steps (8×H20) |\n|-------|--------|--------------|\n| 0 | Bidirectional SFT (domain adaptation) | 4000 |\n| 1 | Teacher-Forcing Causal AR | 10000 |\n| 2 | Consistency Distillation | 6000 |\n| 3 | DMD (4-step real-time) | 2000 |\n\n```bash\n# Full pipeline\ntorchrun --nproc_per_node=8 train.py --config_path configs\u002Fstage0_bid_sft.yaml --logdir logs\u002Fstage0\ntorchrun --nproc_per_node=8 train.py --config_path configs\u002Fstage1_teacher_forcing.yaml --logdir logs\u002Fstage1\ntorchrun --nproc_per_node=8 train.py --config_path configs\u002Fstage2_consistency_distillation.yaml --logdir logs\u002Fstage2\ntorchrun --nproc_per_node=8 train.py --config_path configs\u002Fstage3_dmd.yaml --logdir logs\u002Fstage3\n```\n\n---\n\n## Data Preparation\n\nYou 
can either download the pre-encoded LMDB directly, or build it yourself from the raw GameFactory dataset.\n\n### Option A: Download pre-encoded data (recommended)\n\n```bash\nhuggingface-cli download asdfo123\u002FForgeWM-data --local-dir .\u002Fdata\u002Faction_lmdb --repo-type dataset\n```\n\n### Option B: Build from GF-Minecraft\n\nRequires the [GameFactory](https:\u002F\u002Fgithub.com\u002FKlingAIResearch\u002FGameFactory) GF-Minecraft dataset (~70h gameplay videos + action labels).\n\n```bash\n# 1. Download GF-Minecraft (see GameFactory repo for instructions)\n#    Expected structure: data_2003\u002Fvideo\u002F*.mp4 + data_2003\u002Fmetadata\u002F*.json\n\n# 2. Encode into sharded LMDB (8 GPUs, ~2-3 hours)\nGF_DATA=\u002Fpath\u002Fto\u002FGF-Minecraft\u002Fdata_2003 bash scripts\u002Fprepare_data_all.sh\n```\n\nThe script:\n- Resizes videos to 352×640 (aspect-preserving crop)\n- Encodes through Wan2.1 VAE → latent (21, 16, 44, 80) per clip\n- Flips pitch sign (GF: +pitch = look-down → MG2: mouse[0] > 0 = look-up)\n- Parses keyboard into 4-dim one-hot (W\u002FS\u002FA\u002FD)\n- Outputs 10 shards × 4000 clips = 40,000 training clips (~89 GB total)\n\n---\n\n## Architecture\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Farchitecture.png\" >\n\u003C\u002Fp>\n\n> **Architecture is identical to [Matrix-Game 2](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FMatrix-Game)** — derived from WanX by removing the text branch and adding a hybrid action module (keyboard cross-attention + mouse channel concat). *Figure adapted from the Matrix-Game 2 paper.*\n\n### I2V Conditioning (First-Frame Fidelity)\n\nUnlike T2V models that generate from text alone, ForgeWM uses a three-pathway image conditioning mechanism inherited from Matrix-Game 2:\n\n1. **Channel-concat**: The first frame is VAE-encoded and concatenated channel-wise with the noise input (`cond_concat = [4-ch mask | 16-ch img_latent]`, 20 channels total). 
A binary mask marks frame 0 as \"real\" and subsequent frames as \"to generate\". This gives the model pixel-level reference for the opening frame.\n2. **CLIP visual context**: The first frame is separately encoded through a CLIP vision encoder into a 257-token sequence, injected via cross-attention at every transformer block. This provides high-level semantic guidance (scene type, lighting, objects) that persists across the entire generation.\n3. **Causal history**: During autoregressive rollout, previously generated (clean) frames are cached in the KV store.\n\n\n### Action Injection\n\n- **Keyboard (discrete)**: Cross-attention injection into each transformer block — keyboard actions are embedded and attend to latent frame tokens\n- **Mouse (continuous)**: Concatenation with sliding-window grouping (VAE temporal compression ratio = 4) — continuous deltas are grouped per-frame and concatenated with latent features\n\n### Temporal Architecture\n\n- **Block-wise causal attention**: frames are grouped into chunks of `num_frame_per_block=3`; within a chunk, attention is bidirectional; across chunks, strictly causal\n- **Sliding window** (`local_attn_size=6`): each chunk only attends to the 6 most recent frames, enabling unbounded-length generation at inference without memory growth\n\n---\n\n## Roadmap\n\n### Released\n- ✅ 4-stage training pipeline (Bid SFT → TF AR → CD → DMD)\n- ✅ Action-conditioned inference\n- ✅ All 4 stage checkpoints — Stage 0 \u002F 1 \u002F 2 \u002F 3 ([HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fasdfo123\u002FForgeWM))\n- ✅ Pre-encoded training data ([HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fasdfo123\u002FForgeWM-data))\n\n### In progress\n- 🚧 Interactive real-time demo\n- 🚧 Tech report\n\n### Future \u002F community\n- 💭 Multi-game support beyond Minecraft (FPS, racing, etc.)\n- 💭 Larger backbones (Wan2.2-5B, HY1.5)\n- 💭 Open to PRs & Collaboration.\n\n---\n\n## Acknowledgements\n\nForgeWM integrates work from 
multiple research groups:\n\n| Component | Source |\n|-----------|--------|\n| Base model | [Matrix-Game 2](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FMatrix-Game) |\n| Training data | [GameFactory](https:\u002F\u002Fgithub.com\u002FKlingAIResearch\u002FGameFactory) |\n| Distillation | [Causal Forcing](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FCausal-Forcing) |\n\nWe also thank the authors of:\n- [Self-Forcing](https:\u002F\u002Fgithub.com\u002Fguandeh17\u002FSelf-Forcing)\n- [CausVid](https:\u002F\u002Fgithub.com\u002Ftianweiy\u002FCausVid)\n- [Wan 2.1](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1)\n- [minWM](https:\u002F\u002Fgithub.com\u002Fshengshu-ai\u002FminWM)\n- [GameCraft](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FHunyuan-GameCraft-1.0)\n- [HunyuanVideo](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FHunyuanVideo-1.5)\n\n---\n\n## Contact\n\n- Email: leeasdfo123@gmail.com\n- [WeChat Group](.\u002Fassets\u002Fwechat.JPG)[\u003Cimg src=\"assets\u002Fwechat.JPG\" width=\"200\">](assets\u002Fwechat.JPG)\n\n---\n\n## Citation\n\n```bibtex\n@misc{forgewm2026,\n  title={ForgeWM: A Reproducible Training Recipe for Action-Controllable World Models},\n  author={ForgeWM Team},\n  year={2026},\n  url={https:\u002F\u002Fgithub.com\u002Fasdfo123\u002FForgeWM}\n}\n```\n\n---\n\n## License\n\nApache License 2.0 — see [LICENSE](LICENSE).\n","ForgeWM is an open-source framework for training interactive Minecraft world models with keyboard and mouse control. The project integrates Matrix-Game 2's game-native I2V backbone, GameFactory's open Minecraft data, and the Causal Forcing distillation pipeline into an end-to-end system that is fully reproducible on 8 GPUs. Its core capability is generating game frames in real time in response to player input, combining advanced machine learning techniques to achieve a high degree of realism and interactivity. It suits research scenarios that require complex human-computer interaction in virtual environments, such as game AI development and enhanced virtual reality experiences.",2,"2026-06-11 04:09:53","CREATED_QUERY"]