[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80656":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":22,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":27,"discoverSource":28},80656,"SCOPE","z2tong\u002FSCOPE","z2tong","SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models",null,"Python",72,7,50,0,2,23,1,2.71,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:04:05","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fteaser.jpg\">\n\n\u003Ch1>SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models\u003C\u002Fh1>\n\n**SCOPE** is an interactive world model for FPS games with 10-DoF action control, trained on 69K clips across 7 games.\n\n\u003C\u002Fdiv>\n\n\n\u003Cdiv align=\"center\">\n\n[![Project 
Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject%20Page-SCOPE-blue)](https:\u002F\u002Fz2tong.github.io\u002FSCOPE\u002F)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-red?logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.23345)\n[![Model](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=%F0%9F%A4%97%20Model&message=HuggingFace&color=yellow)](https:\u002F\u002Fhuggingface.co\u002Fzizhaotong\u002FSCOPE)\n[![Dataset](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=%F0%9F%A4%97%20Dataset&message=CrossFPS&color=orange)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fzizhaotong\u002Fcrossfps)\n[![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-SCOPE-purple?logo=modelscope)](https:\u002F\u002Fwww.modelscope.cn\u002Fcollections\u002Fzztong\u002FSCOPE)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green)](LICENSE)\n\n\n\u003C\u002Fdiv>\n\n-----\n\nWe are excited to introduce **SCOPE**, an open-source interactive world model for first-person shooter (FPS) games. 
It offers the following features:\n- **Hybrid Action Space**: Jointly processes continuous (4D dual-joystick) and discrete (6 binary buttons) control signals within a unified framework — the first FPS world model to do so.\n- **Dense Per-Frame Conditioning**: Resolves overlapping actions at every single frame, enabling simultaneous multi-action composition (e.g., moving + aiming + firing) that reflects real gameplay complexity.\n- **Cross-Game Generalization**: Trained on 7 diverse FPS titles, a single model generalizes zero-shot to entirely unseen game environments without fine-tuning.\n- **In-Scope \u002F Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon recoil, HUD) from stable out-of-scope world generation — without any segmentation labels.\n\n## 🎬 Demo Results\n\n\u003Cdiv align=\"center\">\n  \u003Ctable>\n    \u003Ctr>\n      \u003Ctd align=\"center\">\u003Cimg src=\"assets\u002Fdemo_it_takes_two.gif\" width=\"380\">\u003Cbr>\u003Csub>It Takes Two\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cimg src=\"assets\u002Fdemo_genshin_impact.gif\" width=\"380\">\u003Cbr>\u003Csub>Genshin Impact\u003C\u002Fsub>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd align=\"center\">\u003Cimg src=\"assets\u002Fdemo_black_myth_wukong.gif\" width=\"380\">\u003Cbr>\u003Csub>Black Myth: Wukong\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cimg src=\"assets\u002Fdemo_desert.gif\" width=\"380\">\u003Cbr>\u003Csub>Desert\u003C\u002Fsub>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n> Generated at 480×832 resolution, 81 frames @ 20 FPS (~4 seconds), conditioned on 10-DoF action inputs.\n\n## 🔥 News\n- May 2026: 🎉 We release the model weights, inference code, and CrossFPS dataset.\n\n## ⚙️ Quick Start\n\nThis codebase is built upon 
[Wan2.2](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2). Please refer to their documentation for environment setup.\n\n### Installation\n\nClone the repo:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fz2tong\u002FSCOPE.git\ncd SCOPE\n```\n\nInstall dependencies (we recommend [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) for fast, reproducible environment setup):\n```bash\n# Using uv (recommended)\nuv venv --python 3.10\nsource .venv\u002Fbin\u002Factivate\nuv pip install -r requirements.txt\nuv pip install -e .\n\n# Or using conda + pip\nconda create -n scope python=3.10 && conda activate scope\npip install -r requirements.txt\npip install -e .\n```\n\nInstall [`flash_attn`](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) (recommended for faster inference):\n```bash\npip install ninja\npip install flash-attn --no-build-isolation\n```\n\n### Model Download\n\nAll required weights (DiT + Text Encoder + VAE + Tokenizer) are packaged in a single HuggingFace repo:\n\n```bash\npip install \"huggingface_hub[cli]\"\nhuggingface-cli download zizhaotong\u002FSCOPE --local-dir .\u002FSCOPE\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Directory layout after download\u003C\u002Fb>\u003C\u002Fsummary>\n\n```\nSCOPE\u002F\n├── model-00001-of-00003.safetensors       # DiT shard 1 (~5.0 GB)\n├── model-00002-of-00003.safetensors       # DiT shard 2 (~5.0 GB)\n├── model-00003-of-00003.safetensors       # DiT shard 3 (~4.6 GB)\n├── model.safetensors.index.json           # Shard index\n├── models_t5_umt5-xxl-enc-bf16.pth       # Text Encoder (UMT5-XXL, ~20 GB)\n├── Wan2.2_VAE.pth                         # Video VAE (~700 MB)\n├── google\u002Fumt5-xxl\u002F                       # Tokenizer\n│   ├── config.json\n│   ├── spiece.model\n│   └── tokenizer_config.json\n└── config.json                            # Model config\n```\n\n\u003C\u002Fdetails>\n\n### Inference\n\n**Single image + action:**\n\n```bash\npython inference.py \\\n    
--model_dir .\u002FSCOPE \\\n    --input_image examples\u002Fexample_0\u002Fimage.png \\\n    --action_path examples\u002Fexample_0\u002Faction.parquet \\\n    --prompt \"In a whimsical, toy-inspired garden, the first-person view reveals a tactical weapon aimed forward along a sunlit path.\" \\\n    --output_dir .\u002Foutputs\n```\n\n**Batch processing (directory of images):**\n\n```bash\npython inference.py \\\n    --model_dir .\u002FSCOPE \\\n    --input_image_dir .\u002Fmy_images \\\n    --action_path examples\u002Fexample_1\u002Faction.parquet \\\n    --prompt \"A breathtaking view of a vibrant fantasy world, seen through an FPS perspective.\" \\\n    --output_dir .\u002Foutputs\n```\n\n### Examples\n\nWe provide 3 ready-to-run examples in `examples\u002F`:\n\n| Example | Scene | Command |\n|---------|-------|---------|\n| `example_0` | It Takes Two | `python inference.py --model_dir .\u002FSCOPE --input_image examples\u002Fexample_0\u002Fimage.png --action_path examples\u002Fexample_0\u002Faction.parquet --prompt \"$(cat examples\u002Fexample_0\u002Fprompt.txt)\"` |\n| `example_1` | Genshin Impact | `python inference.py --model_dir .\u002FSCOPE --input_image examples\u002Fexample_1\u002Fimage.png --action_path examples\u002Fexample_1\u002Faction.parquet --prompt \"$(cat examples\u002Fexample_1\u002Fprompt.txt)\"` |\n| `example_2` | Black Myth: Wukong | `python inference.py --model_dir .\u002FSCOPE --input_image examples\u002Fexample_2\u002Fimage.png --action_path examples\u002Fexample_2\u002Faction.parquet --prompt \"$(cat examples\u002Fexample_2\u002Fprompt.txt)\"` |\n\n**Tips**: If you have sufficient CUDA memory, you may increase the `--max_frames` parameter (e.g., to 161) to generate longer videos. 
The frame count must satisfy `n % 4 == 1`.\n\n### Arguments\n\n| Argument | Default | Description |\n|----------|---------|-------------|\n| `--model_dir` | *(required)* | Path to model directory containing all weights |\n| `--input_image` | None | Single input image (first frame) |\n| `--input_image_dir` | None | Directory of images for batch mode |\n| `--action_path` | *(required)* | Action signal file (`.parquet`) |\n| `--prompt` | `\"\"` | Text prompt describing the scene |\n| `--output_dir` | `.\u002Foutputs` | Output directory |\n| `--height` | 480 | Video height (pixels) |\n| `--width` | 832 | Video width (pixels) |\n| `--max_frames` | 81 | Max frames (must satisfy `n % 4 == 1`) |\n| `--num_inference_steps` | 30 | Diffusion denoising steps |\n| `--seed` | 0 | Random seed for reproducibility |\n\n## 🎮 Action Signal Format\n\nThe `.parquet` file defines per-frame player inputs. Each row corresponds to one raw video frame (81 rows = 81 frames).\n\n### Controller Buttons (6D binary: 0 or 1)\n\n| Column | Game Action | Controller Button |\n|--------|-------------|-------------------|\n| `right_trigger` | Fire | RT |\n| `left_trigger` | Aim Down Sights | LT |\n| `south` | Jump | A |\n| `right_thumb` | Melee | R3 |\n| `west` | Reload | X |\n| `north` | Weapon Switch | Y |\n\n### Dual Joystick (4D continuous)\n\n| Column | Axes | Function |\n|--------|------|----------|\n| `j_left` | `[x, y]` | Character movement (left stick) |\n| `j_right` | `[x, y]` | Camera\u002Faim rotation (right stick) |\n\n> **Note**: The model handles 4× temporal compression internally via the VAE. 
Action sequences should have one entry per raw frame (not per latent frame).\n\n## 🏗️ Model Architecture\n\nBuilt on [Wan2.2-TI2V-5B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B) (~5B parameters, BFloat16), the model inserts an `ActionModule` into each of the 30 DiT transformer blocks, enabling per-pixel action-conditioned video generation at 480×832 resolution, 81 frames @ 20 FPS (~4 seconds).\n\nEach `ActionModule` contains two conditioning paths:\n- **Mouse\u002FJoystick Path**: Sliding-window temporal features → MLP fusion → pixel-wise temporal self-attention with RoPE\n- **Keyboard\u002FButton Path**: Button embedding → temporal windowing → cross-attention (video queries, keyboard keys\u002Fvalues)\n\nBoth output projections are zero-initialized for stable residual training on top of frozen pretrained weights. For detailed architecture specifications, see the [model card on HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fzizhaotong\u002FSCOPE).\n\n## 📁 Project Structure\n\n```\nSCOPE\u002F\n├── inference.py                  # Inference entry point\n├── pyproject.toml                # Package configuration\n├── README.md\n├── assets\u002F                       # Documentation assets\n├── examples\u002F                     # Ready-to-run inference examples\n│   ├── example_0\u002F               # It Takes Two\n│   ├── example_1\u002F               # Genshin Impact\n│   └── example_2\u002F               # Black Myth: Wukong\n└── diffsynth\u002F                    # Core inference library\n    ├── models\u002F\n    │   └── scope_dit.py          # DiT with ActionModule\n    ├── pipelines\u002F\n    │   └── scope_pipeline.py     # Video generation pipeline\n    ├── diffusion\u002F                # Flow-matching scheduler\n    ├── core\u002F                     # Model loading & VRAM management\n    └── utils\u002F                    # Video I\u002FO utilities\n```\n\n## 💻 Hardware Requirements\n\n| Resource | Minimum | Recommended 
|\n|----------|---------|-------------|\n| GPU VRAM | 24 GB (with CPU offload) | 80 GB (A100\u002FH200) |\n| System RAM | 32 GB | 64 GB |\n| Disk Space | ~45 GB | — |\n\n## 📊 CrossFPS Dataset\n\nThe model is trained on [**CrossFPS**](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fzizhaotong\u002Fcrossfps), the first multi-game FPS dataset with frame-aligned action telemetry:\n\n| Property | Value |\n|----------|-------|\n| Games | 7 diverse FPS titles |\n| Total Clips | 69,000+ |\n| Action Dimensions | 10-DoF (6 buttons + 4D joystick) |\n| Annotation | Frame-aligned action telemetry |\n| Curation | Gameplay-bias removal for general visual-to-action mapping |\n\n## 📜 License\n\nThis project is licensed under the [Apache License 2.0](LICENSE).\n\n## ✨ Acknowledgements\n\nWe would like to express our gratitude to the following open-source projects for their invaluable contributions:\n- [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio) — Base diffusion framework\n- [Wan2.2](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B) — Pretrained video generation model\n- [UMT5-XXL](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fumt5-xxl) — Multilingual text encoder\n\n## 📖 Citation\n\nIf you find this work useful for your research, please cite our paper:\n\n```bibtex\n@misc{scope2026,\n      title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},\n      author={Zizhao Tong and Hongfeng Lai and Zeqing Wang and Zhaohu Xing and Kexu Cheng and Haoran Xu and Zhao Pu and Shangwen Zhu and Ruili Feng and Jian Zhao and Yan Zhang and Hao Tang and Yeying Jin and Ling Shao},\n      year={2026},\n      eprint={2605.23345},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.23345}, \n}\n```\n","SCOPE is an interactive world model for first-person shooter (FPS) games that supports 10-DoF action control. Through core features such as a hybrid action space that handles both continuous and discrete control signals, dense per-frame conditioning, and cross-game generalization, the project can simulate complex, realistic game environments. In particular, SCOPE 
works in unseen game environments without fine-tuning and can distinguish localized effects from the stable world background during generation. It is suited to building virtual environments that require a high degree of interactivity and realism, such as AI behavior prediction or training in game development and virtual reality experience design.","2026-06-11 04:01:32","CREATED_QUERY"]