[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80095":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":12,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":13,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":25,"discoverSource":26},80095,"WorldVLN.code","EmbodiedCity\u002FWorldVLN.code","EmbodiedCity",null,"Python",77,7,3,2,0,17,2.71,"Creative Commons Attribution 4.0 International",false,"main",true,[],"2026-06-12 02:03:58","# WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.15964-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.15964)\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Website-blue.svg)](https:\u002F\u002Fembodiedcity.github.io\u002FWorldVLN\u002F)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Model%20weights-yellow.svg)](https:\u002F\u002Fhuggingface.co\u002FEmbodiedCity\u002FWorldVLN)\n\nThis repository provides the reference implementation for WorldVLN. It includes autoregressive inference for closed-loop action prediction, as well as a two-stage training pipeline: (1) supervised backbone and action decoder training, and (2) action-aware GRPO-based optimization.\n\n\n## Installation\n\nWe recommend using a single Python 3.10 environment for the released workflows. 
In our validated launch scripts, the Python interpreter is passed explicitly through `PYTHON_BIN`, so after activating your environment it is recommended to export:\n\n```bash\nexport PYTHON_BIN=$(which python)\n```\n\n### Recommended Environment\n\n1. Create a Python 3.10 environment.\n\n```bash\nconda create -n worldvln python=3.10\nconda activate worldvln\n```\n\n2. Install a PyTorch build that matches your CUDA environment. For the released training and action-aware GRPO workflows, a PyTorch 2.5.1 environment is the recommended baseline.\n\n3. Install the shared dependencies used by the released workflows.\n\n```bash\npip install -r requirements.txt\n```\n\n## Setup\n\n### Model Weights\n\nOfficial WorldVLN weights are available on Hugging Face:\n\n- [WorldVLN weights](https:\u002F\u002Fhuggingface.co\u002FEmbodiedCity\u002FWorldVLN)\n\nDownload the weights to your preferred checkpoint directory and configure the relevant training or inference scripts to point to them.\n\nSimulation \u002F benchmark resources:\n\n- IndoorUAV simulation environment download guide: [Indoor_UAV](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002Fvalyentine\u002FIndoor_UAV)\n- UAV-Flow benchmark and evaluation environment: [buaa-colalab\u002FUAV-Flow](https:\u002F\u002Fgithub.com\u002Fbuaa-colalab\u002FUAV-Flow)\n\n## Inference\n\nThe repository currently provides two main inference entry points.\n\n![WorldVLN model](.\u002Fassets\u002Fmodel.png)\n\n### Online Inference Service\n\n#### Quick start\n\nFrom the repository root:\n\n```bash\nexport PYTHON_BIN=$(which python)\nexport INFINITY_CKPT=\u002Fpath\u002Fto\u002Finfinity\u002Fglobal_step_xxx.pth\nexport STAGE2_LATENT2ACTION_CKPT=\u002Fpath\u002Fto\u002Fstage2_latent2action_combined.pt\n\nbash infer\u002Frun_server.sh\n```\n\nCommon environment variables:\n\n- `INFINITY_CKPT`: main InfinityStar \u002F WorldVLN checkpoint used by the service\n- `STAGE2_LATENT2ACTION_CKPT`: Stage-2 latent-to-action checkpoint for action 
prediction\n- `INFINITY_SERVER_CONFIG`: optional override for `infer\u002Fconfig.json`\n- `INFINITY_REPO_ROOT`: optional override for the default `Worldmodel\u002Fruntime\u002F`\n- `INFINITY_LATENT_CACHE_ROOT`: runtime cache directory used by the service\n- `HOST`, `PORT`: bind address for Uvicorn\n\n#### Details\n\n- Entry points: [infer\u002Frun_server.sh](.\u002Finfer\u002Frun_server.sh), [infer\u002Fserver.py](.\u002Finfer\u002Fserver.py)\n- Configuration: [infer\u002Fconfig.json](.\u002Finfer\u002Fconfig.json)\n- Windows-side client: [infer\u002Fclient.py](.\u002Finfer\u002Fclient.py)\n\n#### Autoregressive I\u002FO contract (how clients talk to the server)\n\nWorldVLN runs in an **autoregressive** closed-loop protocol over a trajectory `session_id`:\n\n- **Input (per call)**: `images_base64` (RGB frames) + an optional `instruction` on the first call.\n  - The first call typically sends **1 warmup frame**.\n  - Subsequent calls send **`step` frames** (default 16) to advance the session timeline.\n- **State (server-side)**: the server stores history by `session_id` and maintains a streaming world-model session.\n- **Output (per call)**: `actions` as delta actions in **cm\u002Fdeg** with order `[dx, dy, dz, droll, dyaw, dpitch]`.\n  - In the default `tsformer_latent` mode, the server emits **one segment's worth of actions** whenever enough frames have been received.\n\nExample (strict closed-loop, with `allow_future_segments=1` enabled):\n\n- Send **(1 real frame + instruction\u002Fprompt)** → get the next **`step` actions** (typically 16).\n- Execute them, collect the next **`step` real frames**, send them → get the next **`step` actions**.\n- Repeat with the same `session_id` until the trajectory ends.\n\nClients in this repo follow the same pattern:\n\n- **`infer\u002Fclient.py`** (simulation \u002F dataset example): sends `1, step, step, ...` frames under a stable `session_id`, and writes per-segment `*_actions.json` \u002F `*_poses.json`.\n- 
**`action_aware_grpo\u002Fwindows_client.py`** (Windows-side rollout integration \u002F debugging): uses the same session protocol and output format, but is packaged as a Windows-friendly client for action-aware GRPO workflows.\n\n## Training\n\nTraining is organized into two stages:\n\n- **Stage 1 (supervised)**: backbone finetuning + action decoder training.\n- **Stage 2 (action-aware GRPO)**: rollout collection + GRPO training.\n\n![WorldVLN framework](.\u002Fassets\u002Fframework.png)\n\n### Stage 1: Supervised Training\n\n#### Backbone Training\n\n#### Quick start\n\n```bash\nbash train\u002Fscripts\u002Ftrain_from_base.sh\n```\n\n#### Details\n\nThe backbone finetuning workflow is located under [train\u002F](.\u002Ftrain).\n\n- Entry point: [train\u002Fscripts\u002Ftrain_from_base.sh](.\u002Ftrain\u002Fscripts\u002Ftrain_from_base.sh)\n- Main trainer: [train\u002Ftrain.py](.\u002Ftrain\u002Ftrain.py)\n- Detailed guide: [train\u002FTRAINING.md](.\u002Ftrain\u002FTRAINING.md)\n\n#### Action Decoder Training\n\n#### Quick start (Stage A + Stage B)\n\n```bash\n# Stage A: adapter distillation\nbash train\u002Faction_decoder\u002Fscripts\u002Ftrain_stageA_ddp.sh\n\n# Stage B: latent-to-action training\nbash train\u002Faction_decoder\u002Fscripts\u002Ftrain_stageB_ddp.sh\n```\n\n#### Details\n\nThe action decoder source code is located under [Worldmodel\u002Faction_decoder\u002Fsrc\u002F](.\u002FWorldmodel\u002Faction_decoder\u002Fsrc).\n\nThe action decoder training entry points live under [train\u002Faction_decoder\u002F](.\u002Ftrain\u002Faction_decoder) and are organized into two steps (Stage A + Stage B):\n\n- Stage A adapter distillation: [train\u002Faction_decoder\u002Fscripts\u002Ftrain_stageA_ddp.sh](.\u002Ftrain\u002Faction_decoder\u002Fscripts\u002Ftrain_stageA_ddp.sh)\n- Stage B latent-to-action training: 
[train\u002Faction_decoder\u002Fscripts\u002Ftrain_stageB_ddp.sh](.\u002Ftrain\u002Faction_decoder\u002Fscripts\u002Ftrain_stageB_ddp.sh)\n- Main scripts: [train\u002Faction_decoder\u002Ftools\u002Ftrain_stageA_ddp.py](.\u002Ftrain\u002Faction_decoder\u002Ftools\u002Ftrain_stageA_ddp.py), [train\u002Faction_decoder\u002Ftools\u002Ftrain_stageB_ddp.py](.\u002Ftrain\u002Faction_decoder\u002Ftools\u002Ftrain_stageB_ddp.py)\n\nThis workflow trains the mapping from visual latent features to 6-DoF motion outputs.\n\nData contract (training manifest):\n\n```json\n{\n  \"items_train\": [\n    {\n      \"latent_path\": \"path\u002Fto\u002Flatents.pt\",\n      \"traj_json_path\": \"path\u002Fto\u002Fpreprocessed_logs.json\",\n      \"images_dir\": \"path\u002Fto\u002Fimages\"\n    }\n  ]\n}\n```\n\nStage A required environment variables:\n\n- `MANIFEST_JSON`\n- `TSFORMER_CKPT`\n- `INF_VAE_PATH`\n\nRun Stage A:\n\n```bash\nbash train\u002Faction_decoder\u002Fscripts\u002Ftrain_stageA_ddp.sh\n```\n\nStage B required environment variables:\n\n- `MANIFEST_JSON`\n- `TSFORMER_PRETRAINED`\n- `ADAPTER_CKPT`\n- `INFINITYSTAR_VAE_PATH`\n\nRun Stage B:\n\n```bash\nbash train\u002Faction_decoder\u002Fscripts\u002Ftrain_stageB_ddp.sh\n```\n\n### Stage 2: Action-aware GRPO\n\n#### Quick start (rollout + train)\n\nStart the local inference service used by rollout:\n\n```bash\nINFINITY_CKPT=\u002Fpath\u002Fto\u002Finfinity\u002Fglobal_step_xxx.pth \\\nCHECKPOINTS_DIR=\u002Fpath\u002Fto\u002Fcheckpointsinf \\\nACTIONHEAD_CKPT=\u002Fpath\u002Fto\u002Factionhead\u002Fcheckpoint_last.pth \\\nACTIONHEAD_RUN_CONFIG=\u002Fpath\u002Fto\u002Factionhead\u002Frun_config.json \\\nbash action_aware_grpo\u002Frun_infer_server.sh\n```\n\nRun rollout collection:\n\n```bash\nunset ALL_PROXY all_proxy\nexport NO_PROXY=127.0.0.1,localhost\n\nSRC_JSON=\u002Fpath\u002Fto\u002Freference_video_full_49f_trajectory_prompts.json \\\nINFINITY_CKPT=\u002Fpath\u002Fto\u002Finfinity\u002Fglobal_step_xxx.pth 
\\\nCHECKPOINTS_DIR=\u002Fpath\u002Fto\u002Fcheckpointsinf \\\nACTIONHEAD_CKPT=\u002Fpath\u002Fto\u002Factionhead\u002Fcheckpoint_last.pth \\\nACTIONHEAD_RUN_CONFIG=\u002Fpath\u002Fto\u002Factionhead\u002Frun_config.json \\\nCUDA_VISIBLE_DEVICES=0 \\\nGRPO_LOCAL_GPU_IDS=0 \\\nNPROC_PER_NODE=1 \\\nNNODES=1 \\\nNODE_RANK=0 \\\nUAVFLOW_STAGEA_ROLLOUT_BACKEND=remote_sim \\\nUAVFLOW_SIMULATOR_BASE_URL=http:\u002F\u002F127.0.0.1:18765 \\\nUAVFLOW_SIMULATOR_TIMEOUT_S=120 \\\nUAVFLOW_TASK_JSON_ROOT=\u002Fpath\u002Fto\u002FUAV-Flow-Eval\u002Ftest_jsons \\\nbash action_aware_grpo\u002Fscripts\u002Frun_stagea_collect.sh RUN_ID=remote_sim_smoke TOP_N=1 K_CAND=1 STAGEA_NPROC=1 STAGEA_PROGRESS_EVERY_N=1\n```\n\nRun train (partial-freeze optimization):\n\n```bash\nCHECKPOINTS_DIR=\u002Fpath\u002Fto\u002Fcheckpointsinf \\\nRUSH_RESUME=\u002Fpath\u002Fto\u002Finfinity\u002Fglobal_step_xxx.pth \\\nREPLAY_META_DIR=\u002Fpath\u002Fto\u002Freplay_meta_rollout_smoke \\\nbash action_aware_grpo\u002Fscripts\u002Frun_stageb_partialfreeze.sh PARTIAL_FREEZE_MODE=smoke RUN_ID=stageb_smoke\n```\n\n#### Details\n\nThe action-aware GRPO workflow is located under [action_aware_grpo\u002F](.\u002Faction_aware_grpo) and is organized into two steps: **rollout** and **train**.\n\n- Server entry point: [action_aware_grpo\u002Fgrpo_server.py](.\u002Faction_aware_grpo\u002Fgrpo_server.py)\n- Windows-side client (for action-aware GRPO rollout integration \u002F debugging): [action_aware_grpo\u002Fwindows_client.py](.\u002Faction_aware_grpo\u002Fwindows_client.py)\n- Rollout collection: [action_aware_grpo\u002Fscripts\u002Frun_stagea_collect.sh](.\u002Faction_aware_grpo\u002Fscripts\u002Frun_stagea_collect.sh)\n- Train (partial-freeze optimization): [action_aware_grpo\u002Fscripts\u002Frun_stageb_partialfreeze.sh](.\u002Faction_aware_grpo\u002Fscripts\u002Frun_stageb_partialfreeze.sh)\n- Remote simulator service wrapper: 
[action_aware_grpo\u002Fscripts\u002Frun_remote_sim_service.sh](.\u002Faction_aware_grpo\u002Fscripts\u002Frun_remote_sim_service.sh)\n- Local inference launcher used by rollout: [action_aware_grpo\u002Frun_infer_server.sh](.\u002Faction_aware_grpo\u002Frun_infer_server.sh)\n\nAt a high level:\n\n- Rollout consumes rollout sources and model assets, then generates rollout caches and replay metadata.\n- Train consumes replay metadata and runs optimization to produce updated checkpoints and logs.\n\nFor simulator-backed rollout details, see [action_aware_grpo\u002Fdocs\u002Fremote_sim.md](.\u002Faction_aware_grpo\u002Fdocs\u002Fremote_sim.md).\n\n## Acknowledgement\n\nWe sincerely thank the following projects for their exceptional effort: [InfinityStar](https:\u002F\u002Fgithub.com\u002FFoundationVision\u002FInfinityStar), [TSformer-VO](https:\u002F\u002Fgithub.com\u002Faofrancani\u002FTSformer-VO).\n\n## Citation\n\nIf you find this work useful, please cite the WorldVLN paper:\n\n```bibtex\n@misc{zhao2026worldvln,\n      title={WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation}, \n      author={Baining Zhao and Jiacheng Xu and Weicheng Feng and Xin Zhang and Zhaolu Wang and Haoyang Wang and Shilong Ji and Ziyou Wang and Jianjie Fang and Zhiheng Zheng and Weichen Zhang and Yu Shang and Wei Wu and Chen Gao and Xinlei Chen and Yong Li},\n      year={2026},\n      eprint={2605.15964},\n      archivePrefix={arXiv},\n      primaryClass={cs.RO},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.15964}, \n}\n```\n\n## License\n\nThis project is released under the **CC BY 4.0** license. See `LICENSE`.\n","WorldVLN is an autoregressive world action model for aerial vision-language navigation. The project is implemented in Python. Its core features are autoregressive inference for closed-loop action prediction and a two-stage training pipeline: (1) supervised training of the backbone and action decoder, and (2) action-aware GRPO-based optimization. WorldVLN relies on the PyTorch framework and provides pretrained model weights for download. The project is well suited to researchers and developers who need to control UAVs with natural-language instructions to complete specific tasks in complex environments, in scenarios such as indoor UAV navigation and search and rescue.","2026-06-11 03:59:14","CREATED_QUERY"]