[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76090":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},76090,"onevl","xiaomi-research\u002Fonevl","xiaomi-research",null,"Python",424,47,3,2,0,5,25,92,15,5.04,false,"main",true,[],"2026-06-12 02:03:39","\u003Cdiv align=\"center\">\n\n# \u003Cimg src=\"assets\u002Fonevl_logo_new.png\" alt=\"OneVL Logo\" height=\"48\" style=\"vertical-align:middle\"\u002F> OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanations\n\n[![Tech Report](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTech%20Report-arXiv-red?style=flat-square&logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.18486\u002F)\n[![Project Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject%20Page-blue?style=flat-square&logo=googlechrome)](https:\u002F\u002Fxiaomi-embodied-intelligence.github.io\u002FOneVL\u002F)\n[![Model Weights](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Weights-HuggingFace-yellow?style=flat-square&logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fxiaomi-research\u002Fonevl-models\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green?style=flat-square)](LICENSE)\n\n\u003C\u002Fdiv>\n\n---\n\n## Overview\n\n**OneVL** is a Vision-Language-Action (VLA) framework for autonomous driving that achieves **state-of-the-art 
trajectory prediction accuracy** with **inference latency matching answer-only AR models**. It overcomes the fundamental limitations of prior latent Chain-of-Thought (CoT) methods by introducing dual-modal auxiliary decoders that supervise compact latent tokens to encode both linguistic reasoning and future scene dynamics.\n\n### Three CoT Paradigms\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Fcomparison.png\" alt=\"Comparison of three CoT paradigms\" width=\"90%\"\u002F>\n\u003C\u002Fdiv>\n\n> **(a) Explicit CoT** generates a full reasoning chain before the answer — interpretable but slow. **(b) Implicit CoT** compresses reasoning into opaque latent vectors — fast but not interpretable. **(c) OneVL (ours)** uses visual latent tokens `v` and language latent tokens `l`; during training, dual auxiliary decoders decode these into future frames and CoT text respectively. At inference, decoders are discarded and latents are **prefilled** into the prompt — matching the speed of (b) while recovering the interpretability of (a) in both vision and language.\n\n### Architecture\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Fframework.png\" alt=\"OneVL architecture\" width=\"90%\"\u002F>\n\u003C\u002Fdiv>\n\n> During training, hidden states at visual latent positions are routed to the **Visual Aux. Decoder** (predicts future-frame visual tokens at t+0.5s and t+1.0s) and at language latent positions to the **Language Aux. Decoder** (reconstructs CoT text). 
Both decoders are discarded at inference; all latent tokens are **prefilled** into the prompt, matching answer-only AR prediction latency.\n\nOneVL augments **Qwen3-VL-4B-Instruct** with:\n\n- **Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens placed in the assistant response before the answer, using existing vocabulary tokens (no new special tokens).\n- **Visual Auxiliary Decoder** — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (Emu3.5 IBQ, 131k codebook), acting as a **world model** supervision signal.\n- **Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features.\n- **Prefill Inference** — Both decoders are discarded at inference; latent tokens are processed in one parallel pass with only the trajectory generated autoregressively.\n\n### Key Innovations\n\n- **Dual-Modal Auxiliary Decoders**: A *language auxiliary decoder* reconstructs human-readable CoT reasoning from language latent tokens; a *visual auxiliary decoder* predicts future scene frames from visual latent tokens, acting as a **world model** that grounds the latents in physical scene dynamics.\n- **Prefill Inference**: All latent tokens are prefilled into the prompt context in a single parallel pass — **1.5× faster than explicit CoT on NAVSIM, 2.3× faster on ROADWork** — with latency essentially identical to answer-only AR prediction.\n- **Compression Drives Generalization**: OneVL is the **only latent CoT method that outperforms explicit autoregressive CoT** across all four benchmarks.\n\n---\n\n## Open-Source Status\n\n| Component | Status |\n|-----------|--------|\n| 📄 Technical Report | ✅ [Tech report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.18486) |\n| ⚖️ Model Weights | ✅ [Weights](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fxiaomi-research\u002Fonevl-models) |\n| 🔍 Inference Code | ✅ 
[Code](https:\u002F\u002Fgithub.com\u002Fxiaomi-research\u002Fonevl)|\n| 🏋️ Training Code | ✅ [Code](https:\u002F\u002Fgithub.com\u002FGeorgeLuImmortal\u002FOneVL_training\u002Ftree\u002Fmain) |\n\n---\n\n## Results\n\n### Accuracy–Efficiency Pareto (NAVSIM & ROADWork)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Fteaser_bar.png\" alt=\"Teaser: Accuracy-Efficiency Pareto across benchmarks\" width=\"90%\"\u002F>\n\u003C\u002Fdiv>\n\n> OneVL lands in the **green-shaded optimal corner** (lowest latency, best metric) on both benchmarks. All prior latent CoT methods (COCONUT, CODI, SIM-CoT) underperform even the AR Answer baseline on driving tasks — a critical failure that OneVL overcomes.\n\n### NAVSIM — Full Comparison\n\n| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |\n|--------|:----------:|:-----------:|:-------------:|:----------------:|\n| AdaThinkDrive | 8B | 86.20 | — | Language |\n| LaST-VLA | 8B | 87.30 | — | — |\n| AR Answer | 4B | 87.47 | \u003Cu>4.49\u003C\u002Fu> | — |\n| AR CoT+Answer | 4B | \u003Cu>88.29\u003C\u002Fu> | 6.58 | Language |\n| COCONUT | 4B | 84.84 | 5.93 | — |\n| CODI | 4B | 83.92 | 8.62 | — |\n| SIM-CoT | 4B | 84.21 | 10.86 | Language |\n| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |\n\n### ROADWork — Full Comparison\n\n| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ | Interpretability |\n|--------|:----------:|:----------:|:-------------:|:----------------:|\n| YNet | 22.68 | 80.78 | — | — |\n| AR Answer | 15.98 | 40.29 | \u003Cu>4.74\u003C\u002Fu> | — |\n| AR CoT+Answer | \u003Cu>13.18\u003C\u002Fu> | \u003Cu>29.98\u003C\u002Fu> | 10.74 | Language |\n| COCONUT | 15.44 | 38.60 | 6.06 | — |\n| CODI | 16.45 | 44.28 | 6.73 | — |\n| SIM-CoT | 16.49 | 44.32 | 6.19 | Language |\n| **OneVL** | **12.49** | **28.80** | **4.71** | **Vision + Language** |\n\n### Impromptu — Full Comparison\n\n| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ | Interpretability 
|\n|--------|:---------:|:---------:|:-------------:|:----------------:|\n| Impromptu VLA | 1.60 | 4.28 | 6.10 | — |\n| AR Answer | 1.46 | 4.03 | \u003Cu>4.24\u003C\u002Fu> | — |\n| AR CoT+Answer | \u003Cu>1.42\u003C\u002Fu> | \u003Cu>3.96\u003C\u002Fu> | 6.84 | Language |\n| COCONUT | 1.49 | 4.07 | 5.27 | — |\n| CODI | 1.86 | 5.18 | 5.24 | — |\n| SIM-CoT | 2.43 | 6.10 | 5.09 | Language |\n| **OneVL** | **1.34** | **3.70** | **4.02** | **Vision + Language** |\n\n### APR1 — Full Comparison\n\n| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ | Interpretability |\n|--------|:---------:|:---------:|:-------------:|:----------------:|\n| Cosmos-Reason | \u003Cu>2.86\u003C\u002Fu> | **7.42** | — | Language |\n| AR Answer | 3.27 | 9.59 | 3.06 | — |\n| AR CoT+Answer | 2.99 | 8.54 | 3.51 | Language |\n| COCONUT | 3.29 | 9.48 | 3.76 | — |\n| CODI | 3.22 | 9.25 | 3.85 | — |\n| SIM-CoT | 3.40 | 9.85 | 3.78 | Language |\n| **OneVL** | **2.62** | \u003Cu>7.53\u003C\u002Fu> | **3.26** | **Vision + Language** |\n\n### Text CoT Quality (NAVSIM)\n\n| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Avg. ↑ | Latency (s) ↓ |\n|--------|:-----------------:|:-----------:|:-----------:|:------:|:------:|\n| AR CoT+Answer | 73.20 | 79.75 | 81.86 | **78.27** | \u003Cu>6.58\u003C\u002Fu> |\n| SIM-CoT | 67.20 | 76.25 | 78.73 | 74.06 | 10.86 |\n| **OneVL** (lang. aux.) | 71.00 | 78.26 | 79.13 | \u003Cu>76.13\u003C\u002Fu> | **4.46** |\n\nOneVL's language auxiliary decoder recovers 97% of explicit CoT quality while running at answer-only speed.\n\n### Ablation Study (NAVSIM PDM-score)\n\n| Model Variant | Lang. Aux. Dec. | Vis. Aux. Dec. | Staged Train | PDM-score ↑ |\n|---------------|:---------------:|:--------------:|:------------:|:-----------:|\n| OneVL w\u002Fo vis. dec. | ✓ | — | ✓ | 87.97 |\n| OneVL w\u002Fo lang. dec. 
| — | ✓ | ✓ | 88.53 |\n| OneVL w\u002Fo staged train | ✓ | ✓ | — | 67.13 |\n| **OneVL (full)** | **✓** | **✓** | **✓** | **88.84** |\n\nBoth auxiliary decoders contribute measurably; staged training is essential (without it, performance collapses to 67.13).\n\n---\n\n## Qualitative Examples\n\n### NAVSIM\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Fnavsim_example1.png\" alt=\"NAVSIM qualitative example\" width=\"95%\"\u002F>\n\u003C\u002Fdiv>\n\n> Each plot overlays ground-truth (green) and predicted (red) trajectories on the front camera view, along with predicted future frames at t+0.5s and t+1.0s decoded from the visual auxiliary decoder, and the language CoT from the language auxiliary decoder.\n\n### ROADWork (Construction Zone Navigation)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Froadwork_example1.png\" alt=\"ROADWork qualitative example\" width=\"95%\"\u002F>\n\u003C\u002Fdiv>\n\n---\n\n## Environment Setup\n\n**Requirements:** Python 3.10+, CUDA GPU (≥16 GB VRAM recommended for inference with aux decoders).\n\n```bash\n# 1. Create and activate virtual environment\nuv venv venv\u002Fonevl --python 3.12\nsource venv\u002Fonevl\u002Fbin\u002Factivate\n\n# 2. 
Install dependencies\npip install -r requirements.txt\n```\n\nCore packages (`requirements.txt`):\n\n```\ntorch==2.10.0\ntorchvision==0.25.0\ntransformers==4.57.0\nsafetensors==0.7.0\nPillow>=10.0.0\nomegaconf>=2.3.0\neinops>=0.7.0\nnumpy>=1.24.0\n```\n\n> **Note:** `transformers ≥ 4.57.0` is required for `Qwen3VLForConditionalGeneration` support.\n\n---\n\n## Inference\n\n### Quick Start (Single GPU)\n\n```bash\nsource venv\u002Fonevl\u002Fbin\u002Factivate\n\n# Trajectory prediction only (fastest, prefill inference)\npython infer_onevl.py \\\n    --model_path \u002Fpath\u002Fto\u002FOneVL-checkpoint \\\n    --test_set_path test_data\u002Fnavsim_test.json \\\n    --image_base_path \"\" \\\n    --output_path output\u002Fnavsim\u002Fresults.json \\\n    --device cuda:0 \\\n    --num_latent 2 --num_latent_vis 4 \\\n    --max_new_tokens 1024 --answer_prefix \"[\" --prefix_k 0\n\n# With language explanation (text CoT from language aux decoder)\npython infer_onevl.py \\\n    --model_path \u002Fpath\u002Fto\u002FOneVL-checkpoint \\\n    --test_set_path test_data\u002Fnavsim_test.json \\\n    --image_base_path \"\" \\\n    --output_path output\u002Fnavsim\u002Fresults_explain.json \\\n    --device cuda:0 \\\n    --num_latent 2 --num_latent_vis 4 \\\n    --max_new_tokens 1024 --answer_prefix \"[\" --prefix_k 0 \\\n    --decoder_explain --aux_visual_condition \\\n    --c_thought 2 --max_explain_tokens 1024\n\n# With both language + visual explanation (text CoT + future frame tokens)\npython infer_onevl.py \\\n    --model_path \u002Fpath\u002Fto\u002FOneVL-checkpoint \\\n    --test_set_path test_data\u002Fnavsim_test.json \\\n    --image_base_path \"\" \\\n    --output_path output\u002Fnavsim\u002Fresults_explain.json \\\n    --device cuda:0 \\\n    --num_latent 2 --num_latent_vis 4 \\\n    --max_new_tokens 1024 --answer_prefix \"[\" --prefix_k 0 \\\n    --decoder_explain --aux_visual_condition \\\n    --c_thought 2 --max_explain_tokens 1024 \\\n    --visual_decoder_explain 
--visual_aux_visual_condition \\\n    --c_thought_visual 4 --max_visual_tokens 2560\n```\n\n### Multi-GPU Inference (recommended for full test sets)\n\n```bash\nexport MODEL_PATH=\u002Fpath\u002Fto\u002FOneVL-checkpoint\nexport TEST_SET_PATH=test_data\u002Fnavsim_test.json\nexport OUTPUT_PATH=output\u002Fnavsim\u002Fnavsim_results.json\n\nbash run_infer.sh\n```\n\nThe launcher auto-detects available GPUs, shards the test set, runs inference in parallel across all GPUs, and merges results.\n\n### Per-Benchmark Scripts\n\n```bash\nbash scripts\u002Finfer_navsim.sh       # NAVSIM\nbash scripts\u002Finfer_ar1.sh          # APR1 (trajectory only)\nbash scripts\u002Finfer_roadwork.sh     # ROADWork\nbash scripts\u002Finfer_impromptu.sh    # Impromptu\n```\n\n### Visual and Text CoT Explanations\n\n```bash\nbash scripts\u002Finfer_ar1_explain.sh  # APR1 (language + visual explanations; APR1 is used as the example)\n```\n\n### Evaluation\n\nAR1, Impromptu, and ROADWork can be evaluated directly with the bundled evaluation script:\n\n```bash\n# AR1\npython eval_results.py ar1 \\\n    --results_json output\u002Far1\u002Far1_results.json \\\n    --test_jsonl test_data\u002Far1_test.jsonl\n\n# Impromptu\npython eval_results.py impromptu \\\n    --results_json output\u002Fimpromptu\u002Fimpromptu_results.json \\\n    --test_jsonl test_data\u002Fimpromptu_test.jsonl\n\n# ROADWork\npython eval_results.py roadwork \\\n    --json_path output\u002Froadwork\u002Froadwork_results.json\n```\n\nNAVSIM uses the official NAVSIM evaluation pipeline. 
First convert OneVL inference results to the NAVSIM test format, then evaluate the converted file with the [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim) codebase:\n\n```bash\npython output\u002Fnavsim\u002Fconvert_to_eval.py \\\n    --input_path output\u002Fnavsim\u002Fnavsim_results.json \\\n    --ref_path output\u002Fnavsim\u002Fnavsim_results_eval.json \\\n    --output_path output\u002Fnavsim\u002Fnavsim_results_for_eval.json\n```\n\n\n---\n\n## Visualizing Future-Frame Predictions\n\nAfter running inference with `--visual_decoder_explain`, the output JSON contains `visual_decoder_explain` fields encoding predicted future-frame visual tokens. Use the visualization script to decode them back to images:\n\n```bash\nsource venv\u002Fonevl\u002Fbin\u002Factivate\n\npython scripts\u002Fvisualize_predict_image_tokens.py \\\n    --predict_json output\u002Far1_explain\u002Far1_results_explain.json \\\n    --out_dir output\u002Far1_explain_visualize \\\n    --model_root \u002Fpath\u002Fto\u002Femu35_model_root \\\n    -n 20 \\\n    --device cuda:0\n```\n\n**Output layout per sample:**\n\n```\noutput\u002Far1_explain_visualize\u002F\n└── sample_0000\u002F\n    ├── input_00.jpg                  # original camera frame(s)\n    ├── input_01.jpg\n    ├── ...\n    ├── decoded_from_tokens_00.png    # predicted future frame at t+0.5s\n    ├── decoded_from_tokens_01.png    # predicted future frame at t+1.0s\n    └── meta.json                     # CoT text + metadata\n```\n\nThe script uses the self-contained `vq_decoder\u002F` module (bundled Emu3.5 IBQ VQ-VAE) — no external Emu3.5 repo dependency required.\n\n`--model_root` must contain `Emu3.5-VisionTokenizer\u002Fconfig.yaml` and `Emu3.5-VisionTokenizer\u002Fmodel.ckpt`. 
Download from [BAAI\u002FEmu3.5-VisionTokenizer](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3.5-VisionTokenizer).\n\n---\n\n## Test Data Format\n\n### JSON array (NAVSIM, ROADWork)\n\n```json\n[\n  {\n    \"messages\": [{\"role\": \"user\", \"content\": \"\u003Cimage>Based on the current image, predict ...\"}],\n    \"images\": [\"path\u002Fto\u002Fframe.jpg\"],\n    \"GT\": \"[[1.0, 0.0], [2.5, 0.1], ...]\"\n  }\n]\n```\n\n### JSONL (APR1, Impromptu)\n\nOne JSON object per line, same schema as above.\n\n---\n\n**Environment variables** accepted by all scripts:\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `MODEL_PATH` | *(required)* | Path to the OneVL checkpoint |\n| `TEST_SET_PATH` | *(required)* | Test JSON \u002F JSONL file |\n| `OUTPUT_PATH` | `\u003CMODEL_PATH>\u002Finfer_results\u002Fonevl_merged.json` | Where to write merged results |\n| `IMAGE_BASE_PATH` | `\"\"` | Prepended to relative image paths |\n| `NUM_LATENT` | `2` | Number of language latent tokens |\n| `NUM_LATENT_VIS` | `4` | Number of visual latent tokens |\n| `MAX_NEW_TOKENS` | `1024` | Max answer tokens to generate |\n| `ANSWER_PREFIX` | `\"\"` | Prefix after `\u003Canswer>` (e.g. 
`[` for NAVSIM, `[[` for APR1) |\n| `PREFIX_K` | `0` | Number of leading GT waypoints to prefill after `\u003Canswer>`; only used on ROADWork |\n| `DECODER_EXPLAIN` | `false` | Enable language auxiliary decoder |\n| `AUX_VISUAL_CONDITION` | `true` | *(if DECODER_EXPLAIN=true)* Condition language aux decoder on ViT features (`--aux_visual_condition`) |\n| `C_THOUGHT` | `2` | *(if DECODER_EXPLAIN=true)* Number of latent tokens read by language aux decoder |\n| `MAX_EXPLAIN_TOKENS` | `1024` | *(if DECODER_EXPLAIN=true)* Max tokens generated by language aux decoder |\n| `VISUAL_DECODER_EXPLAIN` | `false` | Enable visual auxiliary decoder |\n| `VISUAL_AUX_VISUAL_CONDITION` | `true` | *(if VISUAL_DECODER_EXPLAIN=true)* Condition visual aux decoder on ViT features (`--visual_aux_visual_condition`) |\n| `C_THOUGHT_VISUAL` | `4` | *(if VISUAL_DECODER_EXPLAIN=true)* Number of latent tokens read by visual aux decoder |\n| `MAX_VISUAL_TOKENS` | `2560` | *(if VISUAL_DECODER_EXPLAIN=true)* Max visual tokens generated by visual aux decoder |\n\n---\n\n## Citation\n\nIf you find this work useful, please cite:\n\n```bibtex\n@article{lu2026onevl,\n  title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},\n  author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},\n  journal={arXiv preprint arXiv:2604.18486},\n  year={2026},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.18486}\n}\n```\n\n---\n\n## License\n\nThis project is released under the [Apache 2.0 License](LICENSE).\n\nModel weights are built on [Qwen3-VL-4B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3.5-VisionTokenizer); please refer to their respective licenses as well.\n\n---\n\n## Acknowledgements\n\n- 
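[Qwen3-VL](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-VL) — backbone VLM\n- [Emu3.5](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu3) — IBQ visual tokenizer\n- [AdaThinkDrive](https:\u002F\u002Fgithub.com\u002Fluo-yc17\u002FAdaThinkDrive\u002Ftree\u002Fmain) — NAVSIM CoT annotations\n- [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim), [ROADWork](https:\u002F\u002Fgithub.com\u002Fanuragxel\u002Froadwork-dataset), [Impromptu](https:\u002F\u002Fgithub.com\u002Fahydchh\u002FImpromptu-VLA) — evaluation benchmarks\n","OneVL is a Vision-Language-Action (VLA) framework for autonomous driving. It introduces dual-modal auxiliary decoders that supervise compact latent tokens to encode both linguistic reasoning and future scene dynamics, achieving state-of-the-art trajectory prediction accuracy with inference latency on par with answer-only models. Its core technical feature is combining the interpretability of explicit chain-of-thought (CoT) with the speed of implicit CoT: it uses visual latent tokens `v` and language latent tokens `l`; during training, the visual auxiliary decoder predicts future frames and the language auxiliary decoder reconstructs the CoT text, while at inference the decoders are removed and the latent tokens are prefilled into the prompt to speed up processing. It is suited to autonomous driving scenarios that require understanding the environment and making decisions efficiently and accurately.","2026-06-11 03:54:28","CREATED_QUERY"]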