[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1338":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":13,"stars7d":13,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":16,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":25,"discoverSource":26},1338,"MOSS-VL","OpenMOSS\u002FMOSS-VL","OpenMOSS","MOSS-VL is the core multimodal model series within the OpenMOSS ecosystem, dedicated to visual understanding.",null,"Python",258,4,2,0,12,6,2.1,"Apache License 2.0",false,"main",[],"2026-06-12 02:00:26","\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Flogo.png\" width=\"300\"\u002F>\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-VL\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGithub-Star-yellow?logo=Github&amp\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenMOSS-Team\u002Fmoss-vl\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Download-orange?logo=Huggingface&amp\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMOSS-Team\u002FMOSS-VL\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-space-orange?logo=Huggingface\" alt=\"MOSS-VL-space\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fcollections\u002Fopenmoss\u002FMOSS-VL\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Download-blue?logo=ModelScope\" alt=\"ModelScope\">\u003C\u002Fa>\n    \u003Cbr>\n    \u003Ca href=\"https:\u002F\u002FOpenMOSS.github.io\u002FMOSS-VL-Demo\u002F#\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-View-blue?logo=Website&amp\">\u003C\u002Fa>\n    \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-Coming%20Soon-red?logo=Arxiv\">\u003C\u002Fa>\n    \u003Ca href=\"assets\u002Fwechat.jpg\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWechat-Join-green?logo=wechat&amp\">\u003C\u002Fa>\n    \u003Ca href=\".\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\" alt=\"license\">\u003C\u002Fa>\n\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\".\u002FREADME.md\">\u003Cb>English\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\".\u002FREADME_zh.md\">\u003Cb>中文\u003C\u002Fb>\u003C\u002Fa>\n\u003C\u002Fp>\n\n## MOSS-VL\n\n**MOSS-VL** is the core multimodal model series within the OpenMOSS ecosystem, dedicated to advancing visual understanding. 
To tackle the inherent complexities of video comprehension, our roadmap pursues a systematic scaling strategy along three key dimensions:     \n\n- 📈 **Data Scaling**: Curating massive-scale, high-quality multimodal datasets to drive robust generalization.\n- 🧠 **Parameter Scaling**: Expanding model capacity to capture intricate vision-language correlations.\n- ⏳ **Context Scaling**: Extending temporal horizons to enable reasoning over long-form video content.\n\n---\n\n## 📌 Table of Contents\n- [🔥 News](#-news)\n- [🏗️ Model Architecture](#️-model-architecture)\n- [🧩 Absolute Timestamps](#-absolute-timestamps)\n- [🧬 Cross-attention RoPE (XRoPE)](#-cross-attention-rope-xrope)\n- [🎬 Demo](#-demo)\n- [📊 Training Strategy](#-training-strategy)\n- [📊 Evaluation Results](#-evaluation-results)\n- [🚀 Quick Start](#-quick-start)\n- [📥 Model Download](#-model-download)\n- [🖥️ SGLang](#️-sglang)\n- [📑 Roadmap & TODO List](#-roadmap--todo-list)\n- [📜 Citation](#-citation)\n\n---\n\n## 🔥 News\n- **2026\u002F04\u002F24**: 🚀 SGLang officially supports MOSS-VL — see [sgl-project\u002Fsglang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang).\n- **2026\u002F04\u002F22**: 🚀 Released SGLang-based inference support for MOSS-VL. See [`.\u002Fsglang\u002F`](.\u002Fsglang\u002F).\n- **2026\u002F04\u002F22**: 🤗 Updated HuggingFace inference code to the latest version. See [MOSS-VL-Base-0408](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-VL-Base-0408) and [MOSS-VL-Instruct-0408](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-VL-Instruct-0408).\n- **2026\u002F04\u002F08**: 🚀 Released MOSS-VL-Base-0408 and MOSS-VL-Instruct-0408.\n- **2026\u002F04\u002F03**: 🏆 Finished both pre-training and SFT for MOSS-VL.\n- **2025\u002F10\u002F18**: 🔍 Kicked off the MOSS-VL project. 
\n- **2025\u002F09\u002F30**: ✨ Finished training [MOSS-Video-Preview](https:\u002F\u002Fgithub.com\u002Ffnlp-vision\u002FMOSS-Video-Preview) .\n\n## 🏗️ Model Architecture\n**MOSS-VL** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design significantly reduces latency, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline — eliminating the need for heavy pre-processing.\n    \n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Fstructure.png\" alt=\"MOSS-VL Architecture\" width=\"90%\"\u002F>\n    \u003Cbr>\n    \u003Cem>Figure 1: Overall architecture of MOSS-VL.\u003C\u002Fem>\n\u003C\u002Fp>\n\n---\n\n## 🧩 Absolute Timestamps\n\nTo ensure the model accurately perceives the pacing and duration of events, **MOSS-VL** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.\n\n### 📥 Input Representation\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Ftimestamp_input.svg\" alt=\"Timestamped Sequence Input Illustration\" width=\"90%\"\u002F>\n    \u003Cbr>\n    \u003Cem>Figure 2: Illustration of timestamped video sequence input.\u003C\u002Fem>\n\u003C\u002Fp>\n\nEach video is interleaved with **precise time markers**, where each timestamp is wrapped by **dedicated special tokens** (`\u003C|time_start|>` … `\u003C|time_end|>`) that explicitly anchor the **temporal location** of every visual frame:\n\n```text\n\u003C|im_start|>\u003C|vision_start|>\n\u003C|time_start|>0.0 seconds\u003C|time_end|>\u003C|image_pad|>\n\u003C|time_start|>1.2 seconds\u003C|time_end|>\u003C|image_pad|>\n\u003C|time_start|>2.3 seconds\u003C|time_end|>\u003C|image_pad|>\n...\n\u003C|vision_end|>The video shows a dynamic scene with continuous actions...\u003C|im_end|>\n```\n\n**🌟 Why this matters:**\n- 
**Adaptability to Variable FPS:** The use of explicit timestamps allows the model to handle non-uniform sampling rates without loss of temporal context.\n- **Precise Temporal Analysis:** Absolute time unlocks fine-grained action localization, grounding every response in exact temporal coordinates. \n- **Motion Dynamics:** By exposing time intervals ($dt$), the model can reason about movement physics, enabling accurate estimation of velocity, acceleration, and trajectory.\n\n---\n\n## 🧬 Cross-attention RoPE (XRoPE)\n\nMOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention based vision–language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).\n\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002F3d-rope.png\" alt=\"MOSS-VL mRoPE Architecture Illustration\" width=\"80%\"\u002F>\n    \u003Cbr>\n    \u003Cem> Figure 3: MOSS-VL with Cross-attention RoPE (XRoPE).\u003C\u002Fem>\n\u003C\u002Fp>\n\nTo optimize cross-modal alignment, **XRoPE** is injected into the vision **Key (K)** for position-awareness while leaving the **Value (V)** untouched to preserve feature fidelity. 
In parallel, it is applied to the text **Query (Q)**, allowing the model to probe arbitrary spatio-temporal regions through direct coordinate alignment.\n\n**🌟 Why this matters**\n\n- **Unified Modality Modeling** — By expressing time as a shared dimension across both language and video, **XRoPE** enables seamless, cohesive video-text reasoning within a single coordinate system.\n- **Precise Grounding** — Aligned ($t, h, w$) coordinates empower the model to localize small objects and transient actions anywhere in the 3D video volume — down to the patch and the moment.\n- **Dynamic Input Support** — The 3D grid natively accommodates arbitrary aspect ratios and resolutions, eliminating the need for fixed-length padding or rigid input constraints.\n\n\n---\n\n## 🎬 Demo\n\n\u003Cdiv align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgist.github.com\u002Fuser-attachments\u002Fassets\u002F66406aaa-f09f-412c-87b1-97753895ef1f\n\" width=\"70%\" poster=\"\" controls>\u003C\u002Fvideo>\n\u003Cvideo src=\"https:\u002F\u002Fgist.github.com\u002Fuser-attachments\u002Fassets\u002Fd1ccae33-472f-4d92-96c4-fb6253b07189\n\" width=\"70%\" poster=\"\" controls>\u003C\u002Fvideo>\n  \u003Cp align=\"center\">\n    For more examples, please visit our \u003Ca href=\"https:\u002F\u002FOpenMOSS.github.io\u002FMOSS-VL-Demo\u002F#\u002F\">Interactive Demo Page\u003C\u002Fa> 🚀\u003Cbr\u002F>\n    HuggingFace Demo: \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMOSS-Team\u002FMOSS-VL\">HuggingFace Space\u003C\u002Fa> 🌐\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## 📊 Training Strategy\nMOSS-VL is trained using a multi-stage approach to progressively build multimodal capabilities.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Ftotal_data_distribution.png\" alt=\"MOSS-VL Training Data Distribution\" width=\"80%\"\u002F>\n    \u003Cbr>\n    \u003Cem>Figure 4: Overall training data distribution of MOSS-VL.\u003C\u002Fem>\n\u003C\u002Fp>\n\n### 
Pre-training (PT)\nMOSS-VL is pre-trained via a systematic four-stage curriculum that progressively builds multimodal capabilities from the ground up:\n\n* **Stage 1 — Vision-Language Alignment** — Establishes the initial bridge between visual features and the language space. Trained on large-scale image-text pairs, the model learns to associate visual concepts with their textual counterparts while developing foundational OCR skills for text-in-image understanding.\n\n* **Stage 2 — Large-Scale Multimodal Pre-training** — Scales up exposure to massive, diverse multimodal corpora, broadening the model's grasp of world knowledge and complex scenes — laying a robust foundation for general-purpose intelligence and high-resolution perception. In addition, short video clips are introduced at this stage to seed preliminary video understanding.\n\n* **Stage 3 — High-Quality Multimodal Pre-training** — Elevates overall model quality by training on large volumes of high-quality perception, understanding, and reasoning data. 
This phase combines fine-grained image perception, complex multi-image comprehension, and high-fidelity video reasoning to sharpen the model's ability to capture intricate visual details and master temporal relationships across rich multimodal inputs.\n\n* **Stage 4 — Annealing & Long-Context Extrapolation** — Stretches the model's horizon toward long-form video understanding, while a carefully designed annealing schedule trains on curated, top-tier multimodal data to push final performance to its peak.\n\n| Stage | Strategy | Data Composition |\n| :--- | :--- | :--- |\n| **1** | **Vision-Language Alignment** | \u003Cimg src=\"assets\u002Fpt-stage1.png\" width=\"400\"\u002F> |\n| **2** | **Large-Scale Multimodal Pre-training** | \u003Cimg src=\"assets\u002Fpt-stage2.png\" width=\"400\"\u002F> |\n| **3** | **High-Quality Multimodal Pre-training** | \u003Cimg src=\"assets\u002Fpt-stage3.png\" width=\"400\"\u002F> |\n| **4** | **Annealing & Long-Context Extrapolation** | \u003Cimg src=\"assets\u002Fpt-stage4.png\" width=\"400\"\u002F> |\n\n\n### Supervised Fine-Tuning (SFT)\nBuilding on the pre-trained foundation, **MOSS-VL** is further refined through **Supervised Fine-Tuning (SFT)** to align with human intent and unlock its full interactive and instruction-following capabilities.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002FSFT.png\" alt=\"MOSS-VL SFT Data Composition\" width=\"50%\"\u002F>\n    \u003Cbr>\n    \u003Cem>Figure 5: Data composition of MOSS-VL SFT.\u003C\u002Fem>\n\u003C\u002Fp>\n\n\n### Reinforcement Learning from Human Feedback (RLHF)\n\n> [!NOTE]\n> MOSS-VL is currently undergoing RLHF training. Stay tuned for updates.\n\n\n---\n\n\n## 📊 Evaluation Results\nWe conducted a comprehensive evaluation of MOSS-VL across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document\u002FOCR, and Video Understanding. 
The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.\n\n### Overall Performance\n\nThe table below reports benchmark scores on a 0–100 scale. Across the board, MOSS-VL consistently ranks first or second when compared against industry-leading baselines such as Qwen2.5-VL and Qwen3-VL. \n\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002FMOSS-VL-Benchmark.png\" alt=\"MOSS-VL Benchmark Comparison\" width=\"90%\"\u002F>\n    \u003Cbr>\n    \u003Cem>Figure 6: Detailed benchmark comparison between MOSS-VL and Qwen series.\u003C\u002Fem>\n\u003C\u002Fp>\n\n### Key Highlights\n\n*   **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).\n*   **👁️ Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.\n*   **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `CVBench` and `VisuLogic`.\n*   **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception and video, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.\n\n### Benchmark Analysis\n\nThe chart below visualizes MOSS-VL's balanced and well-rounded capability profile across 30+ specialized benchmarks. 
Represented by the solid blue region, MOSS-VL achieves the broadest overall coverage, with particularly strong showings in the Video Understanding and Multimodal Perception quadrants.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fradar.png\" width=\"600px\" alt=\"MOSS-VL Evaluation Radar\">\n  \u003Cbr>\n  \u003Cem>Figure 7: Benchmark analysis of MOSS-VL.\u003C\u002Fem>\n\u003C\u002Fp>\n\n---\n\n\n## 🚀 Quick Start\n\n### Environment Setup\n```bash\nconda create -n moss_vl python=3.12 pip -y\nconda activate moss_vl\npip install -i https:\u002F\u002Fpypi.org\u002Fsimple --no-build-isolation -r requirements.txt\n```\n\n### Run Inference\n\nFor complete runnable examples and demo assets, see [`inference\u002FREADME.md`](inference\u002FREADME.md).\nInference supports full-modality offline queries, including pure text, single\u002Fmulti-image, single\u002Fmulti-video, and interleaved image-video inputs in the `messages` format.\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Single-query inference with \u003Ccode>offline_generate\u003C\u002Fcode>\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n```python\nimport queue\nimport threading\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoProcessor\n\ncheckpoint = \"\u002Fpath\u002Fto\u002Fdummy-checkpoint\"\n\nprocessor = AutoProcessor.from_pretrained(\n    checkpoint,\n    trust_remote_code=True,\n    frame_extract_num_threads=1,\n)\nmodel = AutoModelForCausalLM.from_pretrained(\n    checkpoint,\n    trust_remote_code=True,\n    device_map=\"auto\",\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n)\n\nquery = {\n    \"messages\": [\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"image\", \"image\": \"path\u002Fto\u002Fexample.jpg\"},\n                {\"type\": \"text\", \"text\": \"Describe this image.\"},\n            ],\n        }\n    ],\n    \"media_kwargs\": {},\n    \"generate_kwargs\": {\n       
 \"max_new_tokens\": 256,\n        \"do_sample\": False,\n        \"vision_chunked_length\": 64,\n    },\n}\n\ninput_queue = queue.Queue()\noutput_queue = queue.Queue()\nworker = threading.Thread(\n    target=model.offline_generate,\n    args=(processor, input_queue, output_queue),\n    kwargs={\"vision_chunked_length\": 64},\n    daemon=True,\n)\nworker.start()\n\ninput_queue.put(query)\ntext_chunks = []\nwhile True:\n    item = output_queue.get()\n    if item in {\"\u003C|round_start|>\"}:\n        continue\n    if item == \"\u003C|round_end|>\":\n        break\n    text_chunks.append(item)\n\nprint(\"\".join(text_chunks))\n\ninput_queue.put({\"stop_offline_generate\": True})\nworker.join()\n```\n\n\u003C\u002Fdetails>\n\nFor simple batched offline inference, you can also use `offline_batch_generate`:\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Batched inference with \u003Ccode>offline_batch_generate\u003C\u002Fcode>\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoProcessor\n\ncheckpoint = \"\u002Fpath\u002Fto\u002Fdummy-checkpoint\"\n\nprocessor = AutoProcessor.from_pretrained(\n    checkpoint,\n    trust_remote_code=True,\n    frame_extract_num_threads=1,\n)\nmodel = AutoModelForCausalLM.from_pretrained(\n    checkpoint,\n    trust_remote_code=True,\n    device_map=\"auto\",\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n)\n\nqueries = [\n    {\n        \"messages\": [\n            {\n                \"role\": \"user\",\n                \"content\": [{\"type\": \"text\", \"text\": \"Describe sample A.\"}],\n            }\n        ],\n        \"media_kwargs\": {},\n        \"generate_kwargs\": {\"max_new_tokens\": 256, \"do_sample\": False},\n    },\n    {\n        \"messages\": [\n            {\n                \"role\": \"user\",\n                \"content\": [{\"type\": \"text\", \"text\": \"Describe sample B.\"}],\n            }\n        
],\n        \"media_kwargs\": {},\n        \"generate_kwargs\": {\"max_new_tokens\": 256, \"do_sample\": False},\n    },\n]\n\nwith torch.no_grad():\n    result = model.offline_batch_generate(\n        processor,\n        queries,\n        vision_chunked_length=64,\n    )\n\ntexts = [item[\"text\"] for item in result[\"results\"]]\nprint(texts)\n```\n\n\u003C\u002Fdetails>\n\n### Run Fine-Tuning\n\nWe provide a lightweight SFT framework built on HuggingFace `transformers.Trainer`. It supports both full-parameter training and LoRA, with the vision encoder, language model, and LM head independently controllable.\n\n```bash\n# Full-parameter SFT (vision encoder frozen by default)\nbash mossvl_finetune\u002Fscripts\u002Frun_sft.sh\n\n# LoRA SFT\npip install -i https:\u002F\u002Fpypi.org\u002Fsimple peft\nbash mossvl_finetune\u002Fscripts\u002Frun_sft_lora.sh\n```\n\nTraining data uses a simple JSON format compatible with the inference query structure — just add a `response` field:\n\n```json\n[\n  {\n    \"prompt\": \"Describe this image.\",\n    \"response\": \"A beautiful landscape with mountains.\",\n    \"images\": [\"path\u002Fto\u002Fimage.jpg\"],\n    \"videos\": []\n  }\n]\n```\n\n**Multi-turn conversations are also supported.** See [`mossvl_finetune\u002FREADME.md`](mossvl_finetune\u002FREADME.md) for full documentation.\n\n---\n\n## 📥 Model Download \n\n| Model | 🤗Download Link | 🤖ModelScope Link |\n| :--- | :--- | :--- |\n| **MOSS-VL-Base-0408** | [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-VL-Base-0408) | [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-VL-Base-0408) |\n| **MOSS-VL-Instruct-0408** | [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-VL-Instruct-0408) | [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-VL-Instruct-0408) |\n\n## 🖥️ SGLang\n\nFor SGLang-based deployment and serving instructions, please refer to 
[`sglang\u002FREADME.md`](.\u002Fsglang\u002FREADME.md).\n\n\n---\n## 📑 Roadmap & TODO List\n\n### ✅ Milestones\n- [x] **Core Architecture:** Implementation of Cross-attention RoPE (XRoPE).\n- [x] **High-performance Infra:** Integrated Megatron-LM + CUDA Flash Attention 3.\n- [x] **Model Release:** Open-sourced `MOSS-VL-Base` and `MOSS-VL-Instruct`.\n- [x] **Inference:** Inference code for both image and video understanding.\n\n### 🚀 Upcoming\n- [ ] **Training Engine:** Full training code for MOSS-VL.\n- [ ] **Real-time Capabilities:** Specialized Real-time Video Understanding Model.\n- [ ] **RL Post-training:** Reinforcement Learning for MOSS-VL series.\n- [ ] **Documentation:** Comprehensive Technical Report.\n\n---\n\n## 🤝 Acknowledgement\nWe would like to express our gratitude to **NVIDIA** for the [Megatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) framework and the **Qwen Team** for their powerful [Qwen](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen) series language models, which serve as the foundation of our training infrastructure and core LLM. 
We also thank the **SGLang Team** for their high-performance [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) serving framework, which powers efficient deployment of MOSS-VL.\n\n## 📜 Citation\n```bibtex\n@misc{moss_vl_2026,\n  title         = {{MOSS-VL Technical Report}},\n  author        = {OpenMOSS Team},\n  year          = {2026},\n  howpublished  = {\\url{https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-VL}},\n  note          = {GitHub repository}\n}\n```\n\n## 🌟 Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F?repos=OpenMOSS%2FMOSS-VL&type=date&legend=top-left\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=OpenMOSS\u002FMOSS-VL&type=date&theme=dark&legend=top-left\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=OpenMOSS\u002FMOSS-VL&type=date&legend=top-left\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=OpenMOSS\u002FMOSS-VL&type=date&legend=top-left\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n\u003Cp align=\"center\">\nBuilt with ❤️ by the \u003Cb>OpenMOSS Team\u003C\u002Fb>\n\u003C\u002Fp>\n","MOSS-VL is the core multimodal model series within the OpenMOSS ecosystem, focused on visual understanding. The project systematically improves video understanding by building large-scale, high-quality multimodal datasets, scaling model parameters to capture complex vision-language correlations, and extending the temporal context to support long-form video comprehension. It is developed in Python and released under the Apache License 2.0. It suits applications that require in-depth visual analysis, such as video content analysis and image recognition.","2026-06-11 02:43:08","CREATED_QUERY"]