[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72510":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},72510,"GLM-V","zai-org\u002FGLM-V","zai-org","GLM-4.6V\u002F4.5V\u002F4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","",null,"Python",2325,171,15,11,0,4,6,19,12,28.71,"Apache License 2.0",false,"main",[26,27,28,29],"image2text","reasoning","video-understanding","vlm","2026-06-12 02:03:04","# GLM-V\n\n[中文阅读.](.\u002FREADME_zh.md)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=resources\u002Flogo.svg width=\"40%\"\u002F>\n\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n    👋 Join our \u003Ca href=\"resources\u002FWECHAT.md\" target=\"_blank\">WeChat\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FeQbGCYS9ym\" target=\"_blank\">Discord\u003C\u002Fa> communities.\n    \u003Cbr>\n    📖 Check out the GLM-4.6V \u003Ca href=\"https:\u002F\u002Fz.ai\u002Fblog\u002Fglm-4.6v\" target=\"_blank\">blog\u003C\u002Fa> and GLM-4.5V & GLM-4.1V \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006\" target=\"_blank\">paper\u003C\u002Fa>.\n    \u003Cbr>\n    📍 Try \u003Ca href=\"https:\u002F\u002Fchat.z.ai\u002F\" target=\"_blank\">online\u003C\u002Fa> or use the \u003Ca href=\"https:\u002F\u002Fdocs.z.ai\u002Fguides\u002Fvlm\u002Fglm-4.6v\" 
target=\"_blank\">API\u003C\u002Fa>.\n\u003C\u002Fp>\n\n## Introduction\n\nVision-language models (VLMs) have become a cornerstone of intelligent systems. As real-world AI tasks grow\nincreasingly complex, VLMs need reasoning capabilities that go beyond basic multimodal perception, with better\naccuracy, comprehensiveness, and intelligence, to enable complex problem solving, long-context understanding,\nand multimodal agents.\n\nThrough our open-source work, we aim to explore the technological frontier together with the community while empowering\nmore developers to create exciting and innovative applications.\n\n**This open-source repository contains our `GLM-4.6V`, `GLM-4.5V` and `GLM-4.1V` series models.** For performance and\ndetails, see [Model Overview](#model-overview). For known issues,\nsee [Remaining Issues](#remaining-issues).\n\n## Project Updates\n\n- **News**: `2026\u002F04\u002F02`: We released [GLM-5V-Turbo](https:\u002F\u002Fdocs.z.ai\u002Fguides\u002Fvlm\u002Fglm-5v-turbo)\n  and [GLM-skills](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-skills).\n- **News**: `2026\u002F03\u002F28`: We released multiple GLM-V related Skills, covering several specialized areas\n  such as GLM-V-Grounding and GLM-V-Prompt-Gen. You are welcome to try them [here](skills).\n- **News**: `2025\u002F11\u002F10`: We released **UI2Code^N**, an RL-enhanced UI coding model with UI-to-code, UI-polish, and\n  UI-edit capabilities. The model is trained on `GLM-4.1V-Base`. Check it\n  out [here](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FUI2Code_N).\n- **News**: `2025\u002F10\u002F27`: We released **Glyph**, a framework for scaling the context length through visual-text\n  compression. The Glyph model is trained on `GLM-4.1V-Base`. 
Check it\n  out [here](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGlyph).\n- **News**: `2025\u002F08\u002F11`: We released **GLM-4.5V** with significant improvements across multiple benchmarks. We also\n  open-sourced our handcrafted **desktop assistant app** for debugging. Once connected to GLM-4.5V, it can capture\n  visual information from your PC screen via screenshots or screen recordings. Feel free to try it out or customize it\n  into your own multimodal assistant. Click [here](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fzai-org\u002FGLM-4.5V-Demo-App) to download\n  the installer or [build from source](examples\u002Fvllm-chat-helper\u002FREADME.md)!\n- **News**: `2025\u002F07\u002F16`: We open-sourced the **VLM Reward System** used to train GLM-4.1V-Thinking. View\n  the [code repository](glmv_reward) and run it locally: `python examples\u002Freward_system_demo.py`.\n- **News**: `2025\u002F07\u002F01`: We released **GLM-4.1V-9B-Thinking** and\n  its [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006).\n\n## Model Implementation Code\n\n- GLM-4.5V and GLM-4.6V model algorithm: see the full implementation\n  in [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fglm4v_moe).\n- GLM-4.1V-9B-Thinking model algorithm: see the full implementation\n  in [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fglm4v).\n- Both implementations share identical multimodal preprocessing but use different conversation templates, so take care\n  to use the template that matches your model.\n\n## Model Downloads\n\n| Model                | Download Links                                                                                                                                       | Type             
|\n|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|\n| GLM-4.6V             | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V)                         | Hybrid Reasoning |\n| GLM-4.6V-FP8         | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V-FP8)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V-FP8)                 | Hybrid Reasoning |\n| GLM-4.6V-Flash       | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V-Flash)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V-Flash)             | Hybrid Reasoning |\n| GLM-4.5V             | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.5V)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.5V)                         | Hybrid Reasoning |\n| GLM-4.5V-FP8         | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.5V-FP8)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.5V-FP8)                 | Hybrid Reasoning |\n| GLM-4.1V-9B-Thinking | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.1V-9B-Thinking)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.1V-9B-Thinking) | Reasoning        |\n| GLM-4.1V-9B-Base     | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.1V-9B-Base)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.1V-9B-Base)         | Base             |\n\n\n+ Hugging Face provides GGUF format model weights. 
You can download the GGUF format model of GLM-V from [here](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fggml-org\u002Fglm-v).\n\n## Use Cases\n\n### Grounding\n\nGLM-4.5V \u002F GLM-4.6V \u002F GLM-4.1V provide precise grounding capabilities. Given a prompt that requests the location of a specific object, the model\ncan reason step by step and identify the bounding boxes of the target object. The query prompt supports\ncomplex descriptions of the target object as well as specified output formats, for example:\n>\n> - Help me to locate \u003Cexpr> in the image and give me its bounding boxes.\n> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. \u003Cexpr>\n\nHere, `\u003Cexpr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$\ncomposed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image\nwidth (for x) or height (for y) and scaled by 1000.\n\nIn the response, the special tokens `\u003C|begin_of_box|>` and `\u003C|end_of_box|>` mark the bounding box in\nthe answer. The bracket style may vary ([], [[]], (), \u003C>, etc.), but the meaning is the same: the brackets enclose the\ncoordinates of the box.\n\n### GUI Agent\n\n- `examples\u002Fgui-agent`: Demonstrates prompt construction and output handling for GUI Agents, including strategies for\n  mobile, PC, and web. Prompt templates differ between GLM-4.1V and GLM-4.5V.\n\n### Quick Demo\n\n- `examples\u002Fvlm-helper`: A desktop assistant for GLM multimodal models (mainly GLM-4.5V, compatible with GLM-4.1V),\n  supporting text, images, videos, PDFs, PPTs, and more. It connects to the GLM multimodal API to provide intelligent\n  services across scenarios. 
Download the [installer](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fzai-org\u002FGLM-4.5V-Demo-App)\n  or [build from source](examples\u002Fvlm-helper\u002FREADME.md).\n\n## Quick Start\n\n### Environment Installation\n\n```bash\npip install -r requirements.txt\n```\n\n- vLLM and SGLang dependencies may conflict, so we recommend installing only one of them per environment.\n- After installation, check the installed version of `transformers` and upgrade it to `5.2.0` or above if needed.\n\n### transformers\n\n- `trans_infer_cli.py`: CLI for continuous conversations using the `transformers` backend.\n- `trans_infer_gradio.py`: Gradio web interface with multimodal input (images, videos, PDFs, PPTs) using the `transformers`\n  backend.\n- `trans_infer_bench`: Academic reproduction script for `GLM-4.1V-9B-Thinking`. It forces reasoning truncation at length\n  `8192` and requests direct answers afterward. Includes a video input example; modify for other cases.\n\n### vLLM\n\n```bash\nvllm serve zai-org\u002FGLM-4.6V \\\n     --tensor-parallel-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --enable-auto-tool-choice \\\n     --served-model-name glm-4.6v \\\n     --allowed-local-media-path \u002F \\\n     --mm-encoder-tp-mode data \\\n     --mm_processor_cache_type shm\n```\n\nFor more details, check [vLLM Recipes](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Frecipes\u002Fblob\u002Fmain\u002FGLM\u002FGLM-V.md).\n\n### SGLang\n\n```shell\nsglang serve --model-path zai-org\u002FGLM-4.6V \\\n     --tp-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --served-model-name glm-4.6v \\\n     --mm-enable-dp-encoder \\\n     --port 8000 \\\n     --host 0.0.0.0\n```\n\nNotes:\n\n- We recommend increasing `SGLANG_VLM_CACHE_SIZE_MB` (e.g., `1024`) to provide sufficient cache space for video\n  understanding.\n- When using `vLLM` or `SGLang`, thinking mode is enabled by default. 
To disable thinking, add the following to your request:\n  `extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}}`\n- You can configure a thinking budget to limit the model’s maximum reasoning length. Import the logit processor:\n\n  ```python\n  from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor\n  ```\n\n  and pass it in the request:\n\n  ```python\n  extra_body={\n      \"custom_logit_processor\": Glm4MoeThinkingBudgetLogitProcessor().to_str(),\n      \"custom_params\": {\n          \"thinking_budget\": 8192,  # max reasoning length in tokens\n      },\n  },\n  ```\n\n### xLLM\n\nCheck [here](examples\u002FAscend_NPU\u002FREADME_zh.md) for detailed instructions.\n\n## Integration with Other Automation Tools\n\n### Midscene.js\n\n[Midscene.js](https:\u002F\u002Fmidscenejs.com\u002Fen\u002Findex.html) is an open-source UI automation SDK driven by vision models. It supports multi-platform automation through JavaScript or a YAML-format process syntax.\n\nMidscene.js is integrated with GLM-V models. You can try GLM-V through the [Midscene.js Integration Guide](https:\u002F\u002Fmidscenejs.com\u002Fmodel-common-config.html#glm-v).\n\nHere are two examples to help you get started quickly:\n\n- [Call Midscene.js via TypeScript scripts](.\u002Fexamples\u002Fmidscene-ts-demo)\n- [Experience Midscene.js via YAML scripts](.\u002Fexamples\u002Fmidscene-yaml-demo)\n\n## Model Fine-tuning\n\n[LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) already supports fine-tuning for the GLM-4.5V and\nGLM-4.1V-9B-Thinking models. Below is an example of dataset construction using two images. 
Organize your\ndataset into `finetune.json` in the following format. This example is for fine-tuning GLM-4.1V-9B.\n\n```json\n[\n  {\n    \"messages\": [\n      {\n        \"content\": \"\u003Cimage>Who are they?\",\n        \"role\": \"user\"\n      },\n      {\n        \"content\": \"\u003Cthink>\\nUser asked me to observe the image and find the answer. I know they are Kane and Goretzka from Bayern Munich.\u003C\u002Fthink>\\n\u003Canswer>They're Kane and Goretzka from Bayern Munich.\u003C\u002Fanswer>\",\n        \"role\": \"assistant\"\n      },\n      {\n        \"content\": \"\u003Cimage>What are they doing?\",\n        \"role\": \"user\"\n      },\n      {\n        \"content\": \"\u003Cthink>\\nI need to observe what these people are doing. Oh, they are celebrating on the soccer field.\u003C\u002Fthink>\\n\u003Canswer>They are celebrating on the soccer field.\u003C\u002Fanswer>\",\n        \"role\": \"assistant\"\n      }\n    ],\n    \"images\": [\n      \"mllm_demo_data\u002F1.jpg\",\n      \"mllm_demo_data\u002F2.jpg\"\n    ]\n  }\n]\n```\n\n1. The content inside `\u003Cthink> ... \u003C\u002Fthink>` will **not** be stored as conversation history or in fine-tuning data.\n2. Each `\u003Cimage>` tag will be replaced with the corresponding image information.\n3. 
For the GLM-4.5V model, the `\u003Canswer>` and `\u003C\u002Fanswer>` tags should be removed.\n\nThen, you can fine-tune following the standard LLaMA-Factory procedure.\n\n## Model Overview\n\n### GLM-4.6V\n\nThe GLM-4.6V series includes two models: GLM-4.6V (106B), a foundation model designed for cloud and high-performance\ncluster scenarios,\nand GLM-4.6V-Flash (9B), a lightweight model optimized for local deployment and low-latency applications.\nGLM-4.6V scales its context window to 128k tokens in training\nand achieves SoTA performance in visual understanding among models of similar parameter scales.\nCrucially, we integrate native Function Calling capabilities for the first time.\nThis effectively bridges the gap between \"visual perception\" and \"executable action\",\nproviding a unified technical foundation for multimodal agents in real-world business scenarios.\n\n![GLM-4.6V Benchmarks](resources\u002Fbench_46v.jpeg)\n\nBeyond achieving SoTA performance across major multimodal benchmarks at comparable model scales, GLM-4.6V introduces\nseveral key features:\n\n- **Native Multimodal Function Calling**\nEnables native vision-driven tool use. Images, screenshots, and document pages can be passed directly as tool inputs without text conversion, while visual outputs (charts, search images, rendered pages) are interpreted and integrated into the reasoning chain. This closes the loop from perception to understanding to execution.\n\n- **Interleaved Image-Text Content Generation**\nSupports high-quality mixed media creation from complex multimodal inputs. GLM-4.6V takes a multimodal context (documents, user inputs, and tool-retrieved images) and synthesizes coherent, interleaved image-text content tailored to the task. 
During generation it can actively call search and retrieval tools to gather and curate additional text and visuals, producing rich, visually grounded content.\n\n- **Multimodal Document Understanding**\nGLM-4.6V can process up to 128K tokens of multi-document or long-document input, directly interpreting richly formatted pages as images. It understands text, layout, charts, tables, and figures jointly, enabling accurate comprehension of complex, image-heavy documents without requiring prior conversion to plain text.\n\n- **Frontend Replication & Visual Editing**\nReconstructs pixel-accurate HTML\u002FCSS from UI screenshots and supports natural-language-driven edits. It detects layout, components, and styles visually, generates clean code, and applies iterative visual modifications through simple user instructions.\n\n### GLM-4.5V\n\nGLM-4.5V is based on ZhipuAI’s GLM-4.5-Air.\nIt continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on\n42 public vision-language benchmarks.\nIt covers common tasks such as image, video, and document understanding, as well as GUI agent operations.\n\nBeyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle\ndiverse types of visual content, enabling full-spectrum vision reasoning, including:\n\n- **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition)\n- **Video understanding** (long video segmentation and event recognition)\n- **GUI tasks** (screen reading, icon recognition, desktop operation assistance)\n- **Complex chart & long document parsing** (research report analysis, information extraction)\n- **Grounding** (precise visual element localization)\n\nThe model also introduces a **Thinking Mode** switch, allowing users to balance between quick responses and deep\nreasoning. 
This switch works the same as in the `GLM-4.5` language model.\n\n### GLM-4.1V-9B\n\nBuilt on the [GLM-4-9B-0414](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-4) foundation model, the **GLM-4.1V-9B-Thinking** model\nintroduces a reasoning paradigm and uses RLCS (Reinforcement Learning with Curriculum Sampling) to comprehensively\nenhance model capabilities.\nIt achieves the strongest performance among 10B-level VLMs and matches or surpasses the much larger Qwen-2.5-VL-72B in\n18 benchmark tasks.\n\nWe also open-sourced the base model **GLM-4.1V-9B-Base** to support researchers in exploring the limits of\nvision-language model capabilities.\n\n![rl](resources\u002Frl.jpeg)\n\nCompared with the previous generation CogVLM2 and GLM-4V series, **GLM-4.1V-Thinking** brings:\n\n1. The series’ first reasoning-focused model, excelling in multiple domains beyond mathematics.\n2. **64k** context length support.\n3. Support for **any aspect ratio** and up to **4k** image resolution.\n4. A bilingual (Chinese\u002FEnglish) open-source version.\n\nGLM-4.1V-9B-Thinking integrates the **Chain-of-Thought** reasoning mechanism, improving accuracy, richness, and\ninterpretability.\nIt leads on 23 out of 28 benchmark tasks at the 10B parameter scale, and outperforms Qwen-2.5-VL-72B on 18 tasks despite\nits smaller size.\n\n## Remaining Issues\n\nSince the open-sourcing of GLM-4.1V, we have received extensive feedback from the community and are well aware that the model still has many shortcomings. In subsequent iterations, we attempted to address several common issues — such as repetitive thinking outputs and formatting errors — which have been mitigated to some extent in this new version.\n\nHowever, the model still has several limitations and issues that we will fix as soon as possible:\n\n1. Pure text QA capabilities still have significant room for improvement. 
In this development cycle, our primary focus was on visual multimodal scenarios, and we will enhance pure text abilities in upcoming updates.\n2. The model may still overthink or even repeat itself in certain cases, especially when dealing with complex prompts.\n3. In some situations, the model may restate the answer again at the end.\n4. There remain certain perception limitations, such as counting accuracy and identifying specific individuals, which still require improvement.\n\nThank you for your patience and understanding. We also welcome feedback and suggestions in the issue section — we will respond and improve as much as we can!\n\n## Citation\n\nIf you use this model, please cite the following paper:\n\n```bibtex\n@misc{vteam2025glm45vglm41vthinkingversatilemultimodal,\n      title={GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},\n      author={V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Bin Chen and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiale Zhu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingdao Liu and Mingde Xu and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Tianyu Tong and Wenkai Li and Wei Jia and Xiao Liu and Xiaohan Zhang and Xin Lyu and Xinyue Fan and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yanzi Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and 
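Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuting Wang and Yu Wang and Yuxuan Zhang and Zhao Xue and Zhenyu Hou and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},\n      year={2025},\n      eprint={2507.01006},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006},\n}\n```\n","GLM-V is a vision-language model project for multimodal reasoning. It aims to improve model performance in complex problem solving, long-context understanding, and multimodal agents through scalable reinforcement learning. Its core capabilities include image-to-text conversion, video understanding, and advanced reasoning, and it supports a range of applications such as a desktop assistant app and UI coding. It is developed in Python and suits developers who need to process and understand multimedia information and make complex decisions.",2,"2026-06-11 03:42:21","high_star"]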