[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78039":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},78039,"Thinking-with-Visual-Primitives","mitkox\u002FThinking-with-Visual-Primitives","mitkox","Clone of DeepSeek Thinking-with-Visual-Primitives ",null,"Makefile",133,108,104,1,0,4,10,29,12,62.01,"MIT License",false,"main",true,[],"2026-06-12 04:01:23","\u003C!-- markdownlint-disable first-line-h1 -->\n\u003C!-- markdownlint-disable html -->\n\u003C!-- markdownlint-disable no-duplicate-header -->\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"images\u002Flogo.svg\" width=\"60%\" alt=\"DeepSeek LLM\" \u002F>\n\u003C\u002Fdiv>\n\u003Chr>\n\n\u003Cdiv align=\"center\">\n\u003Ch1>Thinking with Visual Primitives\u003C\u002Fh1>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n  \u003Ca href=\"https:\u002F\u002Fwww.deepseek.com\u002F\" target=\"_blank\">\n    \u003Cimg alt=\"Homepage\" src=\"images\u002Fbadge.svg\" \u002F>\n  \u003C\u002Fa>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\" target=\"_blank\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\n\n\u003Cdiv align=\"center\">\n\n  \u003Ca href=\"LICENSE-CODE\">\n    \u003Cimg 
alt=\"Code License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode_License-MIT-f5de53?&color=f5de53\">\n  \u003C\u002Fa>\n  \u003Ca href=\"LICENSE-MODEL\">\n    \u003Cimg alt=\"Model License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel_License-Model_Agreement-f5de53?&color=f5de53\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#2-license\">\u003Cb>📜 License\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca href=\"#3-citation\">\u003Cb>📖 Citation\u003C\u002Fb>\u003C\u002Fa> \u003Cbr>\n  \u003C!-- 📄 Paper Link (\u003Ca href=\"\">\u003Cb>Thinking with Visual Primitives\u003C\u002Fb>\u003C\u002Fa> | -->\n\n\u003C\u002Fp>\n\n\n## News\n\n**2026.04.30**: We have released the [technical report](.\u002FThinking_with_Visual_Primitives.pdf) detailing our approach. In the near future, we plan to make the in-house benchmarks and a subset of our cold-start data publicly available. The model weights will be integrated into our foundation model and released in the future.\n\n\n\n## 1. Introduction\nWhile recent Multimodal Large Language Models (MLLMs) have made strides in bridging the *\"Perception Gap\"* (e.g., through high-resolution cropping or thinking with images), they still struggle with complex structural reasoning. We identify this bottleneck as the **Reference Gap**: natural language is too ambiguous to point precisely to elements in dense spatial layouts, which often leads to logical collapse and hallucinations in the thinking process.\n\nThis project introduces a paradigm shift. 
Instead of just \"seeing clearer\", our model learns to **\"point while it reasons.\"** By interleaving spatial markers (points and bounding boxes) directly into the reasoning trajectory as *minimal units of thought*, we anchor abstract linguistic concepts to concrete physical coordinates.\n\n\u003Ctable align=\"center\">\n  \u003Ctr>\n    \u003Ctd align=\"center\" valign=\"top\">\n      \u003Cimg src=\".\u002Fimages\u002Fcoffee.gif\" style=\"height:250px; width:auto; max-width:none;\" \u002F>\u003Cbr>      \n      \u003Cb>Grounded Task Reasoning\u003C\u002Fb>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" valign=\"top\">\n      \u003Cimg src=\".\u002Fimages\u002Fmaze.gif\" style=\"height:250px; width:auto; max-width:none;\" \u002F>\u003Cbr>\n      \u003Cb>Topological Reasoning\u003C\u002Fb>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\n### Key Highlights\n\n*  **Point-to-Reason Synergy:** Mimicking human cognitive behavior (like using a finger to count or trace a maze), our framework elevates visual primitives to minimal units of thought, effectively solving the Reference Gap in complex structural reasoning.\n*  **Extreme Visual Token Efficiency:** Built upon the architecture of DeepSeek-V4-Flash, we compress the KV cache of every 4 visual tokens into a single entry, drastically reducing image token consumption while maintaining cognitive depth.\n*  **Frontier-Competitive Performance:** Despite a compact model scale and a significantly lower image-token budget, our model matches frontier models like **GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash** across challenging counting and spatial reasoning benchmarks. 
(We note that the reported scores cover only a subset of evaluation dimensions that are directly relevant to the research focus of this paper, and are therefore not indicative of the models' overall capabilities.)\n\n\n\u003Cdiv align=\"center\">\n\u003Cimg alt=\"image\" src=\"images\u002Fteaser.png\" style=\"width:90%;\">\n\u003C\u002Fdiv>\n\n\n\n\n\n\n## 2. License\n\nThis code repository is licensed under [the MIT License](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-LLM\u002Fblob\u002FHEAD\u002FLICENSE-CODE).\n\n## 3. Citation\n\n```bibtex\n@article{lu2026think,\n  title={Thinking with Visual Primitives},\n  author={Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and Luo, Lingxiao and Wu, Zhiyu and Pan, Zizheng and Liu, Xingchao and Lin, Yutong and Li, Hao and Liu, Wen and Hao, Zhewen and Gao, Xi and Nie, Shaoheng and Wei, Yixuan and Xie, Zhenda and Chen, Ting and Zeng, Gang},\n  year={2026}\n}\n\n```\n\n## 4. Contact\n\nIf you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).\n","This project is a clone of a multimodal large language model that thinks with visual primitives, aiming to address the \"Reference Gap\" that existing models face in complex structural reasoning. Its core idea is to embed spatial markers (such as points and bounding boxes) directly into the reasoning process as minimal units of thought, anchoring abstract linguistic concepts to concrete physical coordinates and enabling more precise spatial localization and logical reasoning. Technically, the model mimics human cognitive behavior, such as counting with a finger or tracing a path through a maze, to strengthen the role of visual primitives in reasoning. It suits applications that require high-precision understanding and manipulation of spatial layouts, such as object recognition and path planning.",2,"2026-06-11 03:56:23","CREATED_QUERY"]