[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78376":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":33,"discoverSource":34},78376,"Thinking-with-Visual-Primitives","ailuntx\u002FThinking-with-Visual-Primitives","ailuntx","Archived snapshot of Thinking-with-Visual-Primitives","https:\u002F\u002Fwww.deepseek.com",null,"Makefile",251,60,3,0,2,12,120,6,65.36,"MIT License",false,"main",[25,26,27,28,29],"deepseek","grounding","multimodal","spatial-reasoning","vision-language-model","2026-06-12 04:01:23","\u003C!-- markdownlint-disable first-line-h1 -->\n\u003C!-- markdownlint-disable html -->\n\u003C!-- markdownlint-disable no-duplicate-header -->\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"images\u002Flogo.svg\" width=\"60%\" alt=\"DeepSeek LLM\" \u002F>\n\u003C\u002Fdiv>\n\u003Chr>\n\n\u003Cdiv align=\"center\">\n\u003Ch1>Thinking with Visual Primitives\u003C\u002Fh1>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n  \u003Ca href=\"https:\u002F\u002Fwww.deepseek.com\u002F\" target=\"_blank\">\n    \u003Cimg alt=\"Homepage\" src=\"images\u002Fbadge.svg\" \u002F>\n  \u003C\u002Fa>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\" target=\"_blank\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white\" 
\u002F>\n  \u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\".\u002FREADME.md\">\u003Cb>English\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca href=\".\u002FREADME_zh.md\">\u003Cb>简体中文\u003C\u002Fb>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n\n\u003Cdiv align=\"center\">\n\n  \u003Ca href=\"LICENSE-CODE\">\n    \u003Cimg alt=\"Code License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode_License-MIT-f5de53?&color=f5de53\">\n  \u003C\u002Fa>\n  \u003Ca href=\"LICENSE-MODEL\">\n    \u003Cimg alt=\"Model License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel_License-Model_Agreement-f5de53?&color=f5de53\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#2-license\">\u003Cb>📜 License\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca href=\"#3-citation\">\u003Cb>📖 Citation\u003C\u002Fb>\u003C\u002Fa> \u003Cbr>\n  \u003C!-- 📄 Paper Link (\u003Ca href=\"\">\u003Cb>Thinking with Visual Primitives\u003C\u002Fb>\u003C\u002Fa> | -->\n\n\u003C\u002Fp>\n\n> [!IMPORTANT]\n> This repository was originally obtained from a source repository previously associated with `charlesCXK`, which is currently unavailable.\n>\n> The original upstream\u002Ffork relationship is no longer reliably preserved. This repository should be treated as a community mirror\u002Farchive rather than an authoritative source.\n>\n> There is currently no known replacement official repository for this project. Please follow future updates or any re-release from the following sources:\n>\n> - the `charlesCXK` profile: https:\u002F\u002Fgithub.com\u002FcharlesCXK\n> - the DeepSeek organization: https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\n\n\n## News\n\n**2026.05.22**: As of May 22, 2026, the original source repository and the previously referenced DeepSeek organization repository for `Thinking-with-Visual-Primitives` are unavailable. No official replacement repository or re-release announcement has been found. 
Please follow the `charlesCXK` profile and the DeepSeek organization for future updates.\n\n**2026.04.30**: We have released the [technical report](.\u002FThinking_with_Visual_Primitives.pdf) detailing our approach. In the near future, we plan to make the in-house benchmarks and a subset of our cold-start data publicly available. The model weights will be integrated into our foundation model and released in the future.\n\n\n\n## 1. Introduction\nWhile recent Multimodal Large Language Models (MLLMs) have made strides in bridging the *\"Perception Gap\"* (e.g., through high-resolution cropping or thinking with images), they still struggle with complex structural reasoning. We identify this bottleneck as the **Reference Gap**: natural language is simply too ambiguous to precisely point to dense spatial layouts, often leading to logical collapse and hallucinations in the thinking process.\n\nThis project introduces a paradigm shift. Instead of just \"seeing more clearly\", our model learns to **\"point while it reasons.\"** By interleaving spatial markers (points and bounding boxes) directly into the reasoning trajectory as *minimal units of thought*, we anchor abstract linguistic concepts to concrete physical coordinates.\n\n\u003Ctable align=\"center\">\n  \u003Ctr>\n    \u003Ctd align=\"center\" valign=\"top\">\n      \u003Cimg src=\".\u002Fimages\u002Fcoffee.gif\" style=\"height:250px; width:auto; max-width:none;\" \u002F>\u003Cbr>      \n      \u003Cb>Grounded Task Reasoning\u003C\u002Fb>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" valign=\"top\">\n      \u003Cimg src=\".\u002Fimages\u002Fmaze.gif\" style=\"height:250px; width:auto; max-width:none;\" \u002F>\u003Cbr>\n      \u003Cb>Topological Reasoning\u003C\u002Fb>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\n### Key Highlights\n\n*  **Point-to-Reason Synergy:** Mimicking human cognitive behavior (like using a finger to count or trace a maze), our framework elevates visual primitives to 
minimal units of thought, effectively closing the Reference Gap in complex structural reasoning.\n*  **Extreme Visual Token Efficiency:** Built upon the architecture of DeepSeek-V4-Flash, we compress the KV cache of every 4 visual tokens into a single entry, drastically reducing image token consumption while maintaining cognitive depth.\n*  **Frontier-Competitive Performance:** Despite a compact model scale and a significantly lower image-token budget, our model matches frontier models like **GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash** across challenging counting and spatial reasoning benchmarks. (We note that the reported scores cover only a subset of evaluation dimensions that are directly relevant to the research focus of this paper, and are therefore not indicative of the models' overall capabilities.)\n\n\n\u003Cdiv align=\"center\">\n\u003Cimg alt=\"image\" src=\"images\u002Fteaser.png\" style=\"width:90%;\">\n\u003C\u002Fdiv>\n\n\n\n\n\n\n## 2. License\n\nThis code repository is licensed under [the MIT License](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-LLM\u002Fblob\u002FHEAD\u002FLICENSE-CODE).\n\n## 3. Citation\n\n```bibtex\n@article{lu2026think,\n  title={Thinking with Visual Primitives},\n  author={Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and Luo, Lingxiao and Wu, Zhiyu and Pan, Zizheng and Liu, Xingchao and Lin, Yutong and Li, Hao and Liu, Wen and Hao, Zhewen and Gao, Xi and Nie, Shaoheng and Wei, Yixuan and Xie, Zhenda and Chen, Ting and Zeng, Gang},\n  year={2026}\n}\n\n```\n\n## 4. Contact\n\nIf you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).\n","Thinking with Visual Primitives is a project focused on improving the complex structural reasoning of multimodal large language models. It introduces visual primitives (points and bounding boxes) to strengthen the model's grasp of dense spatial layouts, addressing the ambiguity of natural-language descriptions. Where earlier approaches rely on high-resolution cropping or thinking with images to narrow the perception gap, this project targets the reference gap by interleaving spatial markers into the reasoning trajectory. It suits applications that need precise visual understanding and spatial reasoning, such as robot navigation and image understanding and generation.","2026-06-11 03:56:46","CREATED_QUERY"]