[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80025":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":15,"starSnapshotCount":15,"syncStatus":14,"lastSyncTime":31,"discoverSource":32},80025,"Thinking-with-Visual-Primitives-pytorch","vra\u002FThinking-with-Visual-Primitives-pytorch","vra","Unofficial PyTorch reproduction of DeepSeek's Thinking with Visual Primitives.  ","",null,"Python",75,6,2,0,4,8,12,2.54,"MIT License",false,"main",[24,25,26,27],"deepseek","llms","opd","pytorch","2026-06-12 02:03:57","# Thinking with Visual Primitives — PyTorch Implementation\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyunfengwang\u002FTVP-Demo\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%8E%AF%20Demo-HF%20Spaces-ff69b4\" alt=\"Demo\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Models-yellow\" alt=\"Hugging Face\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyunfengwang\u002FTVP-Training-Data\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%93%A6%20Dataset-HF-orange\" alt=\"Dataset\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmitkox\u002FThinking-with-Visual-Primitives\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%93%91%20Original-Repo-b31b1b\" alt=\"Original\">\u003C\u002Fa>\n    \u003Ca href=\"README_ZH.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%87%A8%F0%9F%87%B3%20%E4%B8%AD%E6%96%87-README-blue\" alt=\"中文文档\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyunfengwang\u002FTVP-Demo\">Online Demo\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-OPD-Qwen2VL-2B\">OPD Model\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-SFTBox-Qwen2VL-2B\">SFT Box Expert\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-SFTPoint-Qwen2VL-2B\">SFT Point Expert\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-Pretrain-Qwen2VL-2B\">Pretrain\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyunfengwang\u002FTVP-Training-Data\">Dataset\u003C\u002Fa>\n\u003C\u002Fp>\n\n> Unofficial PyTorch reproduction of [*Thinking with Visual Primitives*](https:\u002F\u002Fgithub.com\u002Fmitkox\u002FThinking-with-Visual-Primitives).\n\n> **Note**: Due to compute constraints, all training stages use **LoRA** fine-tuning instead of full-parameter training. 
This is a **feasibility verification** of the pipeline — results demonstrate the approach works but have room for improvement with more compute (full fine-tuning, larger datasets, bigger models).\n\nThis project implements a multi-stage training pipeline that teaches multimodal LLMs to reason with **bounding boxes** and **points** as first-class \"thought units\" — interleaving spatial coordinates within chain-of-thought to close the **Reference Gap** in visual reasoning.\n\n## Overview\n\n```\nStage 1: Pretraining        — Learn to output visual primitive format\nStage 2: Specialized SFT    — Expert fine-tuning (Box expert + Point expert)\nStage 3: On-Policy Distill  — Distill both experts into a unified model\n```\n\nThe model outputs structured thinking with embedded coordinates:\n\n```\n1. Analyzing the request\nThe user asks me to locate the cat in this image.\n2. Object grounding\nI see a \u003C|ref|>cat\u003C|\u002Fref|>\u003C|box|>[[370,334,408,497]]\u003C|\u002Fbox|>.\n3. Conclusion\nThe cat is located at the specified coordinates.\n```\n\n## Models\n\n| Model | HuggingFace | Description |\n|-------|-------------|-------------|\n| Pretrain | [yunfengwang\u002FTVP-Pretrain-Qwen2VL-2B](https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-Pretrain-Qwen2VL-2B) | Base model with visual primitive format |\n| SFT Box Expert | [yunfengwang\u002FTVP-SFTBox-Qwen2VL-2B](https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-SFTBox-Qwen2VL-2B) | Grounding specialist (counting, spatial) |\n| SFT Point Expert | [yunfengwang\u002FTVP-SFTPoint-Qwen2VL-2B](https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-SFTPoint-Qwen2VL-2B) | Pointing specialist (maze, path tracing) |\n| OPD Unified | [yunfengwang\u002FTVP-OPD-Qwen2VL-2B](https:\u002F\u002Fhuggingface.co\u002Fyunfengwang\u002FTVP-OPD-Qwen2VL-2B) | Final distilled model (both capabilities) |\n\n## Example Results\n\nWe compare the model outputs across three training stages on the same images, showing 
complete model responses.\n\n### Grounding: \"Locate the dog in the image.\"\n\n\u003Ctable>\n\u003Ctr>\u003Cth>Stage\u003C\u002Fth>\u003Cth>Complete Output\u003C\u002Fth>\u003Cth>Visualization\u003C\u002Fth>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>Pretrain\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n\u003C|ref|>[[108,64,999,999]]\u003C|\u002Fref|>\n```\nNo label, wrong tag usage, oversized box (covers entire image).\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fpretrain_img_004.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>SFT Box\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to locate the dog in this image.\n2. **Object grounding**\nI see a \u003C|ref|>dog\u003C|\u002Fref|>\u003C|box|>[[112,68,782,953]]\u003C|\u002Fbox|>.\n3. **Conclusion**\nThe dog is located at the specified coordinates.\nThe dog is located at [[112,68,782,953]].\n```\nStructured thinking with accurate box tightly around the dog.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fsft_box_img_004.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>OPD\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to locate the dog in this image.\n2. **Object grounding**\nI see a \u003C|ref|>dog\u003C|\u002Fref|>\u003C|box|>[[112,68,782,953]]\u003C|\u002Fbox|>.\n3. 
**Conclusion**\nThe dog is located at the specified coordinates.\nThe dog is located at [[112,68,782,953]].\n```\nIdentical to SFT Box — box coordinates match exactly.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fopd_img_004.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### Grounding: \"Locate the person in the image.\"\n\n\u003Ctable>\n\u003Ctr>\u003Cth>Stage\u003C\u002Fth>\u003Cth>Complete Output\u003C\u002Fth>\u003Cth>Visualization\u003C\u002Fth>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>Pretrain\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n\u003C|ref|>Person\u003C|\u002Fref|>\u003C|box|>[[480,201,999,999]]\u003C|\u002Fbox|>\n```\nNo structured thinking. Box is oversized (covers most of the image).\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fpretrain_img_001.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>SFT Box\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to locate the person in this image.\n2. **Object grounding**\nI see a \u003C|ref|>person\u003C|\u002Fref|>\u003C|box|>[[511,208,738,963]]\u003C|\u002Fbox|>.\n3. **Conclusion**\nThe person is located at the specified coordinates.\nThe person is located at [[511,208,738,963]].\n```\nStructured thinking with accurate box on the silhouette.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fsft_box_img_001.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>OPD\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to locate the person in this image.\n2. **Object grounding**\nAfter carefully scanning the entire image,\nI do not see any person present.\n3. 
**Conclusion**\nThere is no person in this image.\n```\nKnown limitation: OPD loses silhouette detection after multi-task distillation (SFT Box handles it correctly).\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fopd_img_001.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### Grounding: \"Locate the sports ball in the image.\"\n\n\u003Ctable>\n\u003Ctr>\u003Cth>Stage\u003C\u002Fth>\u003Cth>Complete Output\u003C\u002Fth>\u003Cth>Visualization\u003C\u002Fth>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>Pretrain\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n\u003C|ref|>[[278,267,729,759]]\u003C|\u002Fref|>\n```\nNo label, wrong tag usage, oversized box.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fpretrain_img_003.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>SFT Box\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to locate the sports ball in this image.\n2. **Object grounding**\nI see a \u003C|ref|>sports ball\u003C|\u002Fref|>\u003C|box|>[[277,244,479,510]]\u003C|\u002Fbox|>.\n3. **Conclusion**\nThe sports ball is located at the specified coordinates.\nThe sports ball is located at [[277,244,479,510]].\n```\nCorrect label, tight box, full structured thinking.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fsft_box_img_003.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>OPD\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to locate the sports ball in this image.\n2. **Object grounding**\nI see a \u003C|ref|>sports ball\u003C|\u002Fref|>\u003C|box|>[[277,244,479,510]]\u003C|\u002Fbox|>.\n3. 
**Conclusion**\nThe sports ball is located at the specified coordinates.\nThe sports ball is located at [[277,244,479,510]].\n```\nIdentical quality to SFT Box (KL=0.35).\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fopd_img_003.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### Counting: \"How many people are in the image?\"\n\n\u003Ctable>\n\u003Ctr>\u003Cth>Stage\u003C\u002Fth>\u003Cth>Complete Output\u003C\u002Fth>\u003Cth>Visualization\u003C\u002Fth>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>Pretrain\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\nThere are two people in the image.\n```\nCorrect count but no boxes or structured thinking.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fpretrain_img_000.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>SFT Box\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to count the person in this image.\n2. **Object grounding**\nI see 2 instance(s) of \u003C|ref|>person\u003C|\u002Fref|>\n\u003C|box|>[[338,121,630,923],[634,154,888,945]]\u003C|\u002Fbox|>.\n3. **Conclusion**\nThere are 2 person in this image.\n```\nStructured thinking with count + bounding boxes for each person.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fsft_box_img_000.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>OPD\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to count the person in this image.\n2. **Object grounding**\nI see 2 instance(s) of \u003C|ref|>person\u003C|\u002Fref|>\n\u003C|box|>[[339,130,632,923],[642,157,887,940]]\u003C|\u002Fbox|>.\n3. 
**Conclusion**\nThere are 2 person in this image.\n```\nPreserves SFT Box quality — structured counting with boxes.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fopd_img_000.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### Counting: \"How many sports balls are in the image?\"\n\n\u003Ctable>\n\u003Ctr>\u003Cth>Stage\u003C\u002Fth>\u003Cth>Complete Output\u003C\u002Fth>\u003Cth>Visualization\u003C\u002Fth>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>Pretrain\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\nThere are four sports balls in the image.\n```\nCorrect count, no boxes.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fpretrain_img_002.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>SFT Box\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to count out sports balls in this image.\n2. **Object grounding**\nI see 4 instance(s) of \u003C|ref|>sports ball\u003C|\u002Fref|>\n\u003C|box|>[[386,265,653,587],[87,497,445,752],\n[448,587,682,878],[642,456,904,695]]\u003C|\u002Fbox|>.\n3. **Conclusion**\nThere are 4 sports ball in this image.\n```\nStructured thinking with 4 bounding boxes — fixed by prompt template diversification.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fsft_box_img_002.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>\u003Cb>OPD\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd>\n\n```\n1. **Analyzing the request**\nThe user asks me to count out sports balls in this image.\n2. **Object grounding**\nI see 4 instance(s) of \u003C|ref|>sports ball\u003C|\u002Fref|>\n\u003C|box|>[[386,263,653,587],[87,497,445,752],\n[510,571,675,878],[643,453,904,695]]\u003C|\u002Fbox|>.\n3. 
**Conclusion**\nThere are 4 sports ball in this image.\n```\nPreserves counting ability with accurate boxes.\n\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"test-images\u002Fresults\u002Fopd_img_002.jpg\" width=\"250\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n**Key observations:**\n- **Pretrain** learns the visual primitive token format but produces oversized boxes and no structured thinking\n- **SFT Box** adds structured thinking (Analyzing → Grounding → Conclusion) with accurate bounding boxes; counting now produces boxes via prompt template diversification (3→12 templates with plural forms)\n- **OPD** combines box and point capabilities into one model via distillation, but may lose some SFT Box quality on edge cases (e.g., silhouette detection) due to multi-task trade-offs\n- **Prompt diversity fix**: expanding counting templates from 3 to 12 (with singular\u002Fplural, \"the\"\u002F\"this\", varied verbs) fixed the issue where \"How many sports balls are in the image?\" produced plain text instead of structured thinking with boxes\n- **Distillation tuning**: lr=5e-7 + temperature=1.0 + positive-only grounding data prevents catastrophic forgetting during multi-task distillation\n- **neg_ratio tuning**: reducing negative sample ratio from 0.30 to 0.15 in SFT fixed over-rejection\n\n## Quick Start\n\n### Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fvra\u002FThinking-with-Visual-Primitives-pytorch.git\ncd Thinking-with-Visual-Primitives-pytorch\n\nconda create -n vprim python=3.10 -y\nconda activate vprim\npip install -r requirements.txt\n```\n\n**Requirements**: Python ≥ 3.9, CUDA ≥ 11.8, GPU with 12GB+ VRAM (tested on RTX 4070 Ti 12GB).\n\n### Gradio Demo\n\n```bash\n# Interactive web demo with visualization\npython app.py --model_path outputs\u002Fopd\u002Ffinal --load_in_4bit\n\n# Or use HuggingFace model directly\npython app.py --model_path yunfengwang\u002FTVP-OPD-Qwen2VL-2B --load_in_4bit\n\n# Public shareable 
link\npython app.py --model_path outputs\u002Fopd\u002Ffinal --load_in_4bit --share\n```\n\nUpload an image, type a prompt (e.g., \"Locate the cat\"), and see the model's structured reasoning with bounding boxes drawn on the image.\n\n### Inference (CLI)\n\n```bash\n# Single image inference\npython scripts\u002Finference_demo.py \\\n    --model_path outputs\u002Fopd\u002Ffinal \\\n    --image your_image.jpg \\\n    --prompt \"Locate the person in the image.\"\n\n# With 4-bit quantization (saves VRAM)\npython scripts\u002Finference_demo.py \\\n    --model_path outputs\u002Fopd\u002Ffinal \\\n    --image your_image.jpg \\\n    --prompt \"Locate the person in the image.\" \\\n    --load_in_4bit\n\n# Batch inference on JSONL\npython scripts\u002Finference_demo.py \\\n    --model_path outputs\u002Fopd\u002Ffinal \\\n    --jsonl data\u002Fsft\u002Fcounting\u002Fcounting_data.jsonl \\\n    --image_root data\u002Fcoco\u002Fval \\\n    --max_samples 10\n```\n\n### Inference from Python\n\n```python\nimport torch\nfrom PIL import Image\nfrom model import VisualPrimitiveVLM\nfrom transformers import AutoProcessor\n\nmodel = VisualPrimitiveVLM.from_pretrained(\"yunfengwang\u002FTVP-OPD-Qwen2VL-2B\", device_map=\"cuda\")\nmodel.eval()\ntokenizer = model.tokenizer\nprocessor = AutoProcessor.from_pretrained(\n    model.base_model_path, trust_remote_code=True\n)\n\nimage = Image.open(\"your_image.jpg\").convert(\"RGB\")\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": [\n        {\"type\": \"image\", \"image\": \"your_image.jpg\"},\n        {\"type\": \"text\", \"text\": \"Locate the cat in the image.\"},\n    ]},\n]\ntext = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\ninputs = processor(text=[text], images=[image], return_tensors=\"pt\", padding=True)\ninputs = {k: v.to(model.vlm.device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}\n\nwith 
torch.no_grad():\n    output_ids = model.vlm.generate(**inputs, max_new_tokens=256, do_sample=False)\n\nnew_tokens = output_ids[:, inputs[\"input_ids\"].shape[1]:]\nresponse = tokenizer.batch_decode(new_tokens, skip_special_tokens=False)[0]\nprint(response)\n```\n\n## Full Reproduction\n\n### Step 1: Prepare Data\n\n```bash\npython scripts\u002Fprepare_all_data.py \\\n    --output_dir data \\\n    --coco_split val \\\n    --coco_subset 5000 \\\n    --num_counting 2000 \\\n    --num_spatial 2000 \\\n    --num_maze 5000 \\\n    --num_path 3000\n```\n\nThis downloads COCO 2017 val (~1GB) and generates all training data:\n\n```\ndata\u002F\n├── coco\u002Fval\u002Fimages\u002F                    # COCO images\n├── pretrain\u002Fgrounding.jsonl            # Pretrain grounding data (~14K)\n├── sft\u002F\n│   ├── counting\u002Fcounting_data.jsonl    # Counting with boxes (2K)\n│   ├── spatial\u002F                        # CLEVR-style spatial reasoning (2K)\n│   ├── maze\u002F                           # Procedural mazes (5K)\n│   ├── path\u002F                           # Path tracing (3K)\n│   └── grounding\u002Fsft_grounding.jsonl   # Grounding with negatives (10K)\n```\n\nThen generate the SFT grounding data with negative samples:\n\n```bash\npython scripts\u002Fgenerate_sft_grounding_data.py \\\n    --coco_jsonl data\u002Fpretrain\u002Fgrounding.jsonl \\\n    --image_root data\u002Fcoco\u002Fval \\\n    --output data\u002Fsft\u002Fgrounding\u002Fsft_grounding.jsonl \\\n    --neg_ratio 0.30\n```\n\n### Step 2: Pretraining\n\nTeaches the base model to output visual primitive tokens (`\u003C|box|>`, `\u003C|point|>`, `\u003C|ref|>`).\n\n```bash\npython pretraining\u002Ftrain_pretrain.py \\\n    --config configs\u002Fpretrain_12g.yaml \\\n    --output_dir outputs\u002Fpretrain\n```\n\n| Config | Value |\n|--------|-------|\n| Base model | Qwen\u002FQwen2-VL-2B-Instruct |\n| LoRA | r=16, alpha=32 |\n| Epochs | 3 |\n| Effective batch | 2 × 8 = 16 |\n| ~Time (12GB GPU) | ~1 
hour |\n\n### Step 3: Specialized SFT\n\nTrain two expert models from the pretrained checkpoint:\n\n```bash\n# Box expert (grounding, counting, spatial)\npython sft\u002Ftrain_sft_box.py \\\n    --config configs\u002Fsft_box_12g.yaml \\\n    --output_dir outputs\u002Fsft_box\n\n# Point expert (maze navigation, path tracing)\npython sft\u002Ftrain_sft_point.py \\\n    --config configs\u002Fsft_point_12g.yaml \\\n    --output_dir outputs\u002Fsft_point\n```\n\n| Config | Box Expert | Point Expert |\n|--------|-----------|-------------|\n| Data | 10K grounding (7K pos + 3K neg) | Maze + Path |\n| LoRA | r=64, alpha=128 | r=64, alpha=128 |\n| Epochs | 5 | 5 |\n| ~Time | ~2.5 hours | ~2.5 hours |\n\n### Step 4: On-Policy Distillation (OPD)\n\nDistill both experts into a single unified model using forward KL divergence with task-adaptive teacher routing:\n\n```bash\nPYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\\npython unified\u002Ftrain_opd.py \\\n    --config configs\u002Fopd_12g.yaml \\\n    --output_dir outputs\u002Fopd\n```\n\n| Config | Value |\n|--------|-------|\n| Student | outputs\u002Fpretrain\u002Ffinal |\n| Teachers | sft_box\u002Ffinal + sft_point\u002Ffinal |\n| Loss | Forward KL + CE (ce_coeff=0.5) |\n| Temperature | 1.5 |\n| Routing | Task-adaptive (box tasks→box teacher, point tasks→point teacher) |\n| Epochs | 3 |\n| ~Time | ~1.5 hours |\n\n### Step 5: Evaluate\n\n```bash\n# Evaluate on counting\npython evaluation\u002Frun_eval.py \\\n    --model_path outputs\u002Fopd\u002Ffinal \\\n    --task counting \\\n    --data_path data\u002Fsft\u002Fcounting\u002Fcounting_data.jsonl \\\n    --image_root data\u002Fcoco\u002Fval \\\n    --output outputs\u002Fopd\u002Feval_counting.json\n\n# Evaluate on maze\npython evaluation\u002Frun_eval.py \\\n    --model_path outputs\u002Fopd\u002Ffinal \\\n    --task maze \\\n    --data_path data\u002Fsft\u002Fmaze\u002Fmaze_data.jsonl \\\n    --output outputs\u002Fopd\u002Feval_maze.json\n\n# Visual comparison across 
all stages\npython scripts\u002Fcompare_models.py\n```\n\nSupported tasks: `counting`, `spatial`, `maze`, `path`, `all`\n\n## Visual Primitives Format\n\nCoordinates are normalized integers in `[0, 999]`:\n\n```\n# Bounding box\n\u003C|ref|>cat\u003C|\u002Fref|>\u003C|box|>[[x1,y1,x2,y2]]\u003C|\u002Fbox|>\n\n# Multiple boxes\n\u003C|ref|>person\u003C|\u002Fref|>\u003C|box|>[[130,50,400,800],[500,60,750,790]]\u003C|\u002Fbox|>\n\n# Point\n\u003C|point|>[[x,y]]\u003C|\u002Fpoint|>\n\n# Point sequence (path\u002Fmaze)\n\u003C|point|>[[100,200],[150,250],[200,300]]\u003C|\u002Fpoint|>\n```\n\n## Project Structure\n\n```\n├── configs\u002F                    # Training configs (*_12g.yaml for 12GB GPUs)\n├── model\u002F\n│   ├── vl_model.py             # VisualPrimitiveVLM wrapper (PEFT, quantization)\n│   ├── special_tokens.py       # Visual primitive token definitions\n│   ├── spatial_compression.py  # 3×3 spatial compression module\n│   └── vision_projector.py     # Vision-language projector\n├── data\u002F\n│   ├── datasets_pretrain.py    # Pretrain dataset\n│   ├── datasets_sft.py         # SFT dataset (JSONL-based)\n│   ├── collators.py            # Conversation collator with assistant-only masking\n│   └── transforms.py           # Image transforms\n├── pretraining\u002F\n│   └── train_pretrain.py\n├── sft\u002F\n│   ├── train_sft_box.py        # Box expert SFT\n│   └── train_sft_point.py      # Point expert SFT\n├── rl\u002F\n│   ├── grpo_trainer.py         # GRPO trainer\n│   ├── reward_models.py        # Format\u002FQuality\u002FAccuracy reward models\n│   ├── train_rl_box.py         # Box expert RL (optional)\n│   └── train_rl_point.py       # Point expert RL (optional)\n├── unified\u002F\n│   ├── train_opd.py            # On-Policy Distillation\n│   ├── train_rft.py            # Rejection Fine-Tuning (optional)\n│   └── generate_rft_data.py\n├── evaluation\u002F\n│   ├── run_eval.py             # Unified evaluation entry point\n│   └── metrics.py            
  # Task-specific metrics\n├── scripts\u002F\n│   ├── prepare_all_data.py     # One-command data preparation\n│   ├── compare_models.py       # Visual comparison across stages\n│   ├── inference_demo.py       # Inference demo\n│   ├── generate_maze_data.py   # Procedural maze generation\n│   ├── generate_path_data.py   # Path tracing data generation\n│   └── regenerate_data.py      # Regenerate data with correct normalization\n└── utils\u002F\n    ├── visualization.py        # Draw boxes\u002Fpoints on images\n    ├── coco_categories.py      # COCO category ID→name mapping\n    ├── checkpoint.py           # Save\u002Fload checkpoints\n    └── logging.py\n```\n\n## VRAM Guide\n\n| GPU VRAM | Config suffix | Key settings |\n|----------|---------------|-------------|\n| 24GB | `*.yaml` | batch=2, LoRA r=128, image_size=448 |\n| 16GB | — | batch=1, LoRA r=64, image_size=384 |\n| **12GB** | `*_12g.yaml` | batch=1, LoRA r=64, image_size=336, max_length=1024 |\n\nFor 12GB GPUs, the collator pre-resizes images to 336×336 before the VL processor, capping visual tokens at ~576 (vs ~1500 at native resolution). 
Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to reduce fragmentation.\n\nFor OPD (3 models loaded simultaneously), teachers use 4-bit quantization automatically.\n\n## Training Pipeline Details\n\n### Pretraining\n- **Data**: COCO detection annotations → `\u003C|ref|>label\u003C|\u002Fref|>\u003C|box|>[[x1,y1,x2,y2]]\u003C|\u002Fbox|>`\n- **Training**: LoRA on LLM, freeze ViT\n- **Loss**: Standard cross-entropy (next-token prediction)\n\n### Specialized SFT\nTwo experts with structured thinking templates:\n\n| Expert | Tasks | Primitive | Thinking Template |\n|--------|-------|-----------|-------------------|\n| Box (FTwG) | Counting, Spatial, Grounding | `\u003C\\|box\\|>` | Intent → Grounding → Conclusion |\n| Point (FTwP) | Maze, Path Tracing | `\u003C\\|point\\|>` | DFS exploration \u002F waypoint sequence |\n\n### On-Policy Distillation\n- **Forward KL** with temperature scaling: `D_KL(teacher ‖ student)`\n- **Task-adaptive routing**: each sample goes to its relevant expert teacher only\n- **CE regularization**: prevents catastrophic forgetting (ce_coeff=0.5)\n\n### Optional: RL with GRPO\nThe RL stage uses Group Relative Policy Optimization with three reward models:\n\n| Reward Model | Type | Description |\n|-------------|------|-------------|\n| Format RM | Rule-based | Validates `\u003C\\|box\\|>`\u002F`\u003C\\|point\\|>` syntax |\n| Quality RM | Heuristic\u002FLLM | Checks redundancy, self-contradiction |\n| Accuracy RM | Task-specific | Counting: exponential error; Maze: multi-component; Path: bidirectional trajectory distance |\n\n## Citation\n\nIf you find this repository useful, please cite our implementation and the original project:\n\n```bibtex\n@software{wang2026tvp_pytorch,\n  title={Thinking with Visual Primitives — PyTorch Implementation},\n  author={Wang, Yunfeng},\n  url={https:\u002F\u002Fgithub.com\u002Fvra\u002FThinking-with-Visual-Primitives-pytorch},\n  year={2026}\n}\n```\n\nOriginal project: 
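[mitkox\u002FThinking-with-Visual-Primitives](https:\u002F\u002Fgithub.com\u002Fmitkox\u002FThinking-with-Visual-Primitives)\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n","This project is an unofficial PyTorch implementation of DeepSeek's Thinking with Visual Primitives. It uses a multi-stage training pipeline to teach multimodal large language models to perform visual reasoning with bounding boxes and points as \"thought units\". Its core stages are pretraining, expert fine-tuning, and on-policy distillation, and the model outputs structured reasoning with embedded coordinates. LoRA fine-tuning is used to verify the feasibility of the whole pipeline. The project suits applications that need stronger image understanding and interaction, such as image recognition and localization tasks.","2026-06-11 03:58:57","CREATED_QUERY"]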