[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80839":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":13,"stars7d":15,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":15,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},80839,"llm-vision-for-scientists","bnsreenu\u002Fllm-vision-for-scientists","bnsreenu",null,"Jupyter Notebook",39,16,36,1,0,3,44.49,"MIT License",false,"main",true,[],"2026-06-12 04:01:30","# LLM Vision for Scientists\n\nA growing collection of tools and tutorials showing how large language models and vision-language models can be applied to real scientific image analysis workflows. 
Built alongside the *Applied LLMs for Scientists* video series on [DigitalSreeni](https:\u002F\u002Fwww.youtube.com\u002F@DigitalSreeni).\n\n## Video Playlist\n\n[Applied LLMs for Scientists — YouTube Playlist](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLZsOBAyNTZwYhwXhL8rqruLK_3mbf-CTX)\n\n---\n\n## Series Overview\n\n| Video | Topic | Code |\n|-------|-------|------|\n| 1 | Conceptual overview of LLM-assisted image annotation (slides only) | — |\n| 2 | Text-prompted object detection with Grounding DINO | `01_grounding_dino_bboxes.ipynb` |\n| 3 | Grounding DINO + SAM 2 segmentation pipeline | `02_dino_plus_sam2_masks.ipynb` |\n| 4 | Interactive annotation GUI (Grounding DINO + SAM 2) | `annotation_tool_v4.py` |\n| 5 | Fine-tuning Grounding DINO on scientific images | `finetune_gdino.py` |\n| 6 | Literature-informed object detection using RAG | `rag_literature_tool.py` |\n| 7 | SAM3 vs Grounding DINO + SAM 2: a three-way comparison | `annotation_tool_v5.py` |\n\n---\n\n## Repository Contents\n\n| File | Description |\n|------|-------------|\n| `01_grounding_dino_bboxes.ipynb` | Grounding DINO text-prompted detection: loads the model, runs inference, visualises bounding boxes with confidence scores. |\n| `02_dino_plus_sam2_masks.ipynb` | Extends notebook 1 by feeding detected boxes into SAM 2 to produce per-object segmentation masks. |\n| `annotation_tool_v4.py` | Annotation GUI (DINO + SAM 2): per-class thresholds, multi-phrase prompts, manual correction by point click, mask export, built-in COCO merge tool. |\n| `annotation_tool_v5.py` | Extended annotation GUI adding SAM3 as a second detection backend. Switch between DINO+SAM2 and SAM3 from a dropdown. Includes a prominent model status banner showing which backend is active. |\n| `finetune_gdino.py` | Fine-tuning GUI: adapts Grounding DINO to your domain using annotated COCO data, with live loss curves and a before\u002Fafter comparison tab. 
|\n| `rag_literature_tool.py` | Fully local RAG pipeline: upload scientific PDFs, ask questions, and use retrieved context to guide object detection. No API keys or internet required after setup. |\n| `download_models.py` | Downloads all model weights to a local folder before running offline. |\n\n---\n\n## Background\n\n### Grounding DINO\n\nGrounding DINO (Liu et al., 2023) is an open-set object detector that accepts free-form text rather than a fixed category list. It fuses a Swin Transformer visual backbone with a BERT-style text encoder, detecting any object described in natural language. A prompt like `\"glomerulus . renal glomerulus . small circular structure .\"` is enough to attempt detection without retraining.\n\n> Liu, S. et al. (2023). *Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.* arXiv:2303.05499.\n\nModels: `IDEA-Research\u002Fgrounding-dino-base`, `IDEA-Research\u002Fgrounding-dino-tiny`\n\n### SAM 2\n\nSAM 2 (Ravi et al., 2024) is Meta AI's second-generation segmentation model. Given a bounding box or a point click, it produces a high-quality binary mask. Used here in two modes: box-prompted (after Grounding DINO detection) and point-prompted (for manual correction clicks).\n\n> Ravi, N. et al. (2024). *SAM 2: Segment Anything in Images and Videos.* arXiv:2408.00714.\n\nModels: `facebook\u002Fsam2.1-hiera-small`, `facebook\u002Fsam2.1-hiera-base-plus`, `facebook\u002Fsam2.1-hiera-tiny`\n\n### SAM3\n\nSAM3 (Meta AI, 2025) is a unified vision-language segmentation model that accepts text prompts and produces segmentation masks directly, combining detection and segmentation in a single pass. 
It was evaluated here against the DINO+SAM2 pipeline on kidney histology images (H&E and IHC) under three conditions: SAM3 zero-shot, DINO+SAM2 zero-shot, and fine-tuned DINO+SAM2.\n\nModel: `facebook\u002Fsam3.1` — weights at `sam3.1_multiplex.pt` (~3.5 GB), runs on 16 GB VRAM via the Ultralytics API.\n\n### RAG-based Literature-Informed Detection\n\nThe RAG tool (video 6) adds a local knowledge base of scientific PDFs to the annotation workflow. PDFs are parsed with PyMuPDF, chunked, embedded with `all-MiniLM-L6-v2`, and stored in ChromaDB. At query time, relevant passages are retrieved and passed to Llama 3.2 (via Ollama) to synthesise detection guidance. The resulting text prompts are fed into Grounding DINO, grounding detection in domain literature rather than generic descriptions.\n\n---\n\n## Tool Highlights\n\n### Annotation Tool (v5)\n\nTwo detection backends selectable from a single dropdown:\n\n**DINO + SAM2:** Grounding DINO for text-prompted bounding-box detection, SAM 2 for mask segmentation. Supports per-class thresholds, multi-phrase prompts, and fine-tuned DINO checkpoints loaded via Browse. Manual Add clicks use SAM 2 point-prompt segmentation.\n\n**SAM3:** Meta's SAM 3.1 unified model. One model handles text-prompted detection and segmentation in a single pass. No DINO or SAM 2 needed.\n\nBoth backends share the same class table, phrase editor, Add\u002FDelete correction mode, mask overlay toggle (M key), and COCO export. A prominent model status banner always shows which backend and model variant are active, which matters when switching between backends during annotation sessions.\n\n### Fine-Tuning Tool\n\nFreezes the Swin Transformer backbone and fine-tunes only the Grounding DINO detection head, which keeps training stable on small datasets. Training on 20 annotated images takes under 5 minutes on an RTX 4000 Ada and reaches Val F1@IoU50 = 0.70 on kidney histology. 
Reads class names and phrases directly from `train.json` — nothing is hardcoded.\n\n### RAG Literature Tool\n\nFully local — no API keys, no cloud calls after initial model download. Streaming architecture processes one PDF at a time with a 200K character \u002F 500 chunk cap per file. Retrieved passages are displayed alongside the generated detection prompt so you can see exactly which literature guided the result.\n\n---\n\n## Full Annotation Pipeline\n\n```\nAnnotate images            Merge annotations       Fine-tune model\nannotation_tool_v5.py  →   Tools > Merge COCO  →   finetune_gdino.py\n        |                          |                        |\n per-image COCO JSONs        train.json              best_checkpoint\u002F\n binary mask PNGs            val.json                final_checkpoint\u002F\n                                                           |\n                                           Load back into annotation_tool_v5.py\n                                           via Browse button (DINO+SAM2 backend)\n```\n\n---\n\n## Installation\n\n### 1. Clone the repository\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fbnsreenu\u002Fllm-vision-for-scientists\ncd llm-vision-for-scientists\n```\n\n### 2. Install Python dependencies\n\nPython 3.10 or later is recommended.\n\n```bash\npip install PyQt5 torch torchvision transformers accelerate pillow numpy matplotlib\n```\n\nFor GPU acceleration (strongly recommended):\n\n```bash\npip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\n```\n\nFor SAM3 support (video 7 onwards):\n\n```bash\npip install ultralytics timm\n```\n\nFor the RAG literature tool (video 6):\n\n```bash\npip install pymupdf sentence-transformers chromadb\n# Also install Ollama and pull llama3.2: https:\u002F\u002Follama.com\n```\n\n### 3. Download the models\n\nBy default the application expects models in `C:\\hf_models\\`. 
Edit `MODEL_BASE` at the top of any script to change this.\n\n```bash\npython download_models.py\n```\n\nThis populates:\n\n```\nC:\\hf_models\\\n    grounding-dino-base\\\n    grounding-dino-tiny\\\n    sam2-hiera-small\\\n    sam2-hiera-base-plus\\\n    sam2-hiera-tiny\\\n    sam3.1\\               # for video 7 — download separately from Meta\n```\n\n---\n\n## Running the Tools\n\n```bash\n# Annotation tool (DINO+SAM2 and SAM3 backends)\npython annotation_tool_v5.py\n\n# Fine-tuning tool\npython finetune_gdino.py\n\n# RAG literature tool\npython rag_literature_tool.py\n```\n\n---\n\n## Hardware Requirements\n\n| Component | Minimum | Recommended |\n|-----------|---------|-------------|\n| RAM | 8 GB | 16 GB |\n| GPU VRAM | 4 GB (CPU fallback) | 16–20 GB (SAM3 needs ~16 GB) |\n| Storage | 10 GB (all models) | — |\n\nDINO+SAM2 mode moves models to CPU between passes to minimise VRAM use. SAM3 keeps the unified model on GPU throughout.\n\n---\n\n## Workflow Tips\n\n**Low detection confidence on scientific images:** drop the box threshold to 0.05–0.10 in DINO+SAM2 mode. Fine-tuning raises effective confidence for your target class after training on as few as 20 images.\n\n**SAM3 on histology images:** SAM3 was trained on natural images and struggles with domain-specific structures like glomeruli under zero-shot conditions. The DINO+SAM2 pipeline with fine-tuning outperforms SAM3 on specialised scientific datasets — this is demonstrated in video 7.\n\n**Multi-phrase prompts:** a class called `glomerulus` might use additional phrases like `\"renal glomerulus\"` or `\"small circular structure in kidney cortex\"`. All phrases run in a single detection pass, improving recall on visually ambiguous structures.\n\n**Fine-tuning on a small dataset:** 20 images is enough to see improvement. Keep the backbone frozen (default), use batch size 1, and run 20 to 30 epochs. 
Watch the val F1 curve — if it plateaus before 20 epochs, training has converged.\n\n**RAG prompts:** upload papers that describe the structure you want to detect. The tool retrieves the most relevant passages and generates detection phrases grounded in the literature rather than generic descriptions.\n\n---\n\n## Acknowledgements\n\n- Grounding DINO: IDEA Research — [github.com\u002FIDEA-Research\u002FGroundingDINO](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO)\n- SAM 2: Meta AI — [github.com\u002Ffacebookresearch\u002Fsam2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsam2)\n- SAM3: Meta AI — [github.com\u002Ffacebookresearch\u002Fsam3](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsam3)\n- Hugging Face Transformers for model hosting and the unified inference API\n- Ultralytics for the SAM3 inference API\n\n---\n\n## License\n\nReleased for educational and research use. Model weights are subject to their respective licenses (Apache 2.0 for Grounding DINO; Apache 2.0 for SAM 2; see Meta's license for SAM3).\n\n---\n\n*Created by [DigitalSreeni](https:\u002F\u002Fwww.youtube.com\u002F@DigitalSreeni) — teaching Python and AI to scientists*\n","This project shows how large language models and vision-language models can be applied to real scientific image analysis workflows. Core features include text-prompted object detection, object segmentation, and an interactive annotation tool, built mainly on models such as Grounding DINO and SAM 2. It provides tutorials and code examples ranging from basic concepts to advanced applications, such as identifying objects in specific scientific images from natural-language descriptions, and it supports fine-tuning models to fit the needs of a particular research field. It is well suited to researchers who work with complex image data and to technical enthusiasts who want to explore the potential of AI in scientific research.",2,"2026-06-11 04:02:31","CREATED_QUERY"]