[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83417":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":41,"readmeContent":42,"aiSummary":10,"trendingCount":15,"starSnapshotCount":15,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},83417,"VLM-AutoYOLO","Somnusochi\u002FVLM-AutoYOLO","Somnusochi","AI Auto Annotation & YOLO Training Pipeline, End-to-end object detection auto-labeling and YOLO training platform. VLM-powered annotation with NVIDIA LocateAnything-3B, manual refinement, one-click YOLO training, video keyframe extraction, and model validation. 
Supports image and video.","",null,"Python",113,12,58,0,5,48,24,78.14,"GNU Affero General Public License v3.0",false,"master",true,[25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40],"auto-labeling","computer-vision","data-annotation","deep-learning","fastapi","locate-anything","machine-learning","nvidia","object-detection","pytorch","react","ultralytics","video-annotation","vlm","yolo","yolo-training","2026-06-12 04:01:41","# VLM-AutoYOLO\n\n[简体中文](README_ZH.md) | English\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-AGPL%20v3-blue.svg\" alt=\"License\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.12+-blue\" alt=\"Python\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNode.js-22+-green\" alt=\"Node.js\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPlatform-macOS%20%7C%20Windows%20%7C%20Linux-lightgrey\" alt=\"Platform\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGPU-MPS%20%7C%20CUDA-orange\" alt=\"GPU\">\n  \u003Ca href=\"mailto:somnusochi@gmail.com\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOpen_to_Work-🤝-brightgreen?style=flat\" alt=\"Open to Work\">\u003C\u002Fa>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSomnusochi\u002FVLM-AutoYOLO?style=social\" alt=\"Stars\">\n\u003C\u002Fp>\n\n```\n🖼️ image\u002Fvideo → 🔍 VLM \u002F SAM3 detection → 🎯 SAM2\u002FSAM3 mask → ✏️ refine → 📦 export → 🚀 YOLO → ✅ model\n```\n\n**Images or videos in → YOLO model out**, with VLM auto-labeling (LocateAnything-3B), SAM2.1 \u002F SAM3 mask refinement, and human-in-the-loop correction. 
Multi-format export, one-click YOLO training (detect & segment), video keyframe extraction, and model validation — all GPU-accelerated on macOS MPS and Windows\u002FLinux CUDA.\n\n![Architecture](docs\u002Farchitecture_en.png)\n\n> See [Architecture & Workflow Documentation](docs\u002Farchitecture_diagram_en.md) for detailed Mermaid diagrams.\n\n## Key Features\n- 🤖 **VLM auto-labeling**: Open-vocabulary object detection with LocateAnything-3B\n- 🎯 **SAM2 \u002F SAM3 segmentation**: Bbox → pixel-precise mask with SAM 2.1 or SAM3 text-driven detection+segmentation in one pass, BBox\u002FMask toggle on canvas\n- 🎥 **Video annotation**: Intelligent keyframe extraction (scene \u002F motion \u002F interval), SSIM dedup\n- ✏️ **Manual refinement**: Canvas draw mode, NMS filtering, hide\u002Fshow individual boxes\n- 📦 **Multi-format export**: YOLO, YOLO-Seg, COCO JSON, Pascal VOC XML, CreateML JSON\n- 🚀 **One-click training**: YOLOv8 \u002F v11 \u002F v26, detect & segment, real-time SSE progress\n- ✅ **Model validation**: Batch image \u002F video testing, MJPEG live stream, SSE video inference\n- 💾 **Smart model management**: Lazy loading, idle auto-unload, MPS\u002FCUDA strategy pattern cleanup\n- 🌐 **i18n**: English \u002F 简体中文 \u002F 日本語 · 🎨 **Theme**: Light \u002F dark mode\n\n## Documentation\n\n📚 **[User Guide (English)](docs\u002Fguide\u002Fen\u002FREADME.md)** | 📚 **[用户指南 (中文)](docs\u002Fguide\u002FREADME.md)**\n\nComprehensive guides: quick start, annotation best practices, training parameter tuning, model deployment.\n\n## Screenshots\n\n| VLM Pre-annotation & Refinement | YOLO Training |\n|--------------------------------|---------------|\n| ![VLM pre-annotation and refinement](docs\u002F1.png) | ![YOLO training](docs\u002F2.png) |\n\n| Video Keyframe Entry | Model Validation |\n|---------------------|-----------------|\n| ![Video keyframe entry](docs\u002F4.png) | ![Model validation](docs\u002F3.png) |\n\n## Tech Stack\n\n| Layer | Technology 
|\n|-------|-----------|\n| Visual Grounding | NVIDIA LocateAnything-3B (Qwen2.5-3B + MoonViT) |\n| Segmentation | SAM 2.1 \u002F SAM3 — Segment Anything Model 2 \u002F 3 |\n| Object Detection | YOLOv8 \u002F v11 \u002F v26 — Detect & Segment (Ultralytics) |\n| Backend | Python FastAPI + PostgreSQL + SSE |\n| Frontend | React + TypeScript + Vite + Tailwind CSS + antd |\n| GPU Memory | Strategy Pattern (`gpu_memory.py`) — CUDA expandable segments \u002F MPS synchronize + empty_cache |\n| State | Zustand + TanStack Query + ahooks |\n| i18n | i18next (English \u002F 简体中文 \u002F 日本語) |\n| Video | ffmpeg (scene \u002F motion \u002F interval extraction) |\n| Tooling | pnpm, ESLint, Prettier, Husky, commitlint, Playwright |\n\n## Quick Start\n\n### Docker Deployment\n\n> **Requirements:** Linux or Windows (WSL2) with NVIDIA GPU + [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html).\n> **macOS is not supported** — Docker on Mac has no GPU passthrough. 
Use [Manual Setup](#manual-setup) instead.\n\n**Quick start with pre-built images:**\n\n```bash\ncurl -O https:\u002F\u002Fraw.githubusercontent.com\u002FSomnusochi\u002FVLM-AutoYOLO\u002Fmaster\u002Fdocker-compose.yml\ndocker compose up -d\n# Then open these URLs in a browser (the macOS-only `open` command is not available on Linux \u002F WSL2):\n#   http:\u002F\u002Flocalhost            Frontend\n#   http:\u002F\u002Flocalhost:8000\u002Fdocs  API docs\n```\n\n**Build from source:**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FSomnusochi\u002FVLM-AutoYOLO.git\ncd VLM-AutoYOLO\ndocker compose up -d --build\n```\n\n**Services:**\n\n| Service | Port | Description |\n|---------|------|-------------|\n| Frontend | 80 | React web UI (Nginx) |\n| Backend | 8000 | FastAPI server |\n| SAM3 | 8002 | SAM3 standalone inference service |\n| Database | 5432 | PostgreSQL |\n\n**GPU Support** — add to `docker-compose.yml`:\n\n```yaml\nbackend:\n  deploy:\n    resources:\n      reservations:\n        devices:\n          - driver: nvidia\n            count: 1\n            capabilities: [gpu]\n  environment:\n    DEVICE: cuda\n```\n\n**Persistent Storage (Docker volumes):**\n- `pgdata` — Database · `model-cache` — VLM, SAM2 & SAM3 models · `uploads` — User images\u002Fvideos · `training-data` — YOLO training outputs\n\n**Backup \u002F Restore:**\n\n```bash\ndocker compose exec db pg_dump -U postgres autolabeling > backup.sql\ncat backup.sql | docker compose exec -T db psql -U postgres autolabeling\n```\n\n### Manual Setup\n\n**Requirements:**\n\n| Resource | Minimum | Recommended |\n|----------|---------|-------------|\n| Python | 3.12+ | 3.12+ |\n| Node.js | 22+ | 22+ |\n| PostgreSQL | 16+ | 16+ |\n| ffmpeg | Any | — |\n| macOS | Apple Silicon 16GB | 24GB+ |\n| NVIDIA GPU | 12GB VRAM | 16GB+ |\n\n**Setup:**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FSomnusochi\u002FVLM-AutoYOLO.git\ncd VLM-AutoYOLO\n\n# Backend\ncd backend\npython3 -m venv .venv\nsource .venv\u002Fbin\u002Factivate  # Windows: .venv\\Scripts\\activate\npip install -r requirements.txt\ncd ..\n\n# 
Frontend\ncd frontend\npnpm install\ncd ..\n\n# Database (PostgreSQL recommended, but SQLite is supported out of the box)\n# If using PostgreSQL:\n# psql -d postgres -c \"CREATE DATABASE autolabeling;\"\n# cp backend\u002F.env.example backend\u002F.env\n# If you prefer a zero-setup SQLite database, just skip the two steps above. The system will auto-generate autolabeling.db\n\n# Migrations\ncd backend\nPYTHONPATH=. alembic upgrade head\n```\n\n**Pre-download models (optional):**\n\n```bash\nhuggingface-cli download nvidia\u002FLocateAnything-3B --local-dir backend\u002Fmodel\n```\n\n**Launch:**\n\n```bash\n.\u002Fstart.sh   # macOS \u002F Linux\nstart.bat    # Windows\n```\n\n| Service | URL |\n|---------|-----|\n| Frontend | http:\u002F\u002Flocalhost:5173 |\n| Backend | http:\u002F\u002Flocalhost:8000 |\n| API Docs | http:\u002F\u002Flocalhost:8000\u002Fdocs |\n\n## Project Structure\n\nFull directory tree: **[docs\u002FSTRUCTURE.md](docs\u002FSTRUCTURE.md)**\n\n## Features\n\n### VLM Pre-annotation\n\nUpload images or video keyframes with open-vocabulary descriptions (e.g. `fire, smoke`, `red car`). 
LocateAnything-3B automatically detects and draws bounding boxes.\n\n- Open-vocabulary natural language descriptions\n- Auto-resize by long-side cap (VRAM-based: 800–1333px)\n- Batch upload folders or video keyframes, streaming results\n\n### SAM2 Segmentation\n\nEnable SAM2 (Segment Anything Model 2) to refine VLM bounding boxes into pixel-precise masks.\n\n- Check \"Enable SAM2 Segmentation\" before detection — runs automatically after VLM\n- SAM 2.1 model (base+), lazy-loaded with idle auto-unload\n- Score threshold slider for mask quality filtering\n- Masks rendered as semi-transparent overlays on canvas\n- BBox and Mask independently toggled on both main canvas and hover preview\n- Result table shows polygon vertex count per box\n\n### SAM3 Detection + Segmentation\n\nSwitch to SAM3 mode for text-driven detection and segmentation in a single pass — no VLM required.\n\n- Toggle between VLM+SAM2 and SAM3 via the model selector in the sidebar\n- Enter open-vocabulary text prompts (e.g. `cat`, `red car`) — SAM3 detects and segments all matching instances\n- **Confidence threshold** slider (0.0–1.0, default 0.5) controls detection sensitivity\n- **Mask threshold** slider (0.0–1.0, default 0.5) controls mask tightness\n- Enable\u002Fdisable segmentation independently — bbox-only mode skips mask extraction for faster results\n- SAM3 runs as a standalone HTTP service on port 8002 with its own venv (`backend\u002Fsam3-venv\u002F`)\n- **Requires `HF_TOKEN`** — set this env var before starting the backend. Two steps:\n  1. Open [huggingface.co\u002Ffacebook\u002Fsam3](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam3) in browser, click **\"Agree and access repository\"**\n  2. 
Create a **Read** token at [huggingface.co\u002Fsettings\u002Ftokens](https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens) (no need for Fine-grained — a plain Read token inherits your account's permissions)\n  Model cached in `~\u002F.cache\u002Fhuggingface\u002Fhub\u002F` after first download\n- Auto-starts on first use, idle auto-unload after 10 min\n- Real-time loading status via SSE (`starting` → `loading` → `loaded`)\n- Manual unload button to free GPU memory\n- Backend auto-switches: using SAM3 unloads VLM\u002FSAM2, and vice versa\n- Detection records tagged with `model_type` (VLM \u002F VLM+SAM2 \u002F SAM3) for traceability\n\n### Video Annotation\n\nUpload a video, extract keyframes, select and batch-annotate.\n\n- **Three extraction modes**: scene change, motion detection (optical flow), fixed interval\n- **SSIM deduplication**: auto-removes near-duplicate frames\n- **Timeline preview**: horizontal scrollable strip, click for full-size view\n- **Multi-select**: check frames, select\u002Fcancel all, load to annotation queue\n\n### Manual Annotation\n\nCanvas-based annotation with View \u002F Draw modes.\n\n- Category quick-fill from history\n- VLM pre-annotation baseline → delete mistakes → draw missing boxes\n- All \u002F Best \u002F NMS filter modes, settings saved per detection\n- Hide individual boxes while inspecting dense results\n- Per-frame re-detection\n\n### History Management\n\n- Thumbnail + category tag previews, tag-based multi-select filtering\n- Click to view details, re-detect with updated labels, frontend pagination\n- Single \u002F batch export in **5 formats**: YOLO, YOLO-Seg, COCO JSON, Pascal VOC XML, CreateML JSON\n- Format selection via dropdown menu, one-click zip download\n\n### YOLO Training\n\n- **Series**: YOLOv8 \u002F v11 \u002F v26 (n\u002Fs\u002Fm\u002Fl\u002Fx)\n- **Task types**: Object Detection (Detect), Instance Segmentation (Segment)\n- Segmentation training auto-uses SAM2 polygon labels; falls back to bbox 
when unavailable\n- Tag filter + thumbnail preview for precise data selection\n- Dataset split presets (70\u002F20\u002F10, 80\u002F20, 90\u002F10, 60\u002F20\u002F20)\n- Real-time SSE progress: Epoch \u002F Loss \u002F mAP50\n- Auto ONNX export; download PT \u002F ONNX \u002F dataset zip\n\n### Model Validation\n\n- **Dual source**: trained models or externally uploaded `.pt` files\n- **Conf \u002F IoU sliders** for real-time threshold tuning\n- **Batch image validation** with bounding boxes and confidence scores\n- **Video validation** (three modes):\n  - MJPEG live stream with interactive play\u002Fpause\n  - SSE prediction stream with per-frame JSON events\n  - Sync batch prediction — all frames at once\n- Temporary results; export predictions as YOLO `.txt` files\n\n### Model Management\n\n- **Lazy loading**: VLM, SAM2, and SAM3 load on first use, unload after idle (default 10 min)\n- **Idle watchdog**: all three models auto-unload after `MODEL_IDLE_TIMEOUT_SECONDS` of inactivity\n- **Unified SSE status**: `GET \u002Fapi\u002Fv1\u002Fmodel\u002Fevents` streams VLM, SAM2, SAM3 status in one connection\n- **Manual unload**: each model has its own unload button and API endpoint\n- **GPU memory**: Strategy Pattern (`gpu_memory.py`) — CUDA `expandable_segments` \u002F MPS `synchronize`+`empty_cache`+`gc`\n\n## API Reference\n\nFull API documentation with request\u002Fresponse examples: **[docs\u002FAPI.md](docs\u002FAPI.md)**\n\n## Cross-Platform\n\n| Platform | Inference | Training |\n|----------|-----------|----------|\n| macOS (Apple Silicon) | MPS | MPS |\n| Linux \u002F Windows (NVIDIA) | CUDA | CUDA |\n\nAuto-detection: CUDA → MPS. Override via `DEVICE` env. 
**CPU not supported.**\n\n## Inference Benchmarks\n\nTested locally on an **Apple MacBook Pro (M4 Pro, 24GB Unified Memory)** using Apple MPS hardware acceleration.\n\n| Image Resolution (Max Side) | Inference Latency | Actual Memory Footprint |\n| :--- | :--- | :--- |\n| **Thumbnail (256px)** | `~0.68s` | Stable around `~11.8GB` |\n| **High-Res (1024px)** | `~4.35s` | Stable around `~11.8GB` |\n\nDetailed benchmarks across different hardware configurations: **[docs\u002FBENCHMARKS.md](docs\u002FBENCHMARKS.md)**\n\n## Highlights\n\n- **MPS \u002F CUDA full-pipeline GPU acceleration** — VLM, SAM2, and YOLO training all GPU-accelerated\n- **Strategy Pattern GPU memory** — `gpu_memory.py` centralizes CUDA \u002F MPS cleanup; CUDA uses `expandable_segments:True`\n- **SAM2 \u002F SAM3 mask refinement** — SAM2 refines VLM bboxes; SAM3 does text-driven detection+segmentation in one pass\n- **5 export formats** — YOLO, YOLO-Seg, COCO, Pascal VOC, CreateML\n- **Detect & Segment training** — polygon labels auto-used when SAM2 masks are available\n- **Cross-platform** — macOS MPS, Windows \u002F Linux CUDA, unified codebase\n- **Unified SSE model status** — single EventSource for VLM, SAM2, SAM3 states; no polling\n\n## Development\n\n```bash\n# Frontend\ncd frontend && pnpm install && pnpm run lint && pnpm run build\n\n# Backend\ncd backend && source .venv\u002Fbin\u002Factivate\nPYTHONPATH=. 
alembic upgrade head\npython -m compileall app alembic\n```\n\n## Stargazers\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=Somnusochi\u002FVLM-AutoYOLO&type=Date)](https:\u002F\u002Fstar-history.com\u002F#Somnusochi\u002FVLM-AutoYOLO&Date)\n\n## License\n\nCode: [AGPL-3.0](LICENSE).\n\nThird-party dependencies:\n- LocateAnything-3B model — [NVIDIA License](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002FLocateAnything-3B\u002Fblob\u002Fmain\u002FLICENSE) (non-commercial use only)\n- SAM3 model — [Facebook Research License](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fsam3) (gated repository, requires HuggingFace access token)\n- Ultralytics YOLO — [AGPL-3.0](https:\u002F\u002Fgithub.com\u002Fultralytics\u002Fultralytics\u002Fblob\u002Fmain\u002FLICENSE) (copyleft; training\u002Fdeployment may trigger obligations)\n\n---\n\nIf this project helps you, please ⭐ [star it on GitHub](https:\u002F\u002Fgithub.com\u002FSomnusochi\u002FVLM-AutoYOLO). I'm open to new opportunities — reach out: somnusochi@gmail.com\n",2,"2026-06-11 04:11:07","CREATED_QUERY"]