[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72574":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72574,"yoloe","THU-MIG\u002Fyoloe","THU-MIG","YOLOE: Real-Time Seeing Anything [ICCV 2025]","https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07465",null,"Python",2164,204,11,80,0,3,13,38,9,75.24,"GNU Affero General Public License v3.0",false,"main",true,[],"2026-06-12 04:01:06","# [YOLOE: Real-Time Seeing Anything](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07465)\n\nOfficial PyTorch implementation of **YOLOE**. 
ICCV 2025.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Fcomparison.svg\" width=70%> \u003Cbr>\n  Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and YOLO-Worldv2 in terms of open text prompts.\n\u003C\u002Fp>\n\n[YOLOE: Real-Time Seeing Anything](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07465).\\\nAo Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding\\\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2503.07465-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07465) [![Hugging Face Models](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Ftree\u002Fmain) [![Hugging Face Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjameslahm\u002Fyoloe) [![Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Froboflow-ai\u002Fnotebooks\u002Fblob\u002Fmain\u002Fnotebooks\u002Fzero-shot-object-detection-and-segmentation-with-yoloe.ipynb) [![Hugging Face Collection](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Collection-blue)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fjameslahm\u002Fyoloe-67d5110aabaefbe129c15917) [![Openbayes Demo](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Demo&message=OpenBayes%E8%B4%9D%E5%BC%8F%E8%AE%A1%E7%AE%97&color=green)](https:\u002F\u002Fopenbayes.com\u002Fconsole\u002Fpublic\u002Ftutorials\u002FBQhUorEqyVX)\n\n\nWe introduce **YOLOE(ye)**, a highly **efficient**, **unified**, and **open** object detection and segmentation model, like human eye, under different prompt mechanisms, like *texts*, *visual inputs*, and *prompt-free paradigm*, with **zero inference and transferring overhead** 
compared with closed-set YOLOs.\n\n\u003C!-- \u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Fpipeline.svg\" width=96%> \u003Cbr>\n\u003C\u002Fp> -->\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Fvisualization.svg\" width=96%> \u003Cbr>\n\u003C\u002Fp>\n\n\n\u003Cdetails>\n  \u003Csummary>\n  \u003Cfont size=\"+1\">Abstract\u003C\u002Ffont>\n  \u003C\u002Fsummary>\nObject detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. 
Notably, on LVIS, with $3\\times$ less training cost and $1.4\\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 $AP^b$ and 0.4 $AP^m$ gains over closed-set YOLOv8-L with nearly $4\\times$ less training time.\n\u003C\u002Fdetails>\n\u003Cp>\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Fpipeline.svg\" width=96%> \u003Cbr>\n\u003C\u002Fp>\n\n## Performance\n\n### Zero-shot detection evaluation\n\n- *Fixed AP* is reported on LVIS `minival` set with text (T) \u002F visual (V) prompts.\n- Training time is for text prompts with detection based on 8 Nvidia RTX4090 GPUs.\n- FPS is measured on T4 with TensorRT and iPhone 12 with CoreML, respectively.\n- For training data, OG denotes Objects365v1 and GoldG.\n- YOLOE can become YOLOs after re-parameterization with **zero inference and transferring overhead**.\n\n| Model | Size | Prompt | Params | Data | Time | FPS | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | Log |\n|---|---|---|---|---|---|---|---|---|---|---|---|\n| [YOLOE-v8-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8s-seg.pt) | 640 | T \u002F V | 12M \u002F 13M | OG | 12.0h | 305.8 \u002F 64.3 | 27.9 \u002F 26.2 | 22.3 \u002F 21.3 | 27.8 \u002F 27.7 | 29.0 \u002F 25.7 | [T](.\u002Flogs\u002Fyoloe-v8s-seg) \u002F [V](.\u002Flogs\u002Fyoloe-v8s-seg-vp) |\n| [YOLOE-v8-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8m-seg.pt) | 640 | T \u002F V | 27M \u002F 30M | OG | 17.0h | 156.7 \u002F 41.7 | 32.6 \u002F 31.0 | 26.9 \u002F 27.0 | 31.9 \u002F 31.7 | 34.4 \u002F 31.1 | [T](.\u002Flogs\u002Fyoloe-v8m-seg) \u002F [V](.\u002Flogs\u002Fyoloe-v8m-seg-vp) |\n| [YOLOE-v8-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8l-seg.pt) | 640 | T \u002F V | 45M \u002F 50M | OG | 22.5h | 102.5 \u002F 27.2 | 35.9 \u002F 34.2 | 33.2 \u002F 33.2 | 34.8 \u002F 
34.6 | 37.3 \u002F 34.1 | [T](.\u002Flogs\u002Fyoloe-v8l-seg) \u002F [V](.\u002Flogs\u002Fyoloe-v8l-seg-vp) |\n| [YOLOE-11-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11s-seg.pt) | 640 | T \u002F V | 10M \u002F 12M | OG | 13.0h | 301.2 \u002F 73.3 | 27.5 \u002F 26.3 | 21.4 \u002F 22.5 | 26.8 \u002F 27.1 | 29.3 \u002F 26.4 | [T](.\u002Flogs\u002Fyoloe-11s-seg) \u002F [V](.\u002Flogs\u002Fyoloe-11s-seg-vp) |\n| [YOLOE-11-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11m-seg.pt) | 640 | T \u002F V | 21M \u002F 27M | OG | 18.5h | 168.3 \u002F 39.2 | 33.0 \u002F 31.4 | 26.9 \u002F 27.1 | 32.5 \u002F 31.9 | 34.5 \u002F 31.7 | [T](.\u002Flogs\u002Fyoloe-11m-seg) \u002F [V](.\u002Flogs\u002Fyoloe-11m-seg-vp) |\n| [YOLOE-11-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11l-seg.pt) | 640 | T \u002F V | 26M \u002F 32M | OG | 23.5h | 130.5 \u002F 35.1 | 35.2 \u002F 33.7 | 29.1 \u002F 28.1 | 35.0 \u002F 34.6 | 36.5 \u002F 33.8 | [T](.\u002Flogs\u002Fyoloe-11l-seg) \u002F [V](.\u002Flogs\u002Fyoloe-11l-seg-vp) |\n\n### Zero-shot segmentation evaluation\n\n- The model is the same as above in [Zero-shot detection evaluation](#zero-shot-detection-evaluation).\n- *Standard AP\u003Csup>m\u003C\u002Fsup>* is reported on LVIS `val` set with text (T) \u002F visual (V) prompts.\n\n| Model | Size | Prompt | $AP^m$ | $AP_r^m$ | $AP_c^m$ | $AP_f^m$ |\n|---|---|---|---|---|---|---|\n| YOLOE-v8-S | 640 | T \u002F V | 17.7 \u002F 16.8 | 15.5 \u002F 13.5 | 16.3 \u002F 16.7 | 20.3 \u002F 18.2 |\n| YOLOE-v8-M | 640 | T \u002F V | 20.8 \u002F 20.3 | 17.2 \u002F 17.0 | 19.2 \u002F 20.1 | 24.2 \u002F 22.0 |\n| YOLOE-v8-L | 640 | T \u002F V | 23.5 \u002F 22.0 | 21.9 \u002F 16.5 | 21.6 \u002F 22.1 | 26.4 \u002F 24.3 |\n| YOLOE-11-S | 640 | T \u002F V | 17.6 \u002F 17.1 | 16.1 \u002F 14.4 | 15.6 \u002F 16.8 | 20.5 \u002F 18.6 |\n| YOLOE-11-M | 640 | T \u002F V | 
21.1 \u002F 21.0 | 17.2 \u002F 18.3 | 19.6 \u002F 20.6 | 24.4 \u002F 22.6 |\n| YOLOE-11-L | 640 | T \u002F V | 22.6 \u002F 22.5 | 19.3 \u002F 20.5 | 20.9 \u002F 21.7 | 26.0 \u002F 24.1 |\n\n### Prompt-free evaluation\n\n- The model is the same as above in [Zero-shot detection evaluation](#zero-shot-detection-evaluation) except the specialized prompt embedding.\n- *Fixed AP* is reported on LVIS `minival` set and FPS is measured on Nvidia T4 GPU with Pytorch.\n\n| Model | Size | Params | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | FPS | Log |\n|---|---|---|---|---|---|---|---|---|\n| [YOLOE-v8-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8s-seg-pf.pt) | 640 | 13M | 21.0 | 19.1 | 21.3 | 21.0 | 95.8 | [PF](.\u002Flogs\u002Fyoloe-v8s-seg-pf\u002F) |\n| [YOLOE-v8-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8m-seg-pf.pt) | 640 | 29M | 24.7 | 22.2 | 24.5 | 25.3 | 45.9 | [PF](.\u002Flogs\u002Fyoloe-v8m-seg-pf\u002F) |\n| [YOLOE-v8-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8l-seg-pf.pt) | 640 | 47M | 27.2 | 23.5 | 27.0 | 28.0 | 25.3 | [PF](.\u002Flogs\u002Fyoloe-v8l-seg-pf\u002F) |\n| [YOLOE-11-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11s-seg-pf.pt) | 640 | 11M | 20.6 | 18.4 | 20.2 | 21.3 | 93.0 | [PF](.\u002Flogs\u002Fyoloe-11s-seg-pf\u002F) |\n| [YOLOE-11-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11m-seg-pf.pt) | 640 | 24M | 25.5 | 21.6 | 25.5 | 26.1 | 42.5 | [PF](.\u002Flogs\u002Fyoloe-11m-seg-pf\u002F) |\n| [YOLOE-11-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11l-seg-pf.pt) | 640 | 29M | 26.3 | 22.7 | 25.8 | 27.5 | 34.9 | [PF](.\u002Flogs\u002Fyoloe-11l-seg-pf\u002F) |\n\n### Downstream transfer on COCO\n\n- During transferring, YOLOE-v8 \u002F YOLOE-11 is **exactly the same** as 
YOLOv8 \u002F YOLO11.\n- For *Linear probing*, only the last conv in classification head is trainable.\n- For *Full tuning*, all parameters are trainable.\n\n| Model | Size | Epochs | $AP^b$ | $AP^b_{50}$ | $AP^b_{75}$ | $AP^m$ | $AP^m_{50}$ | $AP^m_{75}$ | Log |\n|---|---|---|---|---|---|---|---|---|---|\n| Linear probing | | | | | | | | | |\n| [YOLOE-v8-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8s-seg-coco-pe.pt) | 640 | 10 | 35.6 | 51.5 | 38.9 | 30.3 | 48.2 | 32.0 | [LP](.\u002Flogs\u002Fyoloe-v8s-seg-coco-pe\u002F) |\n| [YOLOE-v8-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8m-seg-coco-pe.pt) | 640 | 10 | 42.2 | 59.2 | 46.3 | 35.5 | 55.6 | 37.7 | [LP](.\u002Flogs\u002Fyoloe-v8m-seg-coco-pe\u002F) |\n| [YOLOE-v8-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8l-seg-coco-pe.pt) | 640 | 10 | 45.4 | 63.3 | 50.0 | 38.3 | 59.6 | 40.8 | [LP](.\u002Flogs\u002Fyoloe-v8l-seg-coco-pe\u002F) |\n| [YOLOE-11-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11s-seg-coco-pe.pt) | 640 | 10 | 37.0 | 52.9 | 40.4 | 31.5 | 49.7 | 33.5 | [LP](.\u002Flogs\u002Fyoloe-11s-seg-coco-pe\u002F) |\n| [YOLOE-11-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11m-seg-coco-pe.pt) | 640 | 10 | 43.1 | 60.6 | 47.4 | 36.5 | 56.9 | 39.0 | [LP](.\u002Flogs\u002Fyoloe-11m-seg-coco-pe\u002F) |\n| [YOLOE-11-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11l-seg-coco-pe.pt) | 640 | 10 | 45.1 | 62.8 | 49.5 | 38.0 | 59.2 | 40.6 | [LP](.\u002Flogs\u002Fyoloe-11l-seg-coco-pe\u002F) |\n| Full tuning | | | | | | | | | |\n| [YOLOE-v8-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8s-seg-coco.pt) | 640 | 160 | 45.0 | 61.6 | 49.1 | 36.7 | 58.3 | 39.1 | 
[FT](.\u002Flogs\u002Fyoloe-v8s-seg-coco\u002F) |\n| [YOLOE-v8-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8m-seg-coco.pt) | 640 | 80 | 50.4 | 67.0 | 55.2 | 40.9 | 63.7 | 43.5 | [FT](.\u002Flogs\u002Fyoloe-v8m-seg-coco\u002F) |\n| [YOLOE-v8-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-v8l-seg-coco.pt) | 640 | 80 | 53.0 | 69.8 | 57.9 | 42.7 | 66.5 | 45.6 | [FT](.\u002Flogs\u002Fyoloe-v8l-seg-coco\u002F) |\n| [YOLOE-11-S](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11s-seg-coco.pt) | 640 | 160 | 46.2 | 62.9 | 50.0 | 37.6 | 59.3 | 40.1 | [FT](.\u002Flogs\u002Fyoloe-11s-seg-coco\u002F) |\n| [YOLOE-11-M](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11m-seg-coco.pt) | 640 | 80 | 51.3 | 68.3 | 56.0 | 41.5 | 64.8 | 44.3 | [FT](.\u002Flogs\u002Fyoloe-11m-seg-coco\u002F) |\n| [YOLOE-11-L](https:\u002F\u002Fhuggingface.co\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fyoloe-11l-seg-coco.pt) | 640 | 80 | 52.6 | 69.7 | 57.5 | 42.4 | 66.2 | 45.2 | [FT](.\u002Flogs\u002Fyoloe-11l-seg-coco\u002F) |\n\n## Installation\nYou can also quickly try YOLOE for [prediction](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1LRFEVarAIVSnIeL_pCPtsFL87FsEe46U?usp=sharing) and [transferring](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1y-r4y_owfFAfyqbqP2t64H7IqjURkKwe?usp=sharing) using Colab notebooks.\n\nA `conda` virtual environment is recommended. 
\n```bash\nconda create -n yoloe python=3.10 -y\nconda activate yoloe\n\n# If you cloned this repo, install the requirements\npip install -r requirements.txt\n# Or install the repo and its bundled dependencies directly with pip\npip install git+https:\u002F\u002Fgithub.com\u002FTHU-MIG\u002Fyoloe.git#subdirectory=third_party\u002FCLIP\npip install git+https:\u002F\u002Fgithub.com\u002FTHU-MIG\u002Fyoloe.git#subdirectory=third_party\u002Fml-mobileclip\npip install git+https:\u002F\u002Fgithub.com\u002FTHU-MIG\u002Fyoloe.git#subdirectory=third_party\u002Flvis-api\npip install git+https:\u002F\u002Fgithub.com\u002FTHU-MIG\u002Fyoloe.git\n\nwget https:\u002F\u002Fdocs-assets.developer.apple.com\u002Fml-research\u002Fdatasets\u002Fmobileclip\u002Fmobileclip_blt.pt\n```\n\n## Demo\nIf the desired objects are not identified, please set a **smaller** confidence threshold, e.g., for visual prompts with handcrafted shapes or cross-image prompts.\n```bash\n# Optional for mirror: export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\npip install gradio==4.42.0 gradio_image_prompter==0.1.0 fastapi==0.112.2 huggingface-hub==0.26.3 gradio_client==1.3.0 pydantic==2.10.6\npython app.py\n# Please visit http:\u002F\u002F127.0.0.1:7860\n```\n\n## Prediction\n```bash\n# Download pretrained models\n# Optional for mirror: export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n# Please replace the pt file with your desired model\npip install huggingface-hub==0.26.3\nhuggingface-cli download jameslahm\u002Fyoloe yoloe-v8l-seg.pt --local-dir pretrain\n```\nFor yoloe-(v8s\u002Fm\u002Fl)\u002F(11s\u002Fm\u002Fl)-seg, models can also be downloaded automatically using `from_pretrained`.\n```python\nfrom ultralytics import YOLOE\nmodel = YOLOE.from_pretrained(\"jameslahm\u002Fyoloe-v8l-seg\")\n```\n\n### Text prompt\n```bash\npython predict_text_prompt.py \\\n    --source ultralytics\u002Fassets\u002Fbus.jpg \\\n    --checkpoint pretrain\u002Fyoloe-v8l-seg.pt \\\n    --names person dog cat \\\n    --device cuda:0\n```\n\n### Visual 
prompt\n```bash\npython predict_visual_prompt.py\n```\n\n### Prompt free\n```bash\npython predict_prompt_free.py\n```\n\n## Transferring\nAfter pretraining, YOLOE-v8 \u002F YOLOE-11 can be re-parameterized into the same architecture as YOLOv8 \u002F YOLO11, with **zero overhead for transferring**.\n\n### Linear probing\nOnly the last conv layer, i.e., the prompt embedding, is trainable.\n```bash\npython train_pe.py\n```\n\n### Full tuning\nAll parameters are trainable, for better performance.\n```bash\n# For s-scale models, please change the epochs to 160 for longer training\npython train_pe_all.py\n```\n\n## Validation\n\n### Data\n- Please download LVIS by following [these instructions](https:\u002F\u002Fdocs.ultralytics.com\u002Fzh\u002Fdatasets\u002Fdetect\u002Flvis\u002F) or [lvis.yaml](.\u002Fultralytics\u002Fcfg\u002Fdatasets\u002Flvis.yaml).\n- We use this [`minival.txt`](.\u002Ftools\u002Flvis\u002Fminival.txt) with background images for evaluation.\n\n```bash\n# For evaluation with visual prompts, please obtain the referring data.\npython tools\u002Fgenerate_lvis_visual_prompt_data.py\n```\n\n### Zero-shot evaluation on LVIS\n- For text prompts, `python val.py`.\n- For visual prompts, `python val_vp.py`.\n\nFor *Fixed AP*, please refer to the comments in `val.py` and `val_vp.py`, and use `tools\u002Feval_fixed_ap.py` for evaluation.\n\n### Prompt-free evaluation\n```bash\npython val_pe_free.py\npython tools\u002Feval_open_ended.py --json ..\u002Fdatasets\u002Flvis\u002Fannotations\u002Flvis_v1_minival.json --pred runs\u002Fdetect\u002Fval\u002Fpredictions.json --fixed\n```\n\n### Downstream transfer on COCO\n```bash\npython val_coco.py\n```\n\n## Training\n\nTraining consists of three stages:\n- YOLOE is trained with text prompts for detection and segmentation for 30 epochs.\n- Only the visual prompt encoder (SAVPE) is trained with visual prompts for 2 epochs.\n- Only the specialized prompt embedding for the prompt-free setting is trained for 1 epoch.\n\n### Data\n\n| Images | Raw Annotations 
| Processed Annotations |\n|---|---|---|\n| [Objects365v1](https:\u002F\u002Fopendatalab.com\u002FOpenDataLab\u002FObjects365_v1) | [objects365_train.json](https:\u002F\u002Fopendatalab.com\u002FOpenDataLab\u002FObjects365_v1) | [objects365_train_segm.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Fobjects365_train_segm.json) |\n| [GQA](https:\u002F\u002Fnlp.stanford.edu\u002Fdata\u002Fgqa\u002Fimages.zip) | [final_mixed_train_no_coco.json](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fmdetr_annotations\u002Ffinal_mixed_train_no_coco.json) | [final_mixed_train_no_coco_segm.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Ffinal_mixed_train_no_coco_segm.json) |\n| [Flickr30k](https:\u002F\u002Fshannon.cs.illinois.edu\u002FDenotationGraph\u002F) | [final_flickr_separateGT_train.json](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fmdetr_annotations\u002Ffinal_flickr_separateGT_train.json) | [final_flickr_separateGT_train_segm.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjameslahm\u002Fyoloe\u002Fblob\u002Fmain\u002Ffinal_flickr_separateGT_train_segm.json) |\n\nFor annotations, you can either use our preprocessed files directly or run the following script to generate the processed annotations with segmentation masks.\n```bash\n# Generate segmentation data\nconda create -n sam2 python==3.10.16\nconda activate sam2\npip install -r third_party\u002Fsam2\u002Frequirements.txt\npip install -e third_party\u002Fsam2\u002F\n\npython tools\u002Fgenerate_sam_masks.py --img-path ..\u002Fdatasets\u002FObjects365v1\u002Fimages\u002Ftrain --json-path ..\u002Fdatasets\u002FObjects365v1\u002Fannotations\u002Fobjects365_train.json --batch\npython tools\u002Fgenerate_sam_masks.py --img-path ..\u002Fdatasets\u002Fflickr\u002Ffull_images\u002F --json-path 
..\u002Fdatasets\u002Fflickr\u002Fannotations\u002Ffinal_flickr_separateGT_train.json\npython tools\u002Fgenerate_sam_masks.py --img-path ..\u002Fdatasets\u002Fmixed_grounding\u002Fgqa\u002Fimages --json-path ..\u002Fdatasets\u002Fmixed_grounding\u002Fannotations\u002Ffinal_mixed_train_no_coco.json\n\n# Generate objects365v1 labels\npython tools\u002Fgenerate_objects365v1.py\n```\n\nThen, please generate the data and embedding cache for training.\n```bash\n# Generate grounding segmentation cache\npython tools\u002Fgenerate_grounding_cache.py --img-path ..\u002Fdatasets\u002Fflickr\u002Ffull_images\u002F --json-path ..\u002Fdatasets\u002Fflickr\u002Fannotations\u002Ffinal_flickr_separateGT_train_segm.json\npython tools\u002Fgenerate_grounding_cache.py --img-path ..\u002Fdatasets\u002Fmixed_grounding\u002Fgqa\u002Fimages --json-path ..\u002Fdatasets\u002Fmixed_grounding\u002Fannotations\u002Ffinal_mixed_train_no_coco_segm.json\n\n# Generate train label embeddings\npython tools\u002Fgenerate_label_embedding.py\npython tools\u002Fgenerate_global_neg_cat.py\n```\nFinally, please download MobileCLIP-B(LT) as the text encoder.\n```bash\nwget https:\u002F\u002Fdocs-assets.developer.apple.com\u002Fml-research\u002Fdatasets\u002Fmobileclip\u002Fmobileclip_blt.pt\n```\n\n### Text prompt\n```bash\n# For l-scale models, please change the initialization by referring to the comments in Line 549 in ultralytics\u002Fnn\u002Fmodules\u002Fhead.py\n# If you want to train YOLOE only for detection, you can use `train.py`\npython train_seg.py\n```\n\n### Visual prompt\n```bash\n# For visual prompts, because only SAVPE is trained, we can adopt the detection pipeline with less training time\n\n# First, obtain the detection model\npython tools\u002Fconvert_segm2det.py\n# Then, train the SAVPE module\npython train_vp.py\n# After training, please use tools\u002Fget_vp_segm.py to add the segmentation head\n# python tools\u002Fget_vp_segm.py\n```\n\n### Prompt free\n```bash\n# Generate LVIS 
with single class for evaluation during training\npython tools\u002Fgenerate_lvis_sc.py\n\n# Similar to visual prompt, because only the specialized prompt embedding is trained, we can adopt the detection pipeline with less training time\npython tools\u002Fconvert_segm2det.py\npython train_pe_free.py\n# After training, please use tools\u002Fget_pf_free_segm.py to add the segmentation head\n# python tools\u002Fget_pf_free_segm.py\n```\n\n## Export\nAfter re-parameterization, YOLOE-v8 \u002F YOLOE-11 can be exported into the identical format as YOLOv8 \u002F YOLO11, with **zero overhead for inference**.\n```bash\npip install onnx coremltools onnxslim\npython export.py\n```\n\n## Benchmark\n- For TensorRT, please refer to `benchmark.sh`.\n- For CoreML, please use the benchmark tool from [XCode 14](https:\u002F\u002Fdeveloper.apple.com\u002Fvideos\u002Fplay\u002Fwwdc2022\u002F10027\u002F).\n- For prompt-free setting, please refer to `tools\u002Fbenchmark_pf.py`.\n\n## Acknowledgement\n\nThe code base is built with [ultralytics](https:\u002F\u002Fgithub.com\u002Fultralytics\u002Fultralytics), [YOLO-World](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FYOLO-World), [MobileCLIP](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-mobileclip), [lvis-api](https:\u002F\u002Fgithub.com\u002Flvis-dataset\u002Flvis-api), [CLIP](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP), and [GenerateU](https:\u002F\u002Fgithub.com\u002FFoundationVision\u002FGenerateU).\n\nThanks for the great implementations! 
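\n\n## Citation\n\nIf our code or models help your work, please cite our paper:\n```BibTeX\n@misc{wang2025yoloerealtimeseeing,\n      title={YOLOE: Real-Time Seeing Anything}, \n      author={Ao Wang and Lihao Liu and Hui Chen and Zijia Lin and Jungong Han and Guiguang Ding},\n      year={2025},\n      eprint={2503.07465},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07465}, \n}\n```\n","YOLOE is an efficient, unified, and open object detection and segmentation model that aims to recognize arbitrary objects in real time. It works with several prompt mechanisms, including text prompts, visual inputs, and a prompt-free paradigm. Compared with conventional closed-set YOLO models, YOLOE adapts to more open scenarios while keeping zero inference and transfer overhead. The model is implemented in PyTorch and introduces the Re-parameterizable Region-Text Alignment (RepRTA) strategy to refine pretrained text embeddings and strengthen visual-textual alignment. In addition, for visual prompts, YOLOE proposes a semantic-activated visual processing method. The project suits applications that need flexible detection and segmentation of objects from unknown categories, such as intelligent surveillance and autonomous driving.",2,"2026-06-11 03:42:39","high_star"]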