[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79976":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":13,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":13,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":20,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},79976,"VideoSeeker","gaotiexinqu\u002FVideoSeeker","gaotiexinqu","VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation",null,"Python",120,6,3,1,0,41,2.54,false,"main",true,[],"2026-06-12 02:03:56","# VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg\" alt=\"License\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.12-3776AB?logo=python&logoColor=FFD43B\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.12-3776AB?logo=python&logoColor=FFD43B\" alt=\"Python\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgaotiexinqu\u002FVideoSeeker#citation\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCite-VideoSeeker-orange?logo=readme&logoColor=white\" alt=\"Citation\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' > [[🌐 
Homepage](https:\u002F\u002Fgaotiexinqu.github.io\u002FVideoSeeker\u002F)] [[📖 arXiv Paper]()] [[📊 Code](https:\u002F\u002Fgithub.com\u002Fgaotiexinqu\u002FVideoSeeker)]  \u003C\u002Fdiv>\u003C\u002Ffont>\n\n> VideoSeeker is a novel agentic instance-level video understanding paradigm via native tool invocation with visual prompts.\n\n## 🔥 News\n\n* **[2026\u002F05\u002F14]** 🔥 We have released `VideoSeeker`, a novel agentic instance-level video understanding paradigm via visual prompts.\n\n### Teaser\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fassets\u002Fmain.png\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n### Data Pipeline\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fassets\u002Fdata.png\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n### Performance\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fassets\u002Fbench.png\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n# 🚀 Quickstart\n\n## 🔧 Environment Setup\n\n### SFT\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fgaotiexinqu\u002FVideoSeeker\n\nconda create -n llamafactory python=3.12\nconda activate llamafactory\ncd VideoSeeker\u002FLLaMA-Factory\u002FLLaMA-Factory\npip install -e .\n```\n\n### RL\n```\nconda create -n verl python=3.12\nconda activate verl\ncd VideoSeeker\u002Fverl\u002Fverl\nbash scripts\u002Finstall.sh\n```\n\n## 🛠️ Prepare Dataset\n\n### SFT\n\n### RL\n\n### Eval\n\n## ⚡ Start Training\n\n### SFT\n\n### RL\n\n## 📊 Evaluation\n\nWe support multi-benchmark parallel inference and evaluation on various video understanding benchmarks.\n\n### 1. 
Inference\n\nConfigure your model and data paths in `benchmarks.json`:\n\n```json\n{\n  \"name\": \"V2P-Bench\",\n  \"root\": \"\u002Fpath\u002Fto\u002FV2P-Bench\",\n  \"frames_root\": \"$ROOT\u002Fframes\",\n  \"videos_root\": \"$ROOT\u002Fvideos\",\n  \"dataset_info_path\": \"$ROOT\u002Fdataset_info_1148.json\",\n  \"media_root\": \"$ROOT\u002Fvideos\",\n  \"tools\": \"view_visual_prompt\",\n  \"mode\": \"tool\"\n}\n```\n\nKey configuration options:\n- `root`: Base path for the dataset\n- `tools`: Tool type (`view_visual_prompt` or `crop_video`)\n- `mode`: Inference mode (`direct`, `reasoning`, or `tool`)\n- `$ROOT` will be automatically replaced with the `root` value\n\n```bash\n# Set your checkpoint path in run_multi_inference.sh\nCKPT_PATH=\"\u002Fpath\u002Fto\u002Fyour\u002Fmodel\"\n\n# Run multi-benchmark inference\nbash eval\u002Finference\u002Frun_multi_inference.sh\n```\n\n### 2. Evaluation\n\n```bash\n# Calculate metrics for all benchmarks\nbash eval\u002Fcalu_metrics\u002Fstart_all_eval.sh\n\n# Run LLM-as-judge evaluation for LongVT benchmarks\nbash eval\u002Fcalu_metrics\u002Flongvt\u002Fstart_judge.sh\n```\n\n## 📜 Citation\n\n```\n@article{zhao2026videoseeker,\n  title={VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation},\n  author={Yiming Zhao and Yu Zeng and Wenxuan Huang and Zhen Fang and Qing Miao and Qisheng Su and Jiawei Zhao and Jiayin Cai and Lin Chen and Zehui Chen and Yukun Qi and Yao Hu and Xiaolong Jiang and Feng Zhao},\n  journal={arXiv preprint arXiv:2605.16079},\n  year={2026}\n}\n```","VideoSeeker is a new paradigm for instance-level video understanding through visual prompts and native tool invocation. The project is built with Python 3.12. Its core features include multi-benchmark parallel inference and evaluation, covering the full process of parsing, training on, and evaluating video data. It introduces a distinctive data pipeline design intended to improve the accuracy and efficiency of video content analysis. VideoSeeker suits applications that require in-depth understanding of video content, such as video surveillance analysis and intelligent editing assistants, and is particularly suited to researchers and developers exploring frontier applications of video understanding technology.",2,"2026-06-11 03:58:46","CREATED_QUERY"]