[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80151":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":12,"forks30d":12,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":12,"starSnapshotCount":12,"syncStatus":15,"lastSyncTime":26,"discoverSource":27},80151,"HyperEyes","DeepExperience\u002FHyperEyes","DeepExperience","HyperEyes is a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective.",null,"Python",60,0,52,1,2,5,8,6,44.3,false,"main",[],"2026-06-12 04:01:26","# HyperEyes\n\n**HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents**\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.07177\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.07177-b31b1b.svg\" alt=\"arXiv\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdoi.org\u002F10.48550\u002FarXiv.2605.07177\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDOI-10.48550%2FarXiv.2605.07177-blue.svg\" alt=\"DOI\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FDeepExperience\u002FHyperEyes\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-GitHub-181717.svg?logo=github\" alt=\"Code\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n> *Search wider, not longer.*\n\nHyperEyes is a **parallel 
multimodal search agent** that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002FTeaser.png\" alt=\"HyperEyes Teaser\" width=\"90%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Ci>Comparison between conventional multimodal search agents and HyperEyes. While conventional agents suffer from redundant interaction rounds to process multiple entities, HyperEyes achieves high efficiency by grounding and searching multiple entities concurrently in a single turn.\u003C\u002Fi>\u003C\u002Fp>\n\n---\n\n## 🔥 Highlights\n\n- **Parallel Multimodal Search Agent.** A new agent paradigm operating on a **Unified Grounded Search (UGS)** action space that fuses visual grounding and retrieval into one atomic action, extending text-level parallelism to the visual modality.\n- **Dual-Grained Efficiency-Aware RL Framework.**\n  - **Macro-level — TRACE** (Tool-use Reference-Adaptive Cost Efficiency): a trajectory-level reward whose reference is *monotonically tightened* during training to suppress superfluous tool calls without over-restricting genuine multi-hop search.\n  - **Micro-level — On-Policy Distillation (OPD):** dense token-level corrective signals from an external teacher on failed rollouts, mitigating credit-assignment deficiency of sparse outcome rewards.\n- **Parallel-Amenable Data Synthesis Pipeline.** Covers visual multi-entity and textual multi-constraint queries, with **Progressive Rejection Sampling** to curate efficiency-oriented cold-start trajectories.\n- **IMEB Benchmark.** A human-curated **Image Multi-Entity Benchmark** (300 instances) that jointly evaluates multimodal search **accuracy and efficiency** — the first benchmark to make operational efficiency a first-class metric in multi-entity visual scenarios.\n- **State-of-the-art Performance.** Across 
six benchmarks, **HyperEyes-30B** surpasses the strongest open-source multimodal search agent of comparable scale by **+9.9% accuracy** with **5.3× fewer** tool-call rounds on average.\n\n---\n\n## 📖 Motivation\n\nThe parametric knowledge of (M)LLMs is constrained by their training cutoff, motivating **search agents** that ground responses in real-time, verifiable information. Yet the prevailing paradigm of multimodal search agents relies heavily on **sequential** tool invocations to deepen the reasoning chain, incurring severe interaction redundancy when queries naturally decompose into independent sub-retrievals.\n\nWhile parallel tool invocation has emerged in text-based agents, *possessing parallel capability does not guarantee efficient search behavior.* When models are optimized purely by accuracy reward, they lack the incentive to prefer compact parallel trajectories over verbose ones — parallelism degrades into brute-force over-searching.\n\nHyperEyes addresses this with the principle of **\"search wider, not longer\"**: dispatching multiple grounded queries concurrently within a round, rather than chaining them sequentially.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Fparallel_vs_serial.png\" alt=\"Parallel vs Serial Search\" width=\"90%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Ci>Parallel multimodal search vs. 
conventional serial search: HyperEyes dispatches multiple grounded queries concurrently within a single round, drastically reducing interaction rounds and end-to-end latency.\u003C\u002Fi>\u003C\u002Fp>\n\n---\n\n## 🧠 Method Overview\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Fframework.png\" alt=\"HyperEyes Framework\" width=\"95%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Ci>Overview of the HyperEyes framework: a Unified Grounded Search (UGS) action space combined with a two-stage training recipe — Parallel-Amenable Data Synthesis for cold-start SFT, followed by Dual-Grained Efficiency-Aware RL with TRACE (trajectory-level) and OPD (token-level) supervision.\u003C\u002Fi>\u003C\u002Fp>\n\nHyperEyes is trained in two stages on top of the **UGS** action space:\n\n1. **Cold-start (SFT) via Parallel-Amenable Data Synthesis.**\n   - Synthesize visual multi-entity & textual multi-constraint queries.\n   - Apply *Progressive Rejection Sampling* to harvest efficiency-oriented trajectories.\n\n2. **Dual-Grained Efficiency-Aware Reinforcement Learning.**\n   - **TRACE (macro):** trajectory-level efficiency reward with monotonically tightening reference, dynamically guiding the policy toward minimum-cost successful trajectories.\n   - **OPD (micro):** on-policy distillation from an expert teacher on *failed* rollouts, providing dense per-token corrective supervision under sparse outcome rewards.\n\nThis dual-grained signal jointly addresses (a) trajectory-level over-searching and (b) token-level credit assignment, producing a policy that is both **wider** in parallel breadth and **shorter** in interaction depth.\n\n---\n\n## 📊 IMEB Benchmark\n\nExisting multimodal search benchmarks evaluate reasoning accuracy while neglecting tool-call efficiency, allowing models to resolve parallelizable queries via verbose sequential trajectories that inflate latency and introduce noisy retrievals. 
To close this gap, we introduce the **Image Multi-Entity Benchmark (IMEB)**, which elevates **search efficiency to a primary evaluation axis** and constructs queries that *strictly* require concurrent localization and retrieval across multiple entities.\n\n**Dataset.** Curated by **PhD-level annotators** through multiple rounds of **double-blind cross-validation**, IMEB comprises **300 rigorously verified instances** across **7 diverse domains** (Sports, Humanities & History, Entertainment, Daily Life, Consumption, Science, Finance), with an average of **4.6 entities per image**. Every question undergoes rigorous human peer review and automated filtering to guarantee that it is unambiguously solvable yet strictly necessitates concurrent external tool invocation.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002FIMEB.png\" alt=\"IMEB Benchmark Overview\" width=\"95%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Ci>Overview of the IMEB benchmark: domain distribution (N = 300), entity-count statistics for each domain, and an example question-answer pair.\u003C\u002Fi>\u003C\u002Fp>\n\n### Cost-Aware Score (CAS)\n\nSince traditional accuracy metrics alone cannot capture parallel operational efficiency, we propose a unified metric that jointly quantifies reasoning correctness and search efficiency:\n\n$$\n\\mathrm{CAS} = \\frac{\\mathrm{Acc}^{2} \\times 100}{N_{\\text{tok}} + 2 N_{\\text{tool}} + 1}\n$$\n\n- **Numerator — Acc² × 100.** The squared accuracy term ensures that *correctness remains the primary optimization objective*; small accuracy gaps are amplified to prevent trivially \"fast but wrong\" agents from scoring high.\n- **Denominator — token & tool cost.** Penalizes token consumption ($N_{\\text{tok}}$, in thousands) and sequential tool-call rounds ($N_{\\text{tool}}$). 
The weights `(1, 2)` approximate a one-second latency overhead for both generation and tool execution.\n- **Net effect.** CAS facilitates **fair comparison across distinct agent architectures** by jointly rewarding accuracy and operational efficiency on a single axis.\n\n---\n\n## 📈 Main Results\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Ftable2.png\" alt=\"Main Results (Table 2)\" width=\"95%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Ci>Main results (accuracy % \u002F tool-call turns) on six multimodal search benchmarks. \u003Cb>Bold\u003C\u002Fb> = best, underline = second-best. Δ rows show absolute improvement of HyperEyes (HE) over the second-best open-source model under the Agentic Workflow setting.\u003C\u002Fi>\u003C\u002Fp>\n\n> **Takeaway.** HyperEyes **Pareto-dominates** existing multimodal search agents on the joint accuracy–efficiency frontier: HE-30B (RL) surpasses the strongest open-source agent by **+9.9% accuracy** and reduces tool-call turns by **9.4** on average; HE-235B (RL) further closes the gap to \u002F outperforms top closed-source models such as Gemini-3.1-Pro on multiple benchmarks while remaining substantially more efficient than existing deep search agents.\n\n---\n\n## 🗺️ Roadmap\n\n- [x] Paper figures and project page\n- [x] Cold-start (SFT) training code\n- [x] Dual-Grained Efficiency-Aware RL training code (TRACE + OPD)\n- [ ] IMEB benchmark release (data + evaluation scripts)\n- [ ] Parallel-Amenable Data Synthesis pipeline\n- [ ] HyperEyes-30B \u002F 235B model weights\n- [ ] Inference \u002F demo scripts\n\n> 📌 **Note.** Code, model checkpoints, and the IMEB benchmark will be released soon. 
Please ⭐ star and watch the repo for updates.\n\n---\n\n## ⭐ Star History\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#DeepExperience\u002FHyperEyes&Date\">\n    \u003Cimg src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=DeepExperience\u002FHyperEyes&type=Date\" alt=\"Star History Chart\" width=\"80%\"\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n## 📜 Citation\n\nIf you find HyperEyes useful for your research, please consider citing:\n\n```bibtex\n@misc{li2026hypereyesdualgrainedefficiencyawarereinforcement,\n      title={HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents}, \n      author={Guankai Li and Jiabin Chen and Yi Xu and Xichen Zhang and Yuan Lu},\n      year={2026},\n      eprint={2605.07177},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.07177}, \n}\n```\n\n\u003C!-- ---\n\n## 📬 Contact\n\nFor questions, suggestions, or collaboration, please open an issue in this repository, or reach out via email:\n\n- **Yuan Lu** (corresponding author) — [`luyuan2@xiaohongshu.com`](mailto:luyuan2@xiaohongshu.com)\n- **Guankai Li** — [`liguankai@xiaohongshu.com`](mailto:liguankai@xiaohongshu.com)\n\n--- -->\n\n\u003C!-- ## 📄 License\n\nCode and benchmark will be released under a permissive open-source license (TBD). Figures are released for academic use. -->\n","HyperEyes is a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities and treating inference efficiency as a first-class training objective. The project is written in Python and, through a Unified Grounded Search (UGS) action space, extends text-level parallelism to the visual modality. It also introduces a dual-grained efficiency-aware reinforcement learning framework that combines the trajectory-level TRACE reward with dense token-level corrective signals from OPD. In addition, HyperEyes provides a data synthesis pipeline and IMEB, the first benchmark dedicated to evaluating both accuracy and efficiency in multi-entity visual scenarios. The project is well suited to applications that need efficient multimodal information retrieval, such as large-scale image database search and complex multimedia content analysis.","2026-06-11 03:59:26","CREATED_QUERY"]