[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82150":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},82150,"VibeSearchBench","VibeBench\u002FVibeSearchBench","VibeBench","🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.",null,"Python",878,12,2,0,104,776,51,7.34,"MIT License",false,"main",true,[],"2026-06-12 02:04:23","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Fimg\u002Flogo.png\" width=\"160\" alt=\"VibeSearchBench Logo\">\n\n# VibeSearchBench\n\n[![Tasks](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftasks-200-blue)](#tasks)\n[![Best F1](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fbest_triplet_F1-30.3-green)](#leaderboard)\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-PDF-red)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.27882)\n[![Leaderboard](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fleaderboard-live-purple)](https:\u002F\u002Fvibebench.github.io\u002FVibeSearchBench.github.io\u002Fleaderboard.html)\n[![Project 
Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fproject_page-live-2563eb)](https:\u002F\u002Fvibebench.github.io\u002FVibeSearchBench.github.io\u002F)\n[![Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Dataset-yellow)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVibeSearchBench\u002FVibeSearchBench)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-orange)](LICENSE)\n\n> \u003Cspan style=\"color:#dc2626;background:rgba(220,38,38,0.12);padding:0.15em 0.4em;border-radius:4px;font-weight:800;box-shadow:inset 0 -2px 0 rgba(220,38,38,0.45)\">Hardest\u003C\u002Fspan> — vague multi-turn proactive search in the wild.\u003Cbr>\n> \u003Cspan style=\"color:#15803d;background:rgba(22,163,74,0.14);padding:0.15em 0.4em;border-radius:4px;font-weight:800;box-shadow:inset 0 -2px 0 rgba(22,163,74,0.4)\">Verifiable\u003C\u002Fspan> — schema-free knowledge graph evaluation.\u003Cbr>\n> \u003Cspan style=\"color:#7c3aed;background:rgba(124,58,237,0.14);padding:0.15em 0.4em;border-radius:4px;font-weight:800;box-shadow:inset 0 -2px 0 rgba(124,58,237,0.4)\">Long-horizon\u003C\u002Fspan> — persona-driven progressive disclosure.\n\n\u003C\u002Fdiv>\n\n---\n\n## Leaderboard\n\nBrowse the full leaderboard and multi-turn task trajectories at **[vibebench.github.io\u002FVibeSearchBench.github.io](https:\u002F\u002Fvibebench.github.io\u002FVibeSearchBench.github.io\u002F)**.\n\n**Evaluation:**\n\n* **Primary metric: Triplet F1.** Predicted knowledge graphs are matched against ground truth via LLM-as-judge node alignment and triplet semantic equivalence.\n* **Multi-turn interaction.** Each task uses a persona-driven user simulator with progressive disclosure; agents may search, visit pages, and run code across many turns.\n* **Best reported score:** **30.3** triplet F1 (Claude Opus 4.6, OpenClaw).\n\n## Tasks\n\n200 tasks across 2 subsets and 20 domains. 
Each task pairs a vague initial query with a ground-truth knowledge graph.\n\n| Split | Count | Description |\n|-------|-------|-------------|\n| `pro` | 100 | Professional research — literature reviews, market analysis, technical due diligence |\n| `daily` | 100 | Daily-life search — shopping, travel, lifestyle with evolving preferences |\n\nReal users rarely specify full intent upfront. **VibeSearch** captures bidirectional convergence: agents interleave partial results with follow-up questions while users progressively disclose needs.\n\n### Dataset\n\nAvailable on Hugging Face: [VibeSearchBench\u002FVibeSearchBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVibeSearchBench\u002FVibeSearchBench)\n\n| Field | Description |\n|-------|-------------|\n| `qid` | Unique task identifier |\n| `question` | Full research query with constraints |\n| `user_persona` | Persona for the progressive-disclosure simulator |\n| `nodes` \u002F `triples` | Ground-truth knowledge graph |\n\n---\n\n## Quick Start\n\n### GeneralAgent (LLM-based)\n\nUses an OpenAI-compatible LLM to drive multi-step web research.\n\n```bash\n# Full pipeline (inference + evaluation)\nMODEL_NAME=glm-5.1 VLLM_URL=http:\u002F\u002Fhost\u002Fv1 bash scripts\u002Frun_all.sh\n\n# Inference only\nMODEL_NAME=kimi-k2.5 VLLM_URL=http:\u002F\u002Fhost\u002Fv1 bash scripts\u002Frun_inference.sh\n\n# With model config profile\nMODEL_CONFIG=model_config.yaml MODEL_PROFILE=seed2_0_pro bash scripts\u002Frun_all.sh\n```\n\n### OpenClaw Agent (CLI-based)\n\nWraps the OpenClaw CLI into the benchmark. 
Requires a running OpenClaw gateway.\n\n```bash\n# Default (simulated mode)\nbash scripts\u002Frun_openclaw.sh\n\n# Direct mode (no user simulation)\nMODE=direct bash scripts\u002Frun_openclaw.sh\n\n# Custom data and model\nDATA_PATH=tasks\u002Fmy_tasks MODE=simulated OPENCLAW_MODEL=my-model bash scripts\u002Frun_openclaw.sh\n```\n\nKey OpenClaw env vars: `GATEWAY_PORT` (default 18789), `SOURCE_DIR`, `IDLE_THRESHOLD`, `MAX_NUDGE`, `OPENCLAW_MODEL`.\n\n### Evaluation Only\n\n```bash\nTRAJS_DIR=results\u002Ftrajs\u002Fglm-5.1_custom_serper bash scripts\u002Frun_eval.sh\n```\n\n### Direct Python Usage\n\n```bash\n# GeneralAgent: full pipeline\npython run.py \\\n  --agent-type general \\\n  --model glm-5.1 \\\n  --vllm-server-url http:\u002F\u002Fhost\u002Fv1 \\\n  --tool-set custom \\\n  --num-samples 4 \\\n  --grader-type gemini \\\n  --grader-api-url https:\u002F\u002F... \\\n  --grader-api-key YOUR_KEY\n\n# GeneralAgent: inference only\npython run.py \\\n  --agent-type general \\\n  --model glm-5.1 \\\n  --vllm-server-url http:\u002F\u002Fhost\u002Fv1 \\\n  --skip-eval\n\n# OpenClaw agent\npython run.py \\\n  --agent-type openclaw \\\n  --gateway-port 18789 \\\n  --mode simulated \\\n  --user-model doubao-seed-2-0-pro \\\n  --user-model-url http:\u002F\u002Fhost\u002Fv1 \\\n  --user-model-api-key YOUR_KEY \\\n  --num-samples 4\n\n# Eval only\npython run.py \\\n  --eval-only \\\n  --trajs-dir results\u002Ftrajs\u002Fglm-5.1_custom_serper \\\n  --grader-type gemini \\\n  --grader-api-url https:\u002F\u002F...\n```\n\n---\n\n## Project Structure\n\n```\nVibeSearchBench\u002F\n├── agent\u002F                          # Agent implementations\n│   ├── general_agent.py            # GeneralAgent (OpenAI-compatible, single\u002Fmulti-agent)\n│   ├── openclaw_agent.py           # OpenClaw agent wrapper\n│   ├── llm.py                      # LLM client utilities\n│   ├── prompts.py                  # Prompt templates\n│   └── toolkit.py                  # ToolKit (search 
\u002F visit \u002F python via Serper)\n├── eval\u002F                           # Evaluation module\n│   ├── grader.py                   # GraderClient (OpenAI \u002F Gemini backends)\n│   └── evaluator.py                # KG evaluation: node F1, triplet F1\n├── scripts\u002F                        # Bash\u002FPython scripts\n│   ├── run_all.sh                  # Full pipeline (inference + evaluation)\n│   ├── run_inference.sh            # Agent inference only\n│   ├── run_eval.sh                 # Evaluation only\n│   ├── run_openclaw.sh             # OpenClaw evaluation\n│   └── build_website_data.py       # Export data for the project page\n├── viberesearch_query_synthesis\u002F   # Query synthesis module\n├── website\u002F                        # Static site template (deployed via github.io repo)\n├── tasks\u002F                          # Task JSON files (benchmark data)\n├── results\u002F                        # Output (auto-created)\n├── model_config.yaml               # LLM model profiles\n└── run.py                          # Main entry point\n```\n\n## Configuration\n\n### Environment Variables\n\n| Variable | Description | Default |\n|---|---|---|\n| `MODEL_NAME` | Model name for chat API | `glm-5.1` |\n| `VLLM_URL` | Base URL for chat API | (none) |\n| `TOOL_SET` | `custom` or `builtin` | `custom` |\n| `API_KEY` | API key for main model | (empty) |\n| `MULTI_AGENT` | Set to `1` for multi-agent mode | `0` |\n| `SERPER_API_KEY` | Serper API key for web search | (preset) |\n| `SUMMARIZE_URL` | vLLM URL for page summarization | (preset) |\n| `SUMMARIZE_MODEL` | Model for summarization | `qwen3-30b-a3b-instruct` |\n| `CODE_SANDBOX_URL` | HTTP sandbox for Python tool | (preset) |\n| `GEMINI_API_KEY` | API key for Gemini grader | (preset) |\n| `GEMINI_API_URL` | API URL for Gemini grader | (preset) |\n\n### Tool Sets\n\n- **custom** (default): search (Serper) + visit (Serper scrape + LLM summarize) + python (HTTP sandbox)\n- **builtin**: search + open + 
find (requires `gpt_oss` package)\n\n### Agent Modes\n\n- **Single-agent**: One agent handles the entire query\n- **Multi-agent** (`MULTI_AGENT=1`): Main agent can spawn sub-agents for parallel research\n\n## Output Format\n\n### Trajectories (`results\u002Ftrajs\u002F{experiment}\u002F`)\n\nOne JSONL file per task (`{task_id}.jsonl`); each line is one sample:\n\n```json\n{\"qid\": \"task_042_...\", \"sample_idx\": 0, \"question\": \"...\", \"messages\": [...], \"response\": \"...\", \"termination\": \"answer\", ...}\n```\n\n### Evaluation (`results\u002Feval\u002F{experiment}\u002F`)\n\n- `{task_id}_sample{N}.json` — Per-trajectory evaluation with node\u002Ftriplet metrics\n- `item_ratings.json` — All per-item results\n- `summary.json` — Aggregated metrics (avg@N, best@N)\n\n## Dependencies\n\n```\nopenai aiohttp httpx tqdm transformers json_repair\n```\n\n## Evaluation Metrics\n\nTwo-phase LLM-as-judge evaluation:\n\n1. **Node matching**: LLM matches predicted entities to ground-truth entities (alias\u002Ftranslation-aware)\n2. **Triplet matching**: For matched entity pairs, LLM judges relation semantic equivalence\n\nMetrics: Precision, Recall, F1 at both node and triplet levels, with avg@N and best@N aggregation across samples.\n\n## License\n\nThis project is released under the [MIT License](LICENSE).\n\n---\n\n\u003Cp align=\"center\">VibeSearchBench · Rednote-Hilab &amp; Unipat AI\u003C\u002Fp>\n","VibeSearchBench is a benchmark for evaluating multi-turn proactive search tasks. Its 200 tasks, spanning professional research and daily-life search, test how models perform when given a vague initial query, and results are verified against schema-free knowledge graphs. Core features include LLM-based node alignment and triplet semantic-equivalence evaluation, plus a multi-turn user simulator that progressively discloses needs while the agent searches, visits pages, and runs code. It suits long-horizon, persona-driven information-retrieval scenarios such as market analysis, technical due diligence, or everyday shopping and travel planning.","2026-06-11 04:07:53","CREATED_QUERY"]