[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80811":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":15,"starSnapshotCount":15,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},80811,"agentic-vbench","PhiloLabs\u002Fagentic-vbench","PhiloLabs","AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?","https:\u002F\u002Fagenticvbench.com\u002F",null,"Python",57,3,38,0,19,1.81,"Apache License 2.0",false,"main",true,[23,24,25,26,27],"ai-agents","benchmark","harbor","llm-evaluation","video-editing","2026-06-12 02:04:07","# agentic-vbench\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fagenticvbench.com\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐_Website-agenticvbench.com-green\" alt=\"Website\">\u003C\u002Fa>\n  \u003Ca href=\"paper\u002Fpaper.pdf\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📖_Paper-PDF-blue\" alt=\"Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fagenticvbench.com\u002Fleaderboard\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🏆_Leaderboard-Live-yellow\" alt=\"Leaderboard\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"asset\u002Foverall_fig.png\" alt=\"AgenticVBench: four task families — Assembly, Repair, Sequencing, Repurpose\" width=\"100%\">\n\u003C\u002Fp>\n\n**AgenticVBench** is a 100-task benchmark for evaluating AI agents on real-world video 
post-production workflows — **Assembly**, **Repair**, **Sequencing**, and **Repurpose**. Tasks are authored by 20 industry experts (avg. 6 years of professional experience) and scored on a 0–1 scale, mixing programmatic verifiers with rubric-based LLM judges.\n\nBuilt on [Harbor](https:\u002F\u002Fwww.harborframework.com\u002F) — agent installation, sandboxed execution, concurrency, and trial scoring are handled for you.\n\n---\n\n## 📊 What's in the suite\n\n| Family | Count | What the agent does |\n|---|---:|---|\n| `agentic_vbench_repair` | 18 | Restore a localized corruption (color shift, blur, low-res, swapped object, glitch, content cut, disfluency, or audio defect) in a clip. |\n| `agentic_vbench_assembly` | 18 | Pick 4 candidate clips from a pool and place them in the correct slot order to satisfy a prompt. |\n| `agentic_vbench_sequencing` | 28 | Re-order a set of shuffled clips into the correct narrative sequence. |\n| `agentic_vbench_repurpose` | 36 | Re-cut a long-form source video into a short vertical clip that satisfies a per-task creative brief. |\n\nThe `repair`, `assembly`, and `sequencing` families score with deterministic per-family judges and ship a bundled **oracle solver** + **broken\u002Frandom baseline** — by construction the oracle scores 1.0 and the baseline scores 0.0. See [`docs\u002FVERIFIER_DESIGN.md`](docs\u002FVERIFIER_DESIGN.md) for the per-family scoring math. The `repurpose` family uses a rubric-based LLM-as-judge against a per-task creative brief (deterministic format checks + Gemini\u002FOpus content judges; same 0–1 reward shape) — running it requires both `GEMINI_API_KEY` and `ANTHROPIC_API_KEY` on the verifier side.\n\nTypical model scores live on the [leaderboard](https:\u002F\u002Fagenticvbench.com\u002Fleaderboard) — use it to gauge whether your run is in the expected range.\n\n---\n\n## 🚀 Quick start\n\n### 1. 
Install\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FPhiloLabs\u002Fagentic-vbench.git\ncd agentic-vbench\n.\u002Fscripts\u002Finstall-harbor.sh\npython3 -m venv .venv && .venv\u002Fbin\u002Fpip install --upgrade pip\n```\n\n### 2. Run one task with an agent\n\nPick any agent supported by Harbor (`claude-code`, `codex`, `gemini-cli`, `opencode`, …), export the matching API key, and run via the `.\u002Favb` CLI (a thin wrapper that auto-injects per-agent + per-family env vars into `harbor run`):\n\n```bash\n# Claude Code (Anthropic):\nexport ANTHROPIC_API_KEY=...\n.\u002Favb run exp-codec-restore-task01 -a claude-code -m anthropic\u002Fclaude-sonnet-4-6\n\n# Codex (OpenAI):\nexport OPENAI_API_KEY=...\n.\u002Favb run exp-codec-restore-task01 -a codex -m openai\u002Fgpt-5.5\n```\n\nFor `agentic_vbench_repurpose` tasks the **verifier** additionally needs `GEMINI_API_KEY` (the rubric LLM judge uses Gemini for audio\u002Fvideo grading) — export it and `avb` will forward it via Harbor's `--ve` flag. Run `.\u002Favb tasks env \u003Ctask>` to see what credentials a given task and agent combo need.\n\n**Bringing your own agent.** The four agents listed above are vendor-native Harbor agents that work against the task set out of the box. Custom agents — including open-source harnesses and proprietary stacks — plug into Harbor through a small adapter. See [Harbor's agents docs](https:\u002F\u002Fwww.harborframework.com\u002Fdocs\u002Fagents) for the adapter contract.\n\n**Free smoke test (no agent API spend).** Every repair\u002Fassembly\u002Fsequencing task ships a bundled oracle solver. Use it to confirm the harness is wired up end-to-end:\n\n```bash\n.\u002Favb run exp-codec-restore-task01 -a oracle -e docker\n# reward.json → ≈ 1.0, ~30 s on a cached image, zero agent cost\n```\n\n**Time + cost budget.** A real-agent rollout on Modal typically takes ~10 min per task wall clock. 
Cost depends entirely on agent + model — order-of-magnitude $0.10–$2 per task with mid-tier models, scaling linearly with the agent's token use. Plan accordingly for a 100-task sweep.\n\n**Here's what a task prompt actually looks like** (`exp-codec-restore-task01`):\n\n> # Restore A Muffled Stretch Of Audio\n>\n> I have a short mono speech recording at `\u002Fworkspace\u002Fmaterials\u002Fnoisy.wav`. For a stretch in it, the audio sounds muffled — like the high end has been chopped off and the voice lost its sparkle. The rest of the recording sounds clean and full.\n>\n> Please restore the muffled stretch so it sounds as clear and full as the rest of the recording. Leave the already-clean parts unchanged.\n>\n> ## What to deliver\n> - `\u002Fworkspace\u002Foutput\u002Fenhanced.wav` — 16-bit PCM mono at 16 kHz, same total length (sample count) as the input.\n\nEach task ships its own such brief at `tasks\u002F\u003Cfamily>\u002F\u003Ctask>\u002Fsteps\u002Fsolve\u002Finstruction.md`.\n\n**Inspect the result.** Each trial drops four artifacts under `jobs\u002F\u003Cjob-name>\u002F\u003Ctrial-id>\u002F`:\n\n| File | What it is |\n|---|---|\n| `steps\u002Fsolve\u002Fverifier\u002Freward.json` | Final score + per-metric breakdown. |\n| `agent\u002Ftrajectory.json` | Full event stream Harbor captured for the agent (tool calls, tool results, model messages, final output). |\n| `result.json` | Per-trial Harbor summary (timings, exit codes, exception info). |\n| `trial.log` | Combined stdout\u002Fstderr stream for the whole trial. |\n\n```bash\n.\u002Favb results show          # rewards from the latest job\ncat jobs\u002F\u003Cjob-name>\u002F*\u002Fsteps\u002Fsolve\u002Fverifier\u002Freward.json\n# {\n#   \"reward\": 0.55,\n#   \"details\": { \"reason\": \"ok\", ... 
}\n# }\n```\n\nWhile a trial is running, `.\u002Favb run` prints `tail -F jobs\u002F\u003Cjob>\u002F*\u002Ftrial.log` — copy that to watch progress in another shell.\n\nRun `.\u002Favb -h` to see every subcommand (`tasks list \u002F check \u002F env`, `run`, `rollout`, `results show`).\n\n### 3. Run a full family in parallel\n\n```bash\nexport ANTHROPIC_API_KEY=...\nexport MODAL_TOKEN_ID=... MODAL_TOKEN_SECRET=...\n\n.\u002Favb rollout --family repair --agent claude-code --env modal --max-parallel 20\n```\n\nSame pattern for the other families. Per-task rewards land in `logs\u002Frollout-results.tsv`; full per-trial artifacts (agent trajectory, verifier breakdown) under `jobs\u002F\u003Cjob-name>\u002F`.\n\n---\n\n## 🏆 Submitting to the leaderboard\n\nOnce you've run all 100 tasks, zip the `jobs\u002F` directory (which contains the `reward.json` + `trajectory.json` + `result.json` per trial) and follow the submission flow at [agenticvbench.com](https:\u002F\u002Fagenticvbench.com\u002F) — that page has the email template + a Google Drive link prompt. Reviewers verify that every task in the suite has an intact trajectory and that scores fall in `[0, 1]`, then publish to the [leaderboard](https:\u002F\u002Fagenticvbench.com\u002Fleaderboard).\n\n---\n\n## ⚙️ Supported executors\n\nAny executor that [Harbor](https:\u002F\u002Fwww.harborframework.com\u002F) supports works — pass `-e \u003Cexecutor>` to `harbor run` (or `.\u002Favb run`). 
Common picks:\n\n| Executor | Use when | Required env |\n|---|---|---|\n| `docker` | Local sanity checks, single-task debugging | none |\n| `modal` | Large parallel runs across the suite | `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET` |\n| `daytona` | Cloud sandboxes (alternative to Modal) | `DAYTONA_API_KEY` |\n| `e2b` | Sandbox-as-a-service for code execution | `E2B_API_KEY` |\n| `runloop` | Long-running cloud workspaces | `RUNLOOP_API_KEY` |\n\nPlus `apple_container`, `gke`, `singularity`, `tensorlake`, and anything else Harbor adds — see Harbor's `--env` choices in `harbor run -h` for the live list.\n\nAll task materials are hosted on Hugging Face under [`ameddserM\u002Fagentic_vbench_video_*`](https:\u002F\u002Fhuggingface.co\u002FameddserM) and baked into each task's Docker image at build time, so the same image runs on any executor without provider-specific configuration.\n\n---\n\n## 📁 Repo layout\n\n```\nagentic-vbench\u002F\n├── tasks\u002F                              # 100 Harbor task directories\n│   ├── agentic_vbench_repair\u002F          # 18 repair tasks\n│   ├── agentic_vbench_assembly\u002F        # 18 assembly tasks\n│   ├── agentic_vbench_sequencing\u002F      # 28 sequencing tasks\n│   └── agentic_vbench_repurpose\u002F       # 36 repurpose tasks\n├── scripts\u002F\n│   ├── install-harbor.sh               # Harbor CLI pin\n│   ├── parallel_rollout.py             # batched rollout + reward collection\n│   ├── monitor_job.py                  # tail a running trial\n│   └── _task_paths.py                  # task-name → path resolver\n├── docs\u002FVERIFIER_DESIGN.md             # per-family scoring math\n└── README.md, LICENSE, AGENTS.md\n```\n","AgenticVBench is a 100-task benchmark for evaluating how AI agents perform on real-world video post-production workflows, covering four task families: Assembly, Repair, Sequencing, and Repurpose. The project is built on the Harbor framework, which handles agent installation, sandboxed execution, concurrency, and trial scoring. Each task is designed by industry professionals with an average of 6 years of experience and is scored on a 0–1 scale using programmatic verifiers and rubric-based LLM judges. Notably, Repurpose tasks require GEMINI_API_KEY and ANTHROPIC_API_KEY
to call external APIs for judging. AgenticVBench is intended for researchers, developers, and individuals or teams interested in applying AI to video editing, helping them better understand and improve the capabilities of AI agents.",2,"2026-06-11 04:02:25","CREATED_QUERY"]