[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1100":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":7,"htmlUrl":7,"language":8,"languages":7,"totalLinesOfCode":7,"stars":9,"forks":10,"watchers":11,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":14,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":16,"rankGlobal":7,"rankLanguage":7,"license":7,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":7,"pushedAt":7,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},1100,"llm-as-a-verifier","llm-as-a-verifier\u002Fllm-as-a-verifier",null,"Python",402,39,367,1,0,3,32,4.81,false,"main",[],"2026-06-12 02:00:23","# LLM-as-a-Verifier: A General-Purpose Verification Framework\n\nLLM-as-a-Verifier is a general-purpose verification framework that provides fine-grained feedback by scaling scoring granularity, repeated verification, and criteria decompositions. 
It achieves state-of-the-art performance on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) when used as a trajectory reward model for test-time scaling.\n\n## Setup\n\n```bash\npip install google-genai tqdm\n```\n\nCreate a `.env` file with your Vertex AI API key (required for logprob extraction):\n\n```bash\necho \"VERTEX_API_KEY=your_key_here\" > .env\n```\n\n## Directory Structure\n\n```\n.\n  README.md\n  .env                          # API key (create this)\n  scripts\u002F\n    verifier_core.py            # Gemini setup + scoring\n    run_terminal_bench.py       # Terminal-Bench Evaluation\n    run_swe_bench.py            # SWE-bench Verified Evaluation\n  data\u002F\n    terminal_trajs\u002F             # 5 trajectories x 89 tasks each for Terminal-Bench 2.0\n    swebench_verified_trajs\u002F    # 3 trajectories x 500 tasks each for SWE-bench Verified\n  cache\u002F                        # Cached API results (created on first run)\n    cache_terminal_\u003Cagent>.json\n    cache_swebench.json\n  results\u002F                      # Final result tables (written after each run)\n    terminal_\u003Cagent>.txt\n    swebench_verified.txt\n```\n\n## Trajectories\n\n`data\u002Fterminal_trajs\u002Fforge_gpt54\u002F` contains the Forge + GPT-5.4\nsubmission downloaded from the\n[Terminal-Bench 2 Leaderboard](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fharborframework\u002Fterminal-bench-2-leaderboard\u002Ftree\u002Fmain\u002Fsubmissions\u002Fterminal-bench\u002F2.0),\nwith 5 trajectories per task across 89 tasks:\n\n| Scaffold | Base Model | Pass@1 |\n|---|---|---|\n| Forge | GPT-5.4 | 81.8% |\n\n`data\u002Fswebench_verified_trajs\u002F` contains 3 runs for SWE-bench\nVerified (500 instances each) downloaded from the\n[SWE-bench Leaderboard](https:\u002F\u002Fgithub.com\u002Fswe-bench\u002Fexperiments?tab=readme-ov-file#):\n\n| Scaffold | Base Model | Pass@1 |\n|---|---|---|\n| mini-swe-agent | Claude-Opus-4.5 (high reasoning) | 76.8% |\n| 
mini-swe-agent | Claude-Opus-4.6 | 75.6% |\n| mini-swe-agent | Gemini-3-Flash (high reasoning) | 75.8% |\n\n## Evaluating LLM-as-a-Verifier\n\n### Terminal-Bench\n\n```bash\npython scripts\u002Frun_terminal_bench.py --granularity 20 --n-verifications 4 --criteria 3\n```\n\nExpected:\n\n| Method | Score | Rate |\n|---|---|---|\n| Pass@1 | 72.8\u002F89 | 81.8% |\n| LLM-as-a-Verifier | 76.9±0.3\u002F89 | **86.4%** |\n| Oracle (Bo5) | 80\u002F89 | 89.9% |\n\n### SWE-bench Verified\n\n```bash\npython scripts\u002Frun_swe_bench.py --granularity 20 --n-verifications 4 --criteria 3\n```\n\nExpected:\n\n| Method | Score | Rate |\n|---|---|---|\n| Pass@1 | 380.3\u002F500 | 76.1% |\n| LLM-as-a-Verifier | 389.0±0.4\u002F500 | **77.8%** |\n| Oracle (Bo3) | 422\u002F500 | 84.4% |\n\n## How it works\n\nRather than reducing each score distribution to a single discrete score (as in LLM-as-a-Judge), **LLM-as-a-Verifier** approximates the reward of\na trajectory $\\tau$ on task $t$ as:\n\n$$\nR(t, \\tau)\n= \\frac{1}{CK} \\sum_{c=1}^{C} \\sum_{k=1}^{K}\n\\sum_{g=1}^{G} p_{\\theta}(v_g \\mid t, c, \\tau)\\,\\phi(v_g)\n$$\n\n**Where:**\n\n- $C$ = number of evaluation criteria\n- $K$ = number of repeated verifications\n- $G$ = number of score tokens (granularity level)\n- $p_{\\theta}(v_g \\mid t, c, \\tau)$ = probability assigned by model $\\theta$ to score token $v_g$\n- $\\phi(v_g)$ = scalar value that $\\phi$ assigns to score token $v_g$\n- $V_{\\text{score}} = \\{v_1, \\ldots, v_G\\}$ = ordered set of discrete score tokens\n\nTo pick the best trajectory among $N$\ncandidates for a given task, we run a round-robin tournament. For every\npair $(i, j)$, the verifier produces $R(t, \\tau_i)$ and $R(t, \\tau_j)$\nusing the formula above. 
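The trajectory with the higher reward\nreceives a win, and the trajectory with the most\nwins across all $\\binom{N}{2}$ pairs is selected.\n","LLM-as-a-Verifier is a general-purpose verification framework that provides detailed feedback by refining scoring granularity, repeating verification, and decomposing criteria. The project is written in Python; its core functionality is fine-grained scoring of task trajectories, and it achieves leading performance on Terminal-Bench 2 and SWE-Bench Verified. It suits scenarios that require verifying and evaluating task trajectories generated by large language models, such as code generation and text generation. Users can run the project's scripts with simple command-line operations to evaluate how different models perform on specific benchmarks.",2,"2026-06-11 02:41:36","CREATED_QUERY"]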