[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80997":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":15,"starSnapshotCount":15,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},80997,"bench-loop","outsourc-e\u002Fbench-loop","outsourc-e","Local-first CLI for benchmarking LLMs on real hardware — quality, speed, reliability, and a real multi-turn agent loop.","https:\u002F\u002Fbench-loop.com",null,"Python",33,6,1,0,3,2.54,"MIT License",false,"main",true,[23,24,25,26,27,28,29,30,31],"agent","benchmark","cli","evaluation","llm","local-llm","mlx","ollama","vllm","2026-06-12 02:04:09","# BenchLoop\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Foutsourc-e\u002Fbench-loop-web\u002Fmain\u002Fsite\u002Fpublic\u002Fog-image.png\" alt=\"BenchLoop\" width=\"640\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fbench-loop.com\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsite-bench--loop.com-2dd47f?style=flat-square\" alt=\"site\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fbenchloop-cli\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fbenchloop-cli?style=flat-square&color=2dd47f\" alt=\"pypi\" \u002F>\u003C\u002Fa>  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Foutsourc-e\u002Fbench-loop\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-2dd47f?style=flat-square\" alt=\"MIT\" \u002F>\u003C\u002Fa>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fstatus-beta-eab308?style=flat-square\" alt=\"beta\" \u002F>\n\u003C\u002Fp>\n\n**Benchmark local LLMs by what actually matters.**\n\nBenchLoop is a local-first CLI + web app for benchmarking LLMs running on your own hardware or cloud providers. It runs seven repeatable suites (speed, tool calling, coding, data extraction, instruction following, reasoning and math, and multi-turn agentic tool use), rolls the results into quality, speed, and reliability scores, and gives you receipts: per-task outputs, latency, token counts, machine info, scores.\n\nNo accounts, no telemetry. Local models need no API keys; cloud providers use standard OpenAI-compatible auth. Your model, your machine (or your provider), your numbers.\n\n```\n$ benchloop run --model qwen3:8b --suites speed,toolcall,agent\n... 8 tasks, 4 tools, 6 turns avg, 74.6 tok\u002Fs ...\n\nOverall  73.4  ████████░░\nQuality  73.6  ████████░░\nSpeed    78.9  █████████░\nAgent    96.9  █████████▌\n```\n\nPublished runs live at \u003Chttps:\u002F\u002Fbench-loop.com\u002Fleaderboard>. Every completed local benchmark auto-publishes there.\n\n## Why\n\nHosted LLM leaderboards answer *\"which model wins on a server farm someone else paid for?\"* BenchLoop answers *\"which model + harness + hardware combination actually works for me right now?\"*, which is the question you have when picking a local stack.\n\nIt is repeatable on purpose: every run persists to disk, the task set is frozen, the scorer is deterministic. If you say \"qwen3:8b scored 89 on my 4090\", anyone can install BenchLoop and verify it.\n\n## Install\n\n### pipx (recommended)\n\n```bash\npipx install benchloop-cli\nbenchloop --version\n```\n\n> The PyPI distribution is named `benchloop-cli` (the bare `benchloop` name was taken by an unrelated dataset library). 
The installed commands are still `benchloop` and `bench-loop`.\n\n### pip\n\n```bash\npip install benchloop-cli\n```\n\n### From source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Foutsourc-e\u002Fbench-loop\ncd bench-loop\npip install -e .\n```\n\n## Run your first benchmark\n\nMake sure you have a local LLM endpoint running. Anything OpenAI-compatible or Ollama-flavored works:\n\n- Ollama at `http:\u002F\u002Flocalhost:11434` (default)\n- LM Studio at `http:\u002F\u002Flocalhost:1234` (`--provider openai_compat`)\n- MLX \u002F Osaurus at `http:\u002F\u002Flocalhost:8000` (`--provider openai_compat`)\n- vLLM, Jan, llama-server, etc.\n\nThen:\n\n```bash\nbenchloop run \\\n  --model qwen3:8b \\\n  --endpoint http:\u002F\u002Flocalhost:11434 \\\n  --provider ollama\n```\n\nThis runs every default suite, scores them, prints a console report, and persists the full run to `~\u002F.bench-loop\u002Fruns\u002F`.\n\n### Run a subset\n\n```bash\nbenchloop run --model qwen3:8b --suites speed,agent\n```\n\n### Different prompting harness\n\nSame model, four ways to talk to it:\n\n```bash\nbenchloop run --model qwen3:8b --harness raw      # native tool calling\nbenchloop run --model qwen3:8b --harness hermes   # \u003Ctool_call>{...}\u003C\u002Ftool_call>\nbenchloop run --model qwen3:8b --harness qwen     # \u003Cfunction_call>{...}\u003C\u002Ffunction_call>\nbenchloop run --model qwen3:8b --harness pi       # \u003Cthink>...\u003C\u002Fthink> + Hermes tags\n```\n\n### Stamp custom hardware (e.g. 
when benchmarking through a tunnel)\n\n```bash\nbenchloop run \\\n  --model qwen3:8b \\\n  --endpoint http:\u002F\u002Flocalhost:11435 \\\n  --hardware \"NVIDIA RTX 4090 24GB\" \\\n  --gpu \"NVIDIA RTX 4090\" \\\n  --gpu-memory-gb 24\n```\n\n### Benchmark cloud\u002Fremote APIs\n\nWorks with any OpenAI-compatible endpoint: DashScope, OpenRouter, Together, OpenAI, vLLM with auth, sglang, etc.\n\n```bash\n# Via environment variable\nexport OPENAI_API_KEY=\"sk-...\"\nbenchloop run \\\n  --model qwen3.7-max \\\n  --provider openai_compat \\\n  --endpoint https:\u002F\u002Fdashscope-intl.aliyuncs.com\u002Fcompatible-mode \\\n  --remote\n\n# Or inline\nbenchloop run \\\n  --model gpt-4o \\\n  --provider openai_compat \\\n  --endpoint https:\u002F\u002Fapi.openai.com\u002Fv1 \\\n  --api-key sk-... \\\n  --remote\n```\n\nThe `--remote` flag (auto-detected for non-localhost endpoints) switches to cloud-aware scoring:\n- **Speed** uses streaming TTFT (time-to-first-token) + effective content tok\u002Fs\n- **Overall** = 0.50·quality + 0.25·speed + 0.25·reliability (vs local's 0.55\u002F0.20\u002F0.25)\n- Reasoning models: content tok\u002Fs excludes internal thinking tokens\n\n### API key auth\n\nRequired for auth-enabled vLLM or sglang servers and for most cloud providers. Two ways to provide it:\n\n```bash\n# 1. Environment variable (recommended)\nexport OPENAI_API_KEY=\"your-key-here\"\nbenchloop run --model your-model --provider openai_compat --endpoint http:\u002F\u002Fyour-server:8000\n\n# 2. CLI flag\nbenchloop run --model your-model --provider openai_compat --endpoint http:\u002F\u002Fyour-server:8000 --api-key your-key-here\n```\n\nThe CLI flag takes precedence over the env var. For Ollama and local providers without auth, neither is needed.\n\n### Launch the local dashboard\n\nv0.2.0+ ships the full FastAPI + React dashboard inside the wheel. 
After `pipx install benchloop-cli`:\n\n```bash\nbenchloop dashboard\n# → open http:\u002F\u002F127.0.0.1:8877\n```\n\nNeed it to survive browser\u002Fterminal churn? Print a service template instead of keeping the dashboard tied to one shell:\n\n```bash\nbenchloop dashboard --service-template launchd\nbenchloop dashboard --service-template systemd\nbenchloop dashboard --service-template windows-task\n```\n\nThis serves the Models, Benchmark, Leaderboard, Compare, and Chat tabs on a single port, with auto-discovered local providers (Ollama, LM Studio, MLX\u002FOsaurus, vLLM, Jan).\n\nFor hot-reload development against a clone of [`bench-loop-web`](https:\u002F\u002Fgithub.com\u002Foutsourc-e\u002Fbench-loop-web):\n\n```bash\nbenchloop dashboard --dev\n```\n\n## Suites\n\n| Suite | What it scores |\n|---|---|\n| `speed` | Latency, throughput, TTFT, generation tok\u002Fs across short\u002Fmedium\u002Flong contexts |\n| `toolcall` | Structured tool-call correctness across realistic tasks (weather, stocks, email, search) |\n| `coding` | Executable Python tasks verified in a sandboxed subprocess (10s timeout) |\n| `dataextract` | JSON \u002F structured extraction from messy natural language |\n| `instructfollow` | Constraint following, formatting, exactness |\n| `reasonmath` | Small reasoning + math tasks with deterministic checks |\n| `agent` | **Multi-turn agentic tool use.** BenchLoop drives a real loop: model emits a tool call, BenchLoop executes it locally, feeds the result back, model iterates until done. Scores correctness, efficiency, no-hallucination, required-tool coverage. 
|\n\n## Scoring\n\n```\nLocal:  Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability\nCloud:  Overall = 0.50 · quality + 0.25 · speed + 0.25 · reliability  (with streaming speed data)\n        Overall = 0.65 · quality + 0.35 · reliability                   (no speed data)\n```\n\n- **Quality** = mean of non-speed suite scores (size-fair).\n- **Speed (local)** = `12.54 · log2(tok\u002Fs) + 0.9`, clamped to 0–100.\n- **Speed (cloud)** = 0.60 · TTFT_score + 0.40 · tok\u002Fs_score, where TTFT uses exponential decay (200ms→100, 2000ms→40) and tok\u002Fs uses a log curve calibrated for 20-150 tok\u002Fs.\n- **Reliability** = pass rate across all tasks.\n- **Agent** = `correct_final + efficient + no_hallucinated_tools + all_required_called`, 25 pts each, averaged across tasks.\n\n## Local web app\n\nA FastAPI backend + React frontend bundle ships alongside the CLI for visualizing runs:\n\n```bash\nbenchloop dashboard   # starts the local web app on :5180\n```\n\nTabs: Models, Benchmark, Leaderboard, Compare runs, Chat, agent trace viewer.\n\n## Publish a run\n\nEvery completed benchmark auto-publishes to \u003Chttps:\u002F\u002Fbench-loop.com\u002Fleaderboard> via `https:\u002F\u002Fapi.bench-loop.com\u002Fsubmit`. 
Runs are deduped by `(machine_id, run_id)` so the same run from the same machine won't be double-counted.\n\nOpt out:\n\n```bash\nexport BENCHLOOP_NO_SUBMIT=1\n```\n\nYou can still manually export a snapshot for sharing \u002F archiving:\n\n```bash\nbenchloop export --output my-runs.json\n```\n\n## Architecture\n\n```\nbench-loop\u002F                    ← this repo, the CLI + suites + scorers\n  bench_loop\u002F\n    cli.py                     ← `benchloop` entrypoint\n    suites\u002F                    ← speed, toolcall, coding, agent, ...\n    harness.py                 ← raw \u002F hermes \u002F qwen \u002F pi adapters\n    providers\u002F                 ← ollama, openai_compat\n    runner\u002Forchestrator.py     ← drives suites + harnesses\n    tasks\u002F                     ← frozen task YAML fixtures\nbench-loop-web\u002F                ← the web app (separate repo)\n  api\u002F                         ← FastAPI wrapper around bench_loop\n  ui\u002F                          ← local dashboard\n  site\u002F                        ← public bench-loop.com static site\n```\n\n## Status\n\nBenchLoop is **v0.2 beta**. The benchmark surface, scoring, web app, agent loop, four harnesses, and cloud provider support all work end-to-end. Stuff still on the roadmap:\n\n- ~~Streaming TTFT for OpenAI-compatible providers~~ ✅ (v0.2.3+ with `--remote`)\n- Bigger task fixtures (each suite is intentionally small and frozen for v1)\n- Hosted submission flow for community runs\n- Cloud-specific leaderboard on bench-loop.com (filter by local vs remote)\n- More provider adapters (TGI, Bedrock, etc. if there's demand)\n\n## License\n\nMIT. See `LICENSE`.\n","BenchLoop is a local-first command-line tool for benchmarking large language models (LLMs) on real hardware; it evaluates key aspects of model performance such as quality, speed, reliability, and a multi-turn agent loop. Written in Python, the project provides a comprehensive benchmark suite that scores models across dimensions including quality, speed, reliability, tool calling, coding, and instruction following, and records per-task outputs, latency, token counts, machine information, and scores. It suits developers and researchers who need to pick the right local or cloud LLM setup for their real-world use case; tests run without registering an account or sending telemetry, and results are automatically published to the online leaderboard.",2,"2026-06-11 04:03:08","CREATED_QUERY"]