[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78610":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},78610,"harbor","harbor-framework\u002Fharbor","harbor-framework","Harbor is a framework for running agent evaluations and creating and using RL environments.","https:\u002F\u002Fharborframework.com\u002F",null,"Python",2398,1134,14,135,0,73,187,317,219,111.16,"Apache License 2.0",false,"main",true,[27,28,29],"evals","rl-environments","terminal-bench","2026-06-12 04:01:23","# Harbor\n\n [![](https:\u002F\u002Fdcbadge.limes.pink\u002Fapi\u002Fserver\u002Fhttps:\u002F\u002Fdiscord.gg\u002F6xWPKhGDbA)](https:\u002F\u002Fdiscord.gg\u002F6xWPKhGDbA)\n[![Docs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocs-000000?style=for-the-badge&logo=mdbook&color=105864)](https:\u002F\u002Fharborframework.com\u002Fdocs)\n[![Cookbook](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCookbook-000000?style=for-the-badge&logo=mdbook&color=105864)](https:\u002F\u002Fgithub.com\u002Fharbor-framework\u002Fharbor-cookbook)\n\n\n\nHarbor is a framework from the creators of [Terminal-Bench](https:\u002F\u002Fwww.tbench.ai) for evaluating and optimizing agents and language models. 
You can use Harbor to:\n\n- Evaluate arbitrary agents like Claude Code, OpenHands, Codex CLI, and more.\n- Build and share your own benchmarks and environments.\n- Run experiments across thousands of environments in parallel through providers such as Daytona and Modal.\n- Generate rollouts for RL optimization.\n\nCheck out the [Harbor Cookbook](https:\u002F\u002Fgithub.com\u002Fharbor-framework\u002Fharbor-cookbook) for end-to-end examples and guides.\n\n## Installation\n\n```bash tab=\"uv\"\nuv tool install harbor\n```\nor\n```bash tab=\"pip\"\npip install harbor\n```\n\n## Example: Running Terminal-Bench-2.0\n\nHarbor is the official harness for [Terminal-Bench-2.0](https:\u002F\u002Fgithub.com\u002Flaude-institute\u002Fterminal-bench-2):\n\n```bash\nexport ANTHROPIC_API_KEY=\u003CYOUR-KEY>\nharbor run --dataset terminal-bench@2.0 \\\n   --agent claude-code \\\n   --model anthropic\u002Fclaude-opus-4-1 \\\n   --n-concurrent 4\n```\n\nThis launches the benchmark locally using Docker. 
To run it on a cloud provider such as Daytona, pass the `--env` flag, as shown below:\n\n```bash\nexport ANTHROPIC_API_KEY=\u003CYOUR-KEY>\nexport DAYTONA_API_KEY=\u003CYOUR-KEY>\nharbor run --dataset terminal-bench@2.0 \\\n   --agent claude-code \\\n   --model anthropic\u002Fclaude-opus-4-1 \\\n   --n-concurrent 100 \\\n   --env daytona\n```\n\nTo see all supported agents and other options, run:\n\n```bash\nharbor run --help\n```\n\nTo list all supported third-party benchmarks (such as SWE-Bench and Aider Polyglot), run:\n\n```bash\nharbor datasets list\n```\n\nTo evaluate an agent and model on one of these datasets, use the following command:\n\n```bash\nharbor run -d \"\u003Cdataset@version>\" -m \"\u003Cmodel>\" -a \"\u003Cagent>\"\n```\n\n## Citation\n\nIf you use **Harbor** in academic work, please cite it using the “Cite this repository” button on GitHub or the following BibTeX entry:\n\n```bibtex\n@software{Harbor_Framework,\nauthor = {{Harbor Framework Team}},\nmonth = jan,\ntitle = {{Harbor: A framework for evaluating and optimizing agents and models in container environments}},\nurl = {https:\u002F\u002Fgithub.com\u002Fharbor-framework\u002Fharbor},\nyear = {2026}\n}\n```\n","Harbor is a framework for evaluating and optimizing agents and language models. Its core capabilities include, but are not limited to: evaluating arbitrary agents such as Claude Code and OpenHands; building and sharing custom benchmarks and environments; running experiments in parallel across thousands of environments through providers such as Daytona and Modal; and generating rollouts for reinforcement learning optimization. The project is written in Python, has solid community support (2176 stars on GitHub), and is released under the Apache License 2.0. Harbor is especially well suited to research scenarios that require large-scale testing or comparison of AI models, whether run locally or on cloud resources.",2,"2026-06-11 03:56:58","high_star"]