[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79673":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},79673,"deep-swe","datacurve-ai\u002Fdeep-swe","datacurve-ai","Measuring frontier coding agents on original, long-horizon engineering tasks","https:\u002F\u002Fdeepswe.datacurve.ai\u002F",null,"Shell",763,40,5,27,0,18,156,525,98,8.84,false,"main",true,[],"2026-06-12 02:03:54","# [DeepSWE](https:\u002F\u002Fdeepswe.datacurve.ai\u002F)\n\nDeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks drawn from active open-source repositories. 
The benchmark includes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, with isolated environments and program-based verifiers.\n\n## Task format\n\nDeepSWE tasks use the [Harbor](https:\u002F\u002Fwww.harborframework.com\u002Fdocs\u002Ftasks) task format:\n\n```text\ntask.toml         Metadata: repository, base commit, language, prebuilt image, resource limits\ninstruction.md    The prompt the agent sees\nenvironment\u002F      Dockerfile that reproduces the prebuilt image (fallback if the image is unavailable)\ntests\u002F            Verifier: test.sh (entry point) + test.patch (test additions, applied at grading time)\nsolution\u002F         Reference solution (held out from the agent; for human and AI reviewers)\n```\n\nThe verifier exercises the behavior the prompt describes. It accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure.\nThe reference patch in `solution\u002F` is never used at grading time; it exists so reviewers can spot-check correctness offline.\n\n## Quickstart\n\nUse [Pier](https:\u002F\u002Fgithub.com\u002Fdatacurve-ai\u002Fpier) to run the benchmark:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdatacurve-ai\u002Fdeep-swe\nuv tool install datacurve-pier\n\n# Claude Opus 4.7 via mini-swe-agent\nexport ANTHROPIC_API_KEY=...\npier run -p deep-swe\u002Ftasks --agent mini-swe-agent --model anthropic\u002Fclaude-opus-4-7\n\n# GPT-5.5 via mini-swe-agent\nexport OPENAI_API_KEY=...\npier run -p deep-swe\u002Ftasks --agent mini-swe-agent --model openai\u002Fgpt-5.5\n```\n\n## What is Pier\n\n[Pier](https:\u002F\u002Fgithub.com\u002Fdatacurve-ai\u002Fpier) is a [Harbor](https:\u002F\u002Fwww.harborframework.com\u002Fdocs\u002Ftasks)-compatible framework for sandboxed coding-agent evals. It began as a fork of Harbor to support CLI agents in air-gapped tasks: Harbor blocks all outbound traffic in `allow_internet = false` tasks, including dependency installs and LLM API calls. 
Pier adds per-agent network allowlists, giving agents only the network access they need while keeping the task environment isolated.\n\nPier also adds more complete trajectory metadata, a better trajectory viewer, and `pier critique run` for analyzing agent trajectories. All leaderboard scores were produced with Pier running `mini-swe-agent` on Modal.\n\n### Agents and models\n\n`mini-swe-agent` is model-agnostic. Pier also drives `claude-code`, `codex`, `gemini-cli`, and `opencode` directly. Pass `--env modal` to run in parallel sandboxes on Modal.\n\n### Subsets and single tasks\n\nDeterministic random subset of the 113-task corpus:\n\n```bash\npier run -p deep-swe\u002Ftasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0\n```\n\nSingle task:\n\n```bash\npier run -p deep-swe\u002Ftasks\u002F\u003Ctask-id> --agent mini-swe-agent\n```\n","DeepSWE is a benchmark for evaluating frontier coding agents on real-world, long-horizon software engineering tasks. It covers 113 tasks drawn from active open-source repositories across five languages (TypeScript, Go, Python, JavaScript, and Rust), and each task comes with an isolated environment and a program-based verifier. Task details are defined in the Harbor task format, including metadata, instructions, environment configuration, test scripts, and a reference solution. It suits settings that need standardized evaluation of AI coding ability, such as research institutions, education, or internal technical assessments at companies. The benchmark can be run with the Pier framework, which supports several mainstream coding agents such as Claude Code and Codex.",2,"2026-06-11 03:58:12","CREATED_QUERY"]