[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77808":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":8,"languages":8,"totalLinesOfCode":8,"stars":9,"forks":10,"watchers":11,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":13,"starSnapshotCount":13,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},77808,"Awesome-Code-as-Agent-Harness-Papers","YennNing\u002FAwesome-Code-as-Agent-Harness-Papers","YennNing",null,391,28,6,4,0,46,60,272,138,4.39,"MIT License",false,"main",true,[],"2026-06-12 02:03:44","# Awesome Code as Agent Harness Papers\n\n[![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.18747-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18747)\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-code--as--harness.github.io-1f6feb?logo=googlechrome&logoColor=white)](https:\u002F\u002Fcode-as-harness.github.io\u002Fcode-as-harness-webpage\u002F)\n[![HF #1 Paper of the Day](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HF-%231%20Paper%20of%20the%20Day-FFD21E)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.18747)\n[![@_akhaliq](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%40__akhaliq-6366F1?logo=x&logoColor=white&labelColor=000000)](https:\u002F\u002Fx.com\u002F_akhaliq\u002Fstatus\u002F2056900568921133565?s=20)\n![Visitors](https:\u002F\u002Fvisitor-badge.laobi.icu\u002Fbadge?page_id=YennNing.Awesome-Code-as-Agent-Harness-Papers)\n\nThis repository accompanies the survey [**Code 
as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18747).\nWe study the emerging role of code in agentic AI: code is no longer only a generated artifact, but increasingly serves as an executable, inspectable, and stateful harness through which agents reason, act, model environments, receive feedback, and coordinate. The repository organizes representative papers around three connected layers: **Harness Interface**, **Harness Mechanisms**, and **Scaling the Harness**, covering directions such as coding assistants, GUI\u002FOS automation, scientific discovery, and embodied intelligence.\n\n> [!TIP]\n> 👋 We welcome paper suggestions, pull requests, and collaborations on code as agent harness. Please contact us at `xuyingn2@illinois.edu`, `kt42@illinois.edu`, `twei10@illinois.edu`, `zihaoli5@illinois.edu`, and `bei4@illinois.edu`. We will keep updating this repository with recent work on code-centric agentic systems and harness engineering.\n\n> [!NOTE]\n> 📚 If you find this resource useful, please cite and [![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYennNing\u002FAwesome-Code-as-Agent-Harness-Papers?style=social)](https:\u002F\u002Fgithub.com\u002FYennNing\u002FAwesome-Code-as-Agent-Harness-Papers) the repo:\n>\n>\n> ```bibtex\n> @article{ning2026codeasharness,\n>   title   = {Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems},\n>   author  = {Ning, Xuying and Tieu, Katherine and Fu, Dongqi and Wei, Tianxin and Li, Zihao and Bei, Yuanchen and others},\n>   journal = {arXiv preprint arXiv:2605.18747},\n>   year    = {2026}\n> }\n> ```\n\n![Framework overview](figs\u002Foverview.png)\n\n## 🔔 News\n\n**[2026-05]** 🚀 Our survey ***Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems*** is available on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18747). 
Slides and project page links will be added here once available.\n\n## 📋 Table of Contents\n\n- [🔔 News](#-news)\n- [📋 Table of Contents](#-table-of-contents)\n- [🧩 Harness Interface](#-harness-interface)\n  - [💭 Code for Reasoning](#-code-for-reasoning)\n  - [🤖 Code for Acting](#-code-for-acting)\n  - [🌍 Code for Environment Modeling](#-code-for-environment-modeling)\n- [🛠️ Harness Mechanisms](#%EF%B8%8F-harness-mechanisms)\n  - [🗺️ Planning for Code Agents](#%EF%B8%8F-planning-for-code-agents)\n  - [🧠 Memory and Context Engineering](#-memory-and-context-engineering)\n  - [🔧 Tool Usage for Code Agents](#-tool-usage-for-code-agents)\n  - [🧪 Feedback-Guided Iterative Debugging](#-feedback-guided-iterative-debugging)\n- [👥 Scaling the Harness: Multi-Agent Code-Centric Systems](#-scaling-the-harness-multi-agent-code-centric-systems)\n  - [🎭 Functional Role Specialization](#-functional-role-specialization)\n  - [💬 Interaction Modes](#-interaction-modes)\n  - [🕸️ Workflow Topology](#%EF%B8%8F-workflow-topology)\n  - [⚡ Execution Feedback Integration](#-execution-feedback-integration)\n  - [🔄 Shared-Harness Synchronization](#-shared-harness-synchronization)\n  - [🏛️ Shared Harness Representation](#%EF%B8%8F-shared-harness-representation)\n  - [🎯 Harness-State Convergence](#-harness-state-convergence)\n- [🚀 Applications and Emerging Fields](#-applications-and-emerging-fields)\n  - [💻 Code Assistants](#-code-assistants)\n  - [🖥️ GUI \u002F OS Agents](#%EF%B8%8F-gui--os-agents)\n  - [🔬 Scientific Discovery Agents](#-scientific-discovery-agents)\n  - [🤖 Autonomous Embodied Agents](#-autonomous-embodied-agents)\n\n---\n\n## 🧩 Harness Interface\n\nCode as the basic interface between a model and its task environment. 
Programs convert model outputs into executable, inspectable, and stateful structures: code makes reasoning *executable*, action *programmable*, and environment state *inspectable*.\n\n![Harness interface](figs\u002Fharness_interface.png)\n\n### 💭 Code for Reasoning\n\nPrograms externalize internal logic into verifiable computation, allowing interpreters, symbolic solvers, execution traces, or process rewards to check and refine intermediate steps.\n\n#### Program-Delegated Reasoning\n\n| Paper | Venue |\n| --- | --- |\n| [Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.12588) | TMLR 2023 |\n| [MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03731) | ICLR 2024 |\n| [Chain of Code: Reasoning with a Language Model-Augmented Code Emulator](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04474) | ICML 2024 |\n| [Method-Based Reasoning for Large Language Models: Extraction, Reuse, and Continuous Improvement](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04289) | arXiv 2025 |\n| [Code-Enabled Language Models Can Outperform Reasoning Models on Diverse Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20909) | arXiv 2025 |\n| [When Do Program-of-Thought Works for Reasoning?](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F29721) | AAAI 2024 |\n| [PAL: Program-aided Language Models](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Fgao23f.html) | ICML 2023 |\n| [Show Your Work: Scratchpads for Intermediate Computation with Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.00114) | arXiv 2021 |\n| [Reasoning Like Program Executors](https:\u002F\u002Faclanthology.org\u002F2022.emnlp-main.48\u002F) | EMNLP 2022 |\n| [Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual 
Environments](https:\u002F\u002Faclanthology.org\u002F2025.findings-acl.817\u002F) | ACL 2025 Findings |\n| [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=_VjQlMeSB_J) | NeurIPS 2022 |\n\n#### Hybrid Symbolic–Neural Execution\n\n| Paper | Venue |\n| --- | --- |\n| [Self-Verifying Reflection Helps Transformers with CoT Reasoning](https:\u002F\u002Fneurips.cc\u002Fvirtual\u002F2025\u002Fposter\u002F119948) | NeurIPS 2025 |\n| [SSR: Socratic Self-Refine for Large Language Model Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.10621) | arXiv 2025 |\n| [CodeSteer: Symbolic-Augmented Language Models via Code\u002FText Guidance](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04350) | ICML 2025 |\n| [Graph of Thoughts: Solving Elaborate Problems with Large Language Models](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F29720) | AAAI 2024 |\n| [Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01700) | IROS 2025 |\n\n#### Iterative Code-Grounded Reasoning\n\n| Paper | Venue |\n| --- | --- |\n| [NExT: Teaching Large Language Models to Reason about Code Execution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14662) | ICML 2024 |\n| [What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05703) | arXiv 2025 |\n| [Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15118) | ICML 2025 |\n| [CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18471) | arXiv 2025 |\n| [RLTF: Reinforcement Learning from Unit Test Feedback](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04349) | TMLR 2023 |\n| [RLEF: Grounding Code LLMs 
in Execution Feedback with Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02089) | ICML 2025 |\n| [Execution guided line-by-line code generation](https:\u002F\u002Fopenreview.net\u002Fforum?id=ySFDPoiANu) | NeurIPS 2025 |\n| [R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21668) | arXiv 2025 |\n| [CYCLE: Learning to Self-Refine the Code Generation](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Ffull\u002F10.1145\u002F3649825) | OOPSLA 2024 |\n| [StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.251\u002F) | ACL 2024 |\n| [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=WaGvb7OzySA) | NeurIPS 2022 |\n| [CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation](https:\u002F\u002Faclanthology.org\u002F2025.findings-acl.428\u002F) | ACL 2025 Findings |\n| [SatLM: Satisfiability-Aided Language Models Using Declarative Prompting](https:\u002F\u002Fopenreview.net\u002Fforum?id=8tt9KxyV2s) | NeurIPS 2023 |\n| [Self-Edit: Fault-Aware Code Editor for Code Generation](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.45\u002F) | ACL 2023 |\n\n### 🤖 Code for Acting\n\nGenerated programs serve as policies, tool calls, behavior trees, or reusable skills for embodied, GUI, software, and tool-use environments.\n\n#### Grounded Skill Selection\n\n| Paper | Venue |\n| --- | --- |\n| [Do As I Can, Not As I Say: Grounding Language in Robotic Affordances](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.01691) | CoRL 2022 |\n| [Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.01928) | CoRL 2023 |\n| [Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model 
Guidance](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10021) | CoRL 2023 |\n| [SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.03836) | arXiv 2026 |\n| [Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition](https:\u002F\u002Fproceedings.mlr.press\u002Fv229\u002Fha23a.html) | CoRL 2023 |\n| [Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10611448\u002F) | ICRA 2024 |\n\n#### Programmatic Policy Generation\n\n| Paper | Venue |\n| --- | --- |\n| [RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16117) | ICML 2024 |\n| [CP-Agent: Agentic Constraint Programming](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07468) | arXiv 2025 |\n| [LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.02002) | ICRA 2026 |\n| [NormCode: A Semi-Formal Language for Auditable AI Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.10563) | arXiv 2025 |\n| [ALRM: Agentic LLM for Robotic Manipulation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.19510) | arXiv 2026 |\n| [RACAS: Controlling Diverse Robots With a Single Agentic System](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.05621) | arXiv 2026 |\n| [ReAct: Synergizing Reasoning and Acting in Language Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=WE_vluYUL-X) | ICLR 2023 |\n| [GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs44182-025-00065-w) | npj Robotics 2026 |\n| [Code as Policies: Language Model Programs for Embodied Control](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10160591\u002F) | ICRA 2023 |\n| [Robotic Programmer: 
Video Instructed Policy Code Generation for Robotic Manipulation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04268) | arXiv 2025 |\n| [Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models](https:\u002F\u002Fwww.ijcai.org\u002Fproceedings\u002F2025\u002F980) | IJCAI 2025 |\n\n#### Lifelong Code-Based Agents\n\n| Paper | Venue |\n| --- | --- |\n| [Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.18597) | arXiv 2025 |\n| [ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24219) | arXiv 2025 |\n| [UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24533) | arXiv 2026 |\n| [Voyager: An Open-Ended Embodied Agent with Large Language Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=ehfRiF0R3a) | TMLR 2023 |\n| [Lifelong Language-Conditioned Robotic Manipulation Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.05160) | arXiv 2026 |\n\n### 🌍 Code for Environment Modeling\n\nProgram states, repositories, traces, simulators, and tests represent state, dynamics, and feedback signals for agent interaction.\n\n#### Structured World Representations\n\n| Paper | Venue |\n| --- | --- |\n| [From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries](https:\u002F\u002Fopenreview.net\u002Fforum?id=Ew8bJkSt3g) | NeurIPS 2025 |\n| [PoE-World: Compositional World Modeling with Products of Programmatic Experts](https:\u002F\u002Fopenreview.net\u002Fforum?id=obwRcksFZw) | NeurIPS 2025 |\n| [Code2World: A GUI World Model via Renderable Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.09856) | arXiv 2026 |\n| [Code2Worlds: Empowering Coding LLMs for 4D World 
Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.11757) | arXiv 2026 |\n| [ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.824\u002F) | EMNLP 2023 |\n\n#### Execution-Trace World Modeling\n\n| Paper | Venue |\n| --- | --- |\n| [SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01006) | NeurIPS 2024 |\n| [CWM: An Open-Weights LLM for Research on Code Generation with World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.02387) | arXiv 2025 |\n| [Reinforcement World Model Learning for LLM-based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.05842) | arXiv 2026 |\n| [Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10090) | arXiv 2026 |\n| [Aligning Agentic World Models via Knowledgeable Experience Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.13247) | arXiv 2026 |\n| [WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Ffile\u002F820c61a0cd419163ccbd2c33b268816e-Paper-Conference.pdf) | NeurIPS 2024 |\n\n#### Code-Grounded Evaluation Environments\n\n| Paper | Venue |\n| --- | --- |\n| [CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03065) | ICML 2024 |\n| [LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code](https:\u002F\u002Fopenreview.net\u002Fforum?id=chfJJYC3iL) | ICLR 2025 |\n| [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06770) | ICLR 2024 |\n| [AgentBench: Evaluating LLMs as Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.03688) | ICLR 
2024 |\n| [CoRe: Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks](https:\u002F\u002Fneurips.cc\u002Fvirtual\u002F2025\u002Fposter\u002F121601) | NeurIPS 2025 |\n| [GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17653) | arXiv 2025 |\n| [CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.13962) | arXiv 2026 |\n| [Endless Terminals: Scaling RL Environments for Terminal Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.16443) | arXiv 2026 |\n| [Reflexion: Language Agents with Verbal Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=vAElhFcKW6) | NeurIPS 2023 |\n| [CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.1158\u002F) | ACL 2025 |\n| [InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F4b175d846fb008d540d233c188379ff9-Abstract-Datasets_and_Benchmarks.html) | NeurIPS 2023 |\n\n## 🛠️ Harness Mechanisms\n\nOnce code is placed inside the agent loop, the harness must decide *what to execute next*, *preserve useful state*, *expose the right tools*, and *convert failures into corrective actions*.\n\n![Harness mechanisms](figs\u002Fharness_mechanism.png)\n\n### 🗺️ Planning for Code Agents\n\nPlanning is harness control: it structures how the agent externalizes intent into executable steps, schedules interactions with code artifacts and tools, and regulates the trajectory of reasoning, execution, and revision over time.\n\n#### Linear Decomposition Planning\n\n| Paper | Venue |\n| --- | --- |\n| [A Real-World WebAgent with Planning, Long Context Understanding, and Program 
Synthesis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.12856) | ICLR 2024 |\n| [ReAct: Synergizing Reasoning and Acting in Language Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=WE_vluYUL-X) | ICLR 2023 |\n| [Self-planning Code Generation with Large Language Models](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3672456) | TOSEM 2024 |\n| [Knowledge-Aware Code Generation with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.15940) | arXiv 2024 |\n| [PaT: Planning-after-Trial for Efficient Test-Time Code Generation](https:\u002F\u002Fopenreview.net\u002Fforum?id=767aZTpsIl) | 2025 |\n| [A Little Help Goes a Long Way: Tutoring LLMs in Solving Competitive Programming through Hints](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F11181219\u002F) | TSE 2025 |\n\n#### Structure-Grounded Planning\n\n| Paper | Venue |\n| --- | --- |\n| [RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.16198) | ICLR 2026 |\n| [Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16901) | arXiv 2025 |\n| [DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.21430) | AAMAS 2026 |\n| [CodePlan: Repository-Level Coding Using LLMs and Planning](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3643757) | FSE 2024 |\n| [LocAgent: Graph-Guided LLM Agents for Code Localization](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.426\u002F) | ACL 2025 |\n| [VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F32007) | AAAI 2025 |\n\n#### Search-Based Planning\n\n| Paper | Venue |\n| --- | --- |\n| [Planning 
in Natural Language Improves LLM Search for Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03733) | ICLR 2025 |\n| [Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15305) | ACL 2025 Findings |\n| [Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs](https:\u002F\u002Fopenreview.net\u002Fforum?id=sYk6ZMmrOz) | NeurIPS 2025 |\n| [Meta-Harness: End-to-End Optimization of Model Harnesses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.28052) | arXiv 2026 |\n| [DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.973\u002F) | ACL 2025 |\n| [Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002F6f479ea488e0908ac8b1b37b27fd134c-Abstract-Conference.html) | NeurIPS 2024 |\n| [CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models](https:\u002F\u002Faclanthology.org\u002F2025.naacl-long.189\u002F) | NAACL 2025 |\n| [RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation](https:\u002F\u002Faclanthology.org\u002F2025.emnlp-main.410\u002F) | EMNLP 2025 |\n| [SFS: Smarter Code Space Search Improves LLM Inference Scaling](https:\u002F\u002Fopenreview.net\u002Fforum?id=MCHuGOkExF) | ICLR 2025 |\n\n#### Orchestration-Based Planning\n\n| Paper | Venue |\n| --- | --- |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10737) | arXiv 2024 |\n| [CodeCoR: An LLM-based 
self-reflective multi-agent framework for code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07811) | arXiv 2025 |\n| [Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.03902) | arXiv 2025 |\n| [SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.23647) | arXiv 2026 |\n| [Requirements Development and Formalization for Reliable Code Generation: A Multi-Agent Vision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18675) | ASE 2025 |\n| [AlgoForge: Specializing Code Generation Agents through Collaborative Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=KwqbtKeaRl) | 2025 |\n| [MapCoder: Multi-Agent Code Generation for Competitive Problem Solving](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.269\u002F) | ACL 2024 |\n| [Blueprint2Code: a multi-agent pipeline for reliable code generation via blueprint planning and repair](https:\u002F\u002Fwww.frontiersin.org\u002Fjournals\u002Fartificial-intelligence\u002Farticles\u002F10.3389\u002Ffrai.2025.1660912\u002Ffull) | Frontiers in AI 2025 |\n| [AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3664647.3681010) | ACM MM 2024 |\n\n### 🧠 Memory and Context Engineering\n\nMemory in code-as-agent-harness systems is a state-management layer: which information stays in the active context, which is compacted, and which is offloaded to durable external storage.\n\n#### Working Memory\n\n| Paper | Venue |\n| --- | --- |\n| [On the Failure of Latent State Persistence in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10571) | arXiv 2025 |\n| [Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.13646) | arXiv 2025 |\n| [CodeMem: Architecting Reproducible 
Agents via Dynamic MCP and Procedural Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.15813) | arXiv 2025 |\n| [RepairAgent: An Autonomous, LLM-Based Agent for Program Repair](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1109\u002FICSE55347.2025.00157) | ICSE 2025 |\n| [Agentless: Demystifying LLM-based Software Engineering Agents](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3715754) | FSE 2025 |\n| [SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering](https:\u002F\u002Fopenreview.net\u002Fforum?id=mXpq6ut8J3) | NeurIPS 2024 |\n\n#### Semantic Memory\n\n| Paper | Venue |\n| --- | --- |\n| [From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15965) | arXiv 2025 |\n| [Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06052) | arXiv 2026 |\n| [AgentSM: Semantic Memory for Agentic Text-to-SQL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.15709) | arXiv 2026 |\n| [A Survey on Large Language Models for Code Generation](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3747588) | TOSEM 2026 |\n| [RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.151\u002F) | EMNLP 2023 |\n| [AutoCodeRover: Autonomous Program Improvement](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3650212.3680384) | ISSTA 2024 |\n| [CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.737\u002F) | ACL 2024 |\n| [A Survey on the Memory Mechanism of Large Language Model-Based Agents](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3748302) | TOIS 2025 |\n\n#### Experiential Memory\n\n| Paper | Venue |\n| --- | --- |\n| [Evo-Memory: Benchmarking LLM Agent Test-time 
Learning with Self-Evolving Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20857) | arXiv 2025 |\n| [MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.06789) | arXiv 2026 |\n| [Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13244) | arXiv 2024 |\n| [Towards Large Language Models with Human-Like Episodic Memory](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fabs\u002Fpii\u002FS1364661325001792) | Trends in Cognitive Sciences 2025 |\n| [Episodic Memories Generation and Evaluation Benchmark for Large Language Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=6ycX677p2l) | ICLR 2025 |\n| [ExpeL: LLM Agents Are Experiential Learners](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F29936) | AAAI 2024 |\n\n#### Long-Term Memory\n\n| Paper | Venue |\n| --- | --- |\n| [Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04257) | arXiv 2026 |\n| [Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.03515) | arXiv 2026 |\n| [MemGPT: Towards LLMs as Operating Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08560) | arXiv 2023 |\n| [Your Code Agent Can Grow Alongside You with Structured Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.13258) | arXiv 2026 |\n| [TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.23010) | arXiv 2025 |\n| [Memory OS of AI Agent](https:\u002F\u002Faclanthology.org\u002F2025.emnlp-main.1318\u002F) | EMNLP 2025 |\n| [Evaluating Very Long-Term Conversational Memory of LLM Agents](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.747\u002F) | ACL 2024 
|\n\n#### Multi-Agent Memory\n\n| Paper | Venue |\n| --- | --- |\n| [SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23348) | ICSE 2026 |\n| [GameGPT: Multi-agent Collaborative Framework for Game Development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08067) | arXiv 2023 |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [MIRIX: Multi-Agent Memory System for LLM-Based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.07957) | arXiv 2025 |\n| [Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02183) | arXiv 2024 |\n| [Compressing Code Context for LLM-based Issue Resolution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.28119) | arXiv 2026 |\n| [Scaling Long-Horizon LLM Agent via Context-Folding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.11967) | arXiv 2025 |\n| [LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.14337) | arXiv 2026 |\n| [SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?](https:\u002F\u002Fopenreview.net\u002Fforum?id=VTF8yNQM66) | ICLR 2024 |\n| [G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems](https:\u002F\u002Fopenreview.net\u002Fforum?id=mmIAp3cVS0) | NeurIPS 2025 |\n\n### 🔧 Tool Usage for Code Agents\n\nTool usage is the action and observation layer of the code-agent harness: agents search repositories, inspect files, edit code, run commands, execute tests, call APIs, and verify intermediate results — all under typed schemas, sandboxes, and lifecycle hooks.\n\n#### Function-Oriented Tool Use\n\n| Paper | Venue |\n| --- | --- |\n| [ToolCoder: Teach Code Generation Models to use API search 
tools](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.04032) | arXiv 2023 |\n| [CodeQA: Advanced Programming Question-Answering Using LLM Agent and RAG](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10753267) | IEEE TENCON 2024 |\n| [RAG-Based AI Agents for Enterprise Software Development: Implementation Patterns and Production Deployment](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F399509219_RAG-Based_AI_Agents_for_Enterprise_Software_Development_Implementation_Patterns_and_Production_Deployment) | 2025 |\n| [The Devil Is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10298393\u002F) | ASE 2023 |\n\n#### Environment-Interaction Tool Use\n\n| Paper | Venue |\n| --- | --- |\n| [Environment-in-the-Loop: Rethinking Code Migration with LLM-based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.09944) | arXiv 2026 |\n| [Test-Time Adaptation for LLM Agents via Environment Interaction](https:\u002F\u002Fopenreview.net\u002Fforum?id=OH4PE0TDo0) | ICLR 2026 |\n\n#### Verification-Driven Tool Use\n\n| Paper | Venue |\n| --- | --- |\n| [VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05156) | arXiv 2025 |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14209) | arXiv 2025 |\n\n#### Workflow-Orchestration Tool Use\n\n| Paper | Venue |\n| --- | --- |\n| [ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00839) | arXiv 2024 |\n| [ControlLLM: Augment Language Models with Tools by Searching on 
Graphs](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-031-73254-6_6) | ECCV 2024 |\n| [Agent Harness for Large Language Model Agents: A Survey](https:\u002F\u002Fwww.preprints.org\u002Fmanuscript\u002F202604.0428\u002Fv1) | Preprints 2026 |\n| [Executable Code Actions Elicit Better LLM Agents](https:\u002F\u002Fopenreview.net\u002Fforum?id=8oJyuXfrPv) | ICML 2024 |\n| [OpenHands: An Open Platform for AI Software Developers as Generalist Agents](https:\u002F\u002Fopenreview.net\u002Fforum?id=OJd3ayDDoF) | ICLR 2025 |\n| [On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3798166) | TOSEM 2025 |\n\n### 🧪 Feedback-Guided Iterative Debugging\n\nIterative debugging closes the harness loop: development environments expose feedback (compiler diagnostics, runtime errors, tests, critique), and the agent transforms these signals into diagnosis, revision, and progressively better debugging behavior.\n\n#### Development Environments for Agentic Coding\n\n##### Contextual Environments for Repository-Aware Generation\n\n| Paper | Venue |\n| --- | --- |\n| [On the Impacts of Contexts on Repository-Level Code Generation](https:\u002F\u002Faclanthology.org\u002F2025.findings-naacl.82\u002F) | NAACL 2025 Findings |\n| [A Survey on Model Context Protocol: Architecture, State-of-the-art, Challenges and Future Directions](https:\u002F\u002Fdoi.org\u002F10.36227\u002Ftechrxiv.174495492.22752319\u002Fv1) | TechRxiv 2025 |\n| [CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03910) | NAACL 2025 |\n| [RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation](https:\u002F\u002Faclanthology.org\u002F2024.emnlp-demo.46\u002F) | EMNLP 2024 (Demo) |\n| [Knowledge Graph Based Repository-Level Code 
Generation](http:\u002F\u002Fdx.doi.org\u002F10.1109\u002FLLM4Code66737.2025.00026) | LLM4Code@ICSE 2025 |\n| [From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03864) | arXiv 2025 |\n| [Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.04905) | arXiv 2026 |\n| [A³-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10734067\u002F) | TSE 2024 |\n\n##### Interactive Environments for Human–LLM Collaboration\n\n| Paper | Venue |\n| --- | --- |\n| [Conversational AI as a Coding Assistant: Understanding Programmers' Interactions with and Expectations from Large Language Models for Coding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16508) | arXiv 2025 |\n| [The Design Space of LLM-Based AI Coding Assistants: An Analysis of 90 Systems in Academia and Industry](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F11303497\u002F) | VL\u002FHCC 2025 |\n| [Language Server Protocol: Defines a Common Protocol for Language Servers](https:\u002F\u002Fgithub.com\u002FMicrosoft\u002Flanguage-server-protocol) \\[Spec\\] | — |\n| [Deductive Verification via the Debug Adapter Protocol](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.02968) | arXiv 2021 |\n| [Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3796519) | TOSEM 2025 |\n| [The Programmer's Assistant: Conversational Interaction with a Large Language Model for Software Development](https:\u002F\u002Fdoi.org\u002F10.1145\u002F3581641.3584037) | IUI 2023 |\n| [Human-AI Experience in Integrated Development Environments: A Systematic Literature 
Review](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs10664-025-10793-0) | Empirical Software Engineering 2026 |\n\n##### Execution and Validation Environments\n\n| Paper | Venue |\n| --- | --- |\n| [RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07358) | arXiv 2025 |\n| [Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05710) | arXiv 2025 |\n| [FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.06939) | arXiv 2026 |\n| [LLMLOOP: Improving LLM-Generated Code and Tests Through Automated Iterative Feedback Loops](https:\u002F\u002Fdoi.org\u002F10.1109\u002FICSME64153.2025.00109) | ICSME 2025 |\n| [OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety](https:\u002F\u002Fopenreview.net\u002Fforum?id=xggSxCFQbA) | ICLR 2026 |\n| [KubeIntellect: A Modular LLM-Orchestrated Agent Framework for End-to-End Kubernetes Management](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02449) | arXiv 2025 |\n| [MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios](https:\u002F\u002Faclanthology.org\u002F2025.findings-acl.305\u002F) | ACL 2025 Findings |\n| [ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?](https:\u002F\u002Faclanthology.org\u002F2024.emnlp-main.859\u002F) | EMNLP 2024 |\n\n##### Engineering Platforms for Deployment and Workflow Integration\n\n| Paper | Venue |\n| --- | --- |\n| [LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead](https:\u002F\u002Fdoi.org\u002F10.1145\u002F3712003) | TOSEM 2024 |\n| [AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development 
Automation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19902) | arXiv 2025 |\n| [ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.03463) | arXiv 2025 |\n| [From challenges to metrics: An LLM-driven DevOps recommendation system grounded in evidence-based mappings](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2590005625001742) | Array 2025 |\n| [AI Augmented CI\u002FCD Pipelines: From Code Commit to Production with Autonomous Decisions](http:\u002F\u002Fdx.doi.org\u002F10.1109\u002FFLLM67465.2025.11391007) | IEEE FLLM 2025 |\n| [A Multi-Agent Coding Assistant for Cloud-Native Development: From Requirements to Deployable Microservices](https:\u002F\u002Fdoi.org\u002F10.20944\u002Fpreprints202512.1922.v1) | Preprints 2025 |\n| [Continuous QoS-compliant Orchestration in the Cloud-Edge Continuum](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02985) | Software: Practice and Experience 2024 |\n| [From Code Generation to AI Collaboration: The Role of Multi-Agent Systems in Software Engineering](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F388835330_From_Code_Generation_to_AI_Collaboration_The_Role_of_Multi-Agent_Systems_in_Software_Engineering) | 2025 |\n| [AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations](https:\u002F\u002Fopenreview.net\u002Fforum?id=BAakY1hNKS) | COLM 2024 |\n\n#### Feedback Mechanisms for Iterative Debugging\n\n##### Compilation and Static-Analysis Feedback\n\n| Paper | Venue |\n| --- | --- |\n| [The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18403) | arXiv 2025 |\n| [Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14841) | Discover Artificial Intelligence 2024 |\n| [Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent 
Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02133) | arXiv 2025 |\n| [Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.138\u002F) | ACL 2024 Findings |\n| [Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.14419) | arXiv 2025 |\n\n##### Runtime Error and Exception Feedback\n\n| Paper | Venue |\n| --- | --- |\n| [Towards Agentic Runtime Healing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.01055) | arXiv 2024 |\n| [Large Language Model Guided Self-Debugging Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02928) | arXiv 2025 |\n| [Code Repair with LLMs gives an Exploration-Exploitation Tradeoff](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fd5c56ec4f69c9a473089b16000d3f8cd-Abstract-Conference.html) | NeurIPS 2024 |\n| [Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.49\u002F) | ACL 2024 Findings |\n\n##### Test-Based Execution Feedback\n\n| Paper | Venue |\n| --- | --- |\n| [Teaching Large Language Models to Self-Debug](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.05128) | arXiv 2023 |\n| [Learning to generate unit tests for automated debugging](https:\u002F\u002Fopenreview.net\u002Fpdf?id=yeVBHPLXxi) | COLM 2025 |\n| [TestART: Improving LLM-Based Unit Testing via Co-Evolution of Automated Generation and Repair Iteration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03095) | arXiv 2024 |\n| [From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01215) | ICSE 2026 |\n| [Revisit Self-Debugging with Self-Generated Tests for Code 
Generation](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.881\u002F) | ACL 2025 |\n| [LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1109\u002FTSE.2024.3428972) | TSE 2024 |\n\n##### Critique-Driven Feedback (Human or Auxiliary Agents)\n\n| Paper | Venue |\n| --- | --- |\n| [Interactive Debugging and Steering of Multi-Agent AI Systems](https:\u002F\u002Fdoi.org\u002F10.1145\u002F3706598.3713581) | CHI 2025 |\n| [RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance](https:\u002F\u002Fdoi.org\u002F10.1109\u002FICA63002.2024.00037) | International Conference on Agents 2024 |\n\n##### Feedback-Driven Debugging and Self-Improvement\n\n| Paper | Venue |\n| --- | --- |\n| [Teaching Your Models to Understand Code via Focal Preference Alignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02783) | arXiv 2025 |\n| [ReVeal: Self-Evolving Code Agents via Reliable Self-Verification](https:\u002F\u002Fopenreview.net\u002Fforum?id=q56ZI1Co43) | NeurIPS 2025 |\n\n## 👥 Scaling the Harness: Multi-Agent Code-Centric Systems\n\nWhen multiple agents operate over code, the harness must coordinate roles, share intermediate artifacts, maintain common state, and verify collective progress through repositories, tests, traces, and structured workflows.\n\n![Scaling the harness](figs\u002Fscaling_harness.png)\n\n### 🎭 Functional Role Specialization\n\nDistinct agents own slices of the shared code harness — synthesis, understanding, verification, execution, and planning.\n\n#### Program Synthesis Agents\n\n| Paper | Venue |\n| --- | --- |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Fopenreview.net\u002Fforum?id=VtmBAGCN7o) | ICLR 2024 |\n| [ChatDev: Communicative 
Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [MAGE: A multi-agent engine for automated RTL code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07822) | DAC 2025 |\n| [Self-collaboration Code Generation via ChatGPT](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3672459) | TOSEM 2024 |\n\n#### Program Understanding Agents\n\n| Paper | Venue |\n| --- | --- |\n| [HyperAgent: Generalist software engineering agents to solve coding tasks at scale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.16299) | arXiv 2024 |\n| [Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3728981) | ISSTA 2025 |\n| [CleanAgent: Automating data standardization with LLM-based agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.08291) | arXiv 2024 |\n| [MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution](https:\u002F\u002Fopenreview.net\u002Fforum?id=qevq3FZ63J) | NeurIPS 2024 |\n\n#### Verification Agents\n\n| Paper | Venue |\n| --- | --- |\n| [QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17167) | arXiv 2025 |\n| [AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10737) | arXiv 2024 |\n| [Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02943) | arXiv 2025 |\n\n#### Execution Agents\n\n| Paper | Venue |\n| --- | --- |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [HyperAgent: Generalist software engineering agents to solve coding tasks at 
scale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.16299) | arXiv 2024 |\n| [MAGE: A multi-agent engine for automated RTL code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07822) | DAC 2025 |\n\n#### Planning Agents\n\n| Paper | Venue |\n| --- | --- |\n| [Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02183) | arXiv 2024 |\n| [Self-Evolving Multi-Agent Collaboration Networks for Software Development](https:\u002F\u002Fopenreview.net\u002Fforum?id=4R71pdPBZp) | ICLR 2025 |\n| [SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1109\u002FICSE55347.2025.00140) | ICSE 2025 |\n\n### 💬 Interaction Modes\n\nCode-centric multi-agent interaction is artifact-mediated: agents observe and modify shared code, and grounding comes from the objective state exposed by execution.\n\n#### Collaborative Synthesis\n\n| Paper | Venue |\n| --- | --- |\n| [CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01411) | arXiv 2024 |\n| [A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3691620.3695506) | ASE 2024 |\n\n#### Critique and Repair\n\n| Paper | Venue |\n| --- | --- |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [SEW: Self-evolving agentic workflows for automated code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18646) | arXiv 2025 |\n\n#### Adversarial Validation\n\n| Paper | Venue |\n| --- | --- |\n| [AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz 
Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10737) | arXiv 2024 |\n| [MAGE: A multi-agent engine for automated RTL code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07822) | DAC 2025 |\n\n#### Reasoning Debate\n\n| Paper | Venue |\n| --- | --- |\n| [ChatDev: Communicative Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02943) | arXiv 2025 |\n\n### 🕸️ Workflow Topology\n\nTopology of agent interaction (chain, cyclic, hierarchical, star, adaptive) is one of the most consequential design decisions in multi-agent code generation.\n\n#### Pre-Defined Heuristic Topologies (Waterfall \u002F Iterative \u002F Hierarchical \u002F Star)\n\n| Paper | Venue |\n| --- | --- |\n| [ChatDev: Communicative Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Fopenreview.net\u002Fforum?id=VtmBAGCN7o) | ICLR 2024 |\n| [L2MAC: Large language model automatic computer for extensive code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02003) | ICLR 2024 |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [MAGE: A multi-agent engine for automated RTL code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07822) | DAC 2025 |\n| [HyperAgent: Generalist software engineering agents to solve coding tasks at scale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.16299) | arXiv 2024 |\n| [Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02183) | arXiv 2024 |\n| [Hallucination to 
Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02943) | arXiv 2025 |\n\n#### Objective-Driven and Adaptive Topologies\n\n| Paper | Venue |\n| --- | --- |\n| [FlowReasoner: Reinforcing Query-Level Meta-Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15257) | arXiv 2025 |\n| [BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.23631) | arXiv 2025 |\n| [SEW: Self-evolving agentic workflows for automated code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18646) | arXiv 2025 |\n\n### ⚡ Execution Feedback Integration\n\nCode is uniquely executable, producing objective oracle signals that anchor multi-agent coordination.\n\n#### Compiler and Syntax Feedback\n\n| Paper | Venue |\n| --- | --- |\n| [ChatDev: Communicative Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [L2MAC: Large language model automatic computer for extensive code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02003) | ICLR 2024 |\n\n#### Test Pass\u002FFail Signals\n\n| Paper | Venue |\n| --- | --- |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17167) | arXiv 2025 |\n\n#### Fuzzer Crash Traces\n\n| Paper | Venue |\n| --- | --- |\n| [AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10737) | arXiv 2024 |\n\n#### Static Analysis Warnings\n\n| Paper | Venue |\n| --- | --- |\n| [AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz 
Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10737) | arXiv 2024 |\n\n#### Performance Profiling Results\n\n| Paper | Venue |\n| --- | --- |\n| [MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03906) | arXiv 2025 |\n\n#### Fine-Grained Simulation Feedback\n\n| Paper | Venue |\n| --- | --- |\n| [MAGE: A multi-agent engine for automated RTL code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07822) | DAC 2025 |\n\n### 🔄 Shared-Harness Synchronization\n\nHow multi-agent systems maintain a consistent shared view of program state.\n\n#### Shared Blackboard\n\n| Paper | Venue |\n| --- | --- |\n| [L2MAC: Large language model automatic computer for extensive code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02003) | ICLR 2024 |\n\n#### Parallel Branches with Merge\n\n| Paper | Venue |\n| --- | --- |\n| [HyperAgent: Generalist software engineering agents to solve coding tasks at scale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.16299) | arXiv 2024 |\n\n#### Structured Context Scheduling\n\n| Paper | Venue |\n| --- | --- |\n| [MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Fopenreview.net\u002Fforum?id=VtmBAGCN7o) | ICLR 2024 |\n\n#### Hierarchical Memory\n\n| Paper | Venue |\n| --- | --- |\n| [ChatDev: Communicative Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18653) | arXiv 2025 |\n\n#### Agent Pool Scaling\n\n| Paper | Venue |\n| --- | --- |\n| [Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02183) | arXiv 2024 |\n\n### 🏛️ Shared Harness Representation\n\nFour 
levels of formalization for the shared substrate: implicit\u002Ffile-only, repository-based, execution-based, and blackboard.\n\n#### Implicit \u002F File-Only Representation\n\n| Paper | Venue |\n| --- | --- |\n| [ChatDev: Communicative Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Fopenreview.net\u002Fforum?id=VtmBAGCN7o) | ICLR 2024 |\n| [CodeCoR: An LLM-based self-reflective multi-agent framework for code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07811) | arXiv 2025 |\n| [SEW: Self-evolving agentic workflows for automated code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18646) | arXiv 2025 |\n| [CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01411) | arXiv 2024 |\n| [SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering](https:\u002F\u002Fopenreview.net\u002Fforum?id=6TDSDdgP7Z) | ICML 2025 |\n\n#### Repository-Based Representation\n\n| Paper | Venue |\n| --- | --- |\n| [HyperAgent: Generalist software engineering agents to solve coding tasks at scale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.16299) | arXiv 2024 |\n| [Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3728981) | ISSTA 2025 |\n\n#### Execution-Based Representation\n\n| Paper | Venue |\n| --- | --- |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10737) | arXiv 2024 |\n| [QualityFlow: An agentic 
workflow for program synthesis controlled by LLM quality checks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17167) | arXiv 2025 |\n| [MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03906) | arXiv 2025 |\n| [Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02943) | arXiv 2025 |\n| [MAGE: A multi-agent engine for automated RTL code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07822) | DAC 2025 |\n\n#### Blackboard \u002F Shared-State Representation\n\n| Paper | Venue |\n| --- | --- |\n| [L2MAC: Large language model automatic computer for extensive code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02003) | ICLR 2024 |\n| [GameGPT: Multi-agent Collaborative Framework for Game Development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08067) | arXiv 2023 |\n| [Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18653) | arXiv 2025 |\n\u003C!-- | [The Hearsay-II Speech-Understanding System: Integrating Knowledge to Resolve Uncertainty](https:\u002F\u002Fdoi.org\u002F10.1145\u002F356810.356816) | CSUR 1980 | -->\n\n### 🎯 Harness-State Convergence\n\nHow a multi-agent code system decides the shared harness has reached an acceptable final state.\n\n#### Correctness Convergence (Test-Gated)\n\n| Paper | Venue |\n| --- | --- |\n| [AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010) | arXiv 2023 |\n| [L2MAC: Large language model automatic computer for extensive code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02003) | ICLR 2024 |\n| [Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test 
Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02943) | arXiv 2025 |\n\n#### Security Convergence\n\n| Paper | Venue |\n| --- | --- |\n| [AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10737) | arXiv 2024 |\n\n#### Performance Convergence\n\n| Paper | Venue |\n| --- | --- |\n| [MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03906) | arXiv 2025 |\n\n#### Score-Based Convergence\n\n| Paper | Venue |\n| --- | --- |\n| [MAGE: A multi-agent engine for automated RTL code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07822) | DAC 2025 |\n| [CodeCoR: An LLM-based self-reflective multi-agent framework for code generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07811) | arXiv 2025 |\n| [Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23370) | arXiv 2025 |\n\n#### Consensus Convergence\n\n| Paper | Venue |\n| --- | --- |\n| [QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17167) | arXiv 2025 |\n\n#### Implicit Convergence\n\n| Paper | Venue |\n| --- | --- |\n| [ChatDev: Communicative Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Fopenreview.net\u002Fforum?id=VtmBAGCN7o) | ICLR 2024 |\n\n## 🚀 Applications and Emerging Fields\n\nCode-centric agentic systems become operational in tangible domains where code defines observable state, executable actions, persistent memory, and feedback signals.\n\n![Applications](figs\u002Fapplications.png)\n\n### 💻 Code Assistants\n\nRepositories, tests, issue threads, and 
development tools form a persistent program world; assistants act over it as code-centric agents.\n\n#### The Repository as a Persistent Program World\n\n| Paper | Venue |\n| --- | --- |\n| [RepoCoder: Repository-Level Code Completion through Iterative Retrieval and Generation](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.151\u002F) | EMNLP 2023 |\n| [CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03910) | NAACL 2025 |\n| [AutoCodeRover: Autonomous Program Improvement](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05427) | ISSTA 2024 |\n\n#### Agent Harnesses as Executable Development Interfaces\n\n| Paper | Venue |\n| --- | --- |\n| [Claude Code](https:\u002F\u002Fwww.anthropic.com\u002Fproduct\u002Fclaude-code) \\[Blog\\] | 2025 |\n| [Introducing Codex](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-codex\u002F) \\[Blog\\] | 2025 |\n| [About GitHub Copilot Cloud Agent](https:\u002F\u002Fdocs.github.com\u002Fcopilot\u002Fconcepts\u002Fagents\u002Fcoding-agent\u002Fabout-coding-agent) \\[Blog\\] | 2025 |\n| [DeepAgents](https:\u002F\u002Fgithub.com\u002Flangchain-ai\u002Fdeepagents) \\[GitHub\\] | 2025 |\n| [Model Context Protocol](https:\u002F\u002Fdocs.anthropic.com\u002Fen\u002Fdocs\u002Fagents-and-tools\u002Fmcp) \\[Docs\\] | 2024 |\n| [Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23278) | ACM TOSEM 2025 |\n| [The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.03690) | arXiv 2025 |\n| [AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.03329) | arXiv 2026 |\n| [Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent 
Harnesses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.25850) | arXiv 2026 |\n| [Meta-Harness: End-to-End Optimization of Model Harnesses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.28052) | arXiv 2026 |\n| [Natural-Language Agent Harnesses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25723) | arXiv 2026 |\n\n#### Execution Feedback as Grounded Verification\n\n| Paper | Venue |\n| --- | --- |\n| [Agentless: Demystifying LLM-based Software Engineering Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01489) | arXiv 2024 |\n| [RepairAgent: An Autonomous, LLM-Based Agent for Program Repair](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17134) | ICSE 2025 |\n| [Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.13646) | arXiv 2025 |\n| [Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.08500) | arXiv 2024 |\n\n#### Memory and Context Management at Repository Scale\n\n| Paper | Venue |\n| --- | --- |\n| [RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation](https:\u002F\u002Faclanthology.org\u002F2024.emnlp-demo.46\u002F) | EMNLP 2024 (Demo) |\n| [ContextBench: A Benchmark for Context Retrieval in Coding Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.05892) | arXiv 2026 |\n| [CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.15813) | arXiv 2025 |\n| [MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.06789) | arXiv 2026 |\n\n#### Developer Intent and Project Conventions as Latent State\n\n| Paper | Venue |\n| --- | --- |\n| [Learning to Commit: Generating Organic Pull Requests via Online Repository Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.26664) | arXiv 2026 |\n| [CodeTaste: Can LLMs 
Generate Human-Level Code Refactorings?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04177) | arXiv 2026 |\n| [SWE-bench+: Enhanced Coding Benchmark for LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06992) | ICSE Companion 2025 |\n\n#### From Inline Completion to Autonomous SWE Agents\n\n| Paper | Venue |\n| --- | --- |\n| [Evaluating Large Language Models Trained on Code](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.03374) | arXiv 2021 |\n| [The Impact of AI on Developer Productivity: Evidence from GitHub Copilot](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.06590) | arXiv 2023 |\n| [Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models](https:\u002F\u002Fdoi.org\u002F10.1145\u002F3491101.3519665) | CHI Extended Abstracts 2022 |\n| [Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.14306) | CHI 2024 |\n\n#### From Patch Generation to Software Lifecycle Participation\n\n| Paper | Venue |\n| --- | --- |\n| [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06770) | ICLR 2024 |\n| [SWE-lancer: Can frontier LLMs earn \\$1 million from real-world freelance software engineering?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12115) | ICML 2025 |\n| [SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.16941) | arXiv 2025 |\n| [Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.11868) | arXiv 2026 |\n| [AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.874\u002F) | ACL 2024 |\n| [τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World 
Domains](https:\u002F\u002Fopenreview.net\u002Fforum?id=roNSXZpUDN) | ICLR 2025 |\n| [AI Augmented CI\u002FCD Pipelines: From Code Commit to Production with Autonomous Decisions](http:\u002F\u002Fdx.doi.org\u002F10.1109\u002FFLLM67465.2025.11391007) | IEEE FLLM 2025 |\n| [Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.11655) | arXiv 2026 |\n| [Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3696630.3728549) | FSE 2025 |\n| [CodeAgent: Autonomous Communicative Agents for Code Review](https:\u002F\u002Faclanthology.org\u002F2024.emnlp-main.632\u002F) | EMNLP 2024 |\n\n#### Multi-Agent Code Assistance and Shared Repositories\n\n| Paper | Venue |\n| --- | --- |\n| [ChatDev: Communicative Agents for Software Development](https:\u002F\u002Fdoi.org\u002F10.18653\u002Fv1\u002F2024.acl-long.810) | ACL 2024 |\n| [MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Fopenreview.net\u002Fforum?id=VtmBAGCN7o) | ICLR 2024 |\n| [CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.757\u002F) | ACL 2024 |\n| [METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04567) | ACL 2025 |\n\n#### The Harness as a Distillation Surface\n\n| Paper | Venue |\n| --- | --- |\n| [Composer: Building a fast frontier model with reinforcement learning](https:\u002F\u002Fcursor.com\u002Fblog\u002Fcomposer) \\[Blog\\] | 2025 |\n| [Improving Composer through real-time reinforcement learning](https:\u002F\u002Fcursor.com\u002Fblog\u002Freal-time-rl-for-composer) \\[Blog\\] | 2025 |\n| [Addendum to GPT-5 system card: 
GPT-5-Codex](https:\u002F\u002Fcdn.openai.com\u002Fpdf\u002F97cc5669-7a25-4e63-b15f-5fd5bdc4d149\u002Fgpt-5-codex-system-card.pdf) \\[Report\\] | 2025 |\n| [Building more with GPT-5.1-Codex-Max](https:\u002F\u002Fopenai.com\u002Findex\u002Fgpt-5-1-codex-max\u002F) \\[Blog\\] | 2025 |\n| [How Anthropic teams use Claude Code](https:\u002F\u002Fwww-cdn.anthropic.com\u002F58284b19e702b49db9302d5b6f135ad8871e7658.pdf) \\[Report\\] | 2025 |\n\n#### Open Challenges for Code-Assistant Harnesses\n\n| Paper | Venue |\n| --- | --- |\n| [Are \"Solved Issues\" in SWE-bench Really Solved Correctly? An Empirical Study](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15223) | arXiv 2025 |\n| [SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.17419) | arXiv 2025 |\n| [Introducing Aardvark: OpenAI's Agentic Security Researcher](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-aardvark\u002F) \\[Blog\\] | 2025 |\n| [Codex Security: Now in Research Preview](https:\u002F\u002Fopenai.com\u002Findex\u002Fcodex-security-now-in-research-preview\u002F) \\[Blog\\] | 2026 |\n| [Why Do Multi-Agent LLM Systems Fail?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13657) | arXiv 2025 |\n| [Which Agent Causes Task Failures and When? 
On Automated Failure Attribution of LLM Multi-Agent Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00212) | arXiv 2025 |\n| [AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03312) | arXiv 2025 |\n| [Where LLM Agents Fail and How They Can Learn from Failures](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25370) | arXiv 2025 |\n| [Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.11839) | arXiv 2026 |\n| [Fault-Tolerant Sandboxing for AI Coding Agents: A Transactional Approach to Safe Autonomous Execution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.12806) | arXiv 2025 |\n| [Introducing the Agent Governance Toolkit: Open-Source Runtime Security for AI Agents](https:\u002F\u002Fopensource.microsoft.com\u002Fblog\u002F2026\u002F04\u002F02\u002Fintroducing-the-agent-governance-toolkit-open-source-runtime-security-for-ai-agents\u002F) \\[Blog\\] | 2026 |\n\n### 🖥️ GUI \u002F OS Agents\n\nGUI\u002FOS environments are program worlds in the most literal sense: every observation is rendered code, and every action is a call into another piece of code.\n\n#### GUI\u002FOS as a Partially Observable Program World\n\n| Paper | Venue |\n| --- | --- |\n| [WebArena: A Realistic Web Environment for Building Autonomous Agents](https:\u002F\u002Fopenreview.net\u002Fforum?id=oKn9c6ytLx) | ICLR 2024 |\n| [Mind2Web: Towards a Generalist Agent for the Web](https:\u002F\u002Fopenreview.net\u002Fforum?id=kiYqbO3wqw) | NeurIPS 2023 |\n| [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https:\u002F\u002Fproceedings.iclr.cc\u002Fpaper_files\u002Fpaper\u002F2025\u002Fhash\u002F01a83bc2f2732a58e6aa731e659e7101-Abstract-Conference.html) | ICLR 2025 |\n| [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale](https:\u002F\u002Fproceedings.mlr.press\u002Fv267\u002Fbonatti25a.html) | ICML 2025 |\n| [AgentOccam: A 
Simple Yet Strong Baseline for LLM-Based Web Agents](https:\u002F\u002Fopenreview.net\u002Fforum?id=oWdzUpOlkX) | ICLR 2025 |\n| [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https:\u002F\u002Fproceedings.mlr.press\u002Fv235\u002Fzheng24e.html) | ICML 2024 |\n| [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.371\u002F) | ACL 2024 |\n| [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https:\u002F\u002Fopenreview.net\u002Fforum?id=tN61DTr4Ed) | NeurIPS 2024 |\n| [Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11441) | arXiv 2023 |\n| [WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?](https:\u002F\u002Fproceedings.mlr.press\u002Fv235\u002Fdrouin24a.html) | ICML 2024 |\n| [CogAgent: A Visual Language Model for GUI Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08914) | CVPR 2024 |\n\n#### Unifying Perception, Action, and Evaluation Through Code\n\n| Paper | Venue |\n| --- | --- |\n| [Executable Code Actions Elicit Better LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01030) | ICML 2024 |\n| [Cradle: Empowering Foundation Agents towards General Computer Control](https:\u002F\u002Fproceedings.mlr.press\u002Fv267\u002Ftan25h.html) | ICML 2025 |\n| [TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks](https:\u002F\u002Fopenreview.net\u002Fforum?id=LZnKNApvhG) | NeurIPS 2025 |\n| [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10935) | ACL 2024 |\n| [Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05719) | ECCV 2024 |\n| [OS-ATLAS: Foundation Action Model for Generalist GUI Agents](https:\u002F\u002Fopenreview.net\u002Fforum?id=n9PDaFNi8t) | ICLR 2025 |\n| [ShowUI: One 
Vision-Language-Action Model for GUI Visual Agent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.17465) | CVPR 2025 |\n| [Aria-UI: Visual Grounding for GUI Instructions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16256) | ACL 2025 Findings |\n| [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https:\u002F\u002Fopenreview.net\u002Fforum?id=kxnoqaisCT) | ICLR 2025 |\n| [UI-TARS: Pioneering Automated GUI Interaction with Native Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12326) | arXiv 2025 |\n| [GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.22190) | arXiv 2026 |\n| [Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fc2f71567cd53464161cab3336e8fc865-Abstract-Datasets_and_Benchmarks_Track.html) | NeurIPS 2024 |\n\n#### Memory as Persistent Program State\n\n| Paper | Venue |\n| --- | --- |\n| [Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control](https:\u002F\u002Fopenreview.net\u002Fforum?","This project is a research survey on “code as agent harness” that explores the new role of code in AI agents. Its core idea is to treat code as an executable, verifiable, and stateful harness through which agents reason, act, model their environment, receive feedback, and coordinate. The material is organized around three layers (the interface layer, the mechanism layer, and the scalability layer) and covers application areas ranging from coding assistants to GUI\u002FOS automation. It is aimed at researchers and developers who want to understand or build code-based AI agent systems.",2,"2026-06-11 03:56:02","CREATED_QUERY"]