[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78639":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":16,"starSnapshotCount":16,"syncStatus":15,"lastSyncTime":41,"discoverSource":42},78639,"forge","antoinezambelli\u002Fforge","antoinezambelli","A Python framework for self-hosted LLM tool-calling and multi-step agentic workflows","",null,"Python",2054,142,13,2,0,49,117,364,147,108.47,"MIT License",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37],"agentic-ai","agentic-workflow","agents","function-calling","llama-cpp","llamafile","llm","ollama","python","self-hosted","tool-calling","2026-06-12 04:01:23","# forge\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fforge-guardrails.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fforge-guardrails\u002F)\n[![Tests](https:\u002F\u002Fgithub.com\u002Fantoinezambelli\u002Fforge\u002Factions\u002Fworkflows\u002Ftests.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fantoinezambelli\u002Fforge\u002Factions\u002Fworkflows\u002Ftests.yml)\n[![codecov](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fantoinezambelli\u002Fforge\u002Fbranch\u002Fmain\u002Fgraph\u002Fbadge.svg)](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fantoinezambelli\u002Fforge)\n[![Python 3.12+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.12%2B-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![License: 
MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green.svg)](LICENSE)\n\nA reliability layer for self-hosted LLM tool-calling. You give forge a set of tools; the model calls whichever it wants in whatever order. Workflow structure is opt-in — `required_steps`, `prerequisites`, and `terminal_tool` let you constrain the loop when you need to, but forge's guardrails (rescue parsing, retry nudges, response validation) apply with zero required steps too.\n\nForge takes an 8B local model from single digits to 84% across forge's 26-scenario v0.7.0 eval suite — and even lifts Sonnet 4.6 from 85% to 98% on the same workload (Anthropic numbers measured in v0.6.0; not re-run in v0.7.0 since the cost is non-trivial).\n\n**What forge isn't:**\n- **Not an agent orchestrator.** Forge sits inside one agentic loop and makes its tool calls reliable. Multi-agent graphs, DAG planners, and cross-agent coordination are out of scope.\n- **Not a coding harness.** Forge is domain-agnostic. If you're building a coding agent (or already using one like opencode, aider, Cline), [proxy mode](#proxy-server) lifts your existing harness with forge's guardrails — no rewrite.\n\n**Three ways to use it:**\n\n- **Proxy server** — Drop-in proxy (`python -m forge.proxy`) speaking both the OpenAI chat-completions and Anthropic Messages (`\u002Fv1\u002Fmessages`) APIs, sitting between any client and a local model server. Point OpenAI-compatible tools (opencode, Continue, aider) **or Claude Code** at it and forge applies guardrails transparently — the client thinks it's talking to a smarter model. Most popular entry point.\n\n- **WorkflowRunner** — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. **SlotWorker** adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. 
Best when you're building on forge directly.\n\n- **Guardrails middleware** — Use forge's reliability stack ([composable middleware](examples\u002Fforeign_loop.py)) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.\n\nSupports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends.\n\n## Requirements\n\n- Python 3.12+\n- A running LLM backend (see below)\n\n## Install\n\n```bash\npip install forge-guardrails                # core only\npip install \"forge-guardrails[anthropic]\"   # + Anthropic client\n```\n\nFor development:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fantoinezambelli\u002Fforge.git\ncd forge\npip install -e \".[dev]\"\n```\n\n### Backend setup (pick one)\n\n**llama-server** (recommended — top 10 eval configs all run on llama-server):\n```bash\n# Install from https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp\u002Freleases\nllama-server -m path\u002Fto\u002FMinistral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080\n```\n\n**Ollama** (alternative — easier setup, slightly weaker on harder workloads):\n```bash\n# Install from https:\u002F\u002Follama.com\u002Fdownload\nollama pull ministral-3:8b-instruct-2512-q4_K_M\n```\n\n**Anthropic** (API, no local GPU needed):\n```bash\npip install -e \".[anthropic]\"\nexport ANTHROPIC_API_KEY=sk-...\n```\n\nSee [Backend Setup](docs\u002FBACKEND_SETUP.md) for full instructions and [Model Guide](docs\u002FMODEL_GUIDE.md) for which model fits your hardware.\n\n## Quick Start\n\nStart llama-server however you normally do (e.g. in a separate shell):\n\n```bash\nllama-server -m path\u002Fto\u002FMinistral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080\n```\n\nThen the Python you'll run (e.g. 
from another shell):\n\n```python\nimport asyncio\nfrom pydantic import BaseModel, Field\nfrom forge import (\n    Workflow, ToolDef, ToolSpec,\n    WorkflowRunner, LlamafileClient,\n    ContextManager, TieredCompact,\n)\n\ndef get_weather(city: str) -> str:\n    return f\"72°F and sunny in {city}\"\n\nclass GetWeatherParams(BaseModel):\n    city: str = Field(description=\"City name\")\n\nworkflow = Workflow(\n    name=\"weather\",\n    description=\"Look up weather for a city.\",\n    tools={\n        \"get_weather\": ToolDef(\n            spec=ToolSpec(\n                name=\"get_weather\",\n                description=\"Get current weather\",\n                parameters=GetWeatherParams,\n            ),\n            callable=get_weather,\n        ),\n    },\n    required_steps=[],\n    terminal_tool=\"get_weather\",\n    system_prompt_template=\"You are a helpful assistant. Use the available tools to answer the user.\",\n)\n\nasync def main():\n    client = LlamafileClient(\n        gguf_path=\"path\u002Fto\u002FMinistral-3-8B-Instruct-2512-Q8_0.gguf\",\n        mode=\"native\",\n        recommended_sampling=True,\n    )\n    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)\n    runner = WorkflowRunner(client=client, context_manager=ctx)\n    await runner.run(workflow, \"What's the weather in Paris?\")\n\nasyncio.run(main())\n```\n\nFor multi-step workflows, multi-turn conversations, and backend auto-management, see the [User Guide](docs\u002FUSER_GUIDE.md). If you're building a long-running session (CLI, chat server, voice assistant), see the [long-running session advisory](docs\u002FUSER_GUIDE.md#long-running-sessions-filtering-transient-messages) for important guidance on filtering transient messages.\n\n## Proxy Server\n\nDrop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (`\u002Fv1\u002Fmessages`). 
Point your client at the proxy (e.g. `http:\u002F\u002Flocalhost:8081\u002Fv1`) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model.\n\nThis is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite.\n\n```bash\n# External mode — you manage the backend, forge proxies it\npython -m forge.proxy --backend-url http:\u002F\u002Flocalhost:8080 --port 8081\n\n# Managed mode — forge starts the backend and the proxy together\npython -m forge.proxy --backend llamaserver --gguf path\u002Fto\u002Fmodel.gguf --port 8081\n```\n\nThen configure your client to use `http:\u002F\u002Flocalhost:8081\u002Fv1` as the API base URL.\n\n**Claude Code:** the proxy also serves the Anthropic Messages API on `POST \u002Fv1\u002Fmessages`, so you can point Claude Code at a forge-guarded local model — set `ANTHROPIC_BASE_URL=http:\u002F\u002Flocalhost:8081` and `ANTHROPIC_AUTH_TOKEN=anything` for the `claude` process. See [Using forge with Claude Code](docs\u002FUSER_GUIDE.md#using-forge-with-claude-code) for the full setup (native-vs-prompt FC, Anthropic-shape downstreams, `cache_control`).\n\n**Backend compatibility:**\n\n- **Managed mode** spins up the backend for you. Supported backends: `llamaserver`, `llamafile`, `ollama` (use `--backend \u003Cname>` with `--gguf` for the GGUF-based backends, or `--model` for ollama).\n- **External mode** is backend-agnostic — forge talks `POST \u002Fv1\u002Fchat\u002Fcompletions` to whatever you point `--backend-url` at, as long as it speaks the OpenAI schema. Tool calls must come back in OpenAI `tool_calls` format or in one of forge's rescue-parsed formats (Mistral `[TOOL_CALLS]`, Qwen `\u003Ctool_call>` XML, fenced JSON).\n\n### What proxy mode fortifies\n\nOn every `POST \u002Fv1\u002Fchat\u002Fcompletions`, forge applies (in order):\n\n1. 
**Response validation** — each tool call in the model's response is checked against the `tools` array in the request. Calls to unknown tool names or with malformed shapes are caught before the response returns to your client.\n2. **Rescue parsing** — when the model emits tool calls in the wrong format (JSON in a code fence, Mistral's `[TOOL_CALLS]name{args}`, Qwen's `\u003Ctool_call>...\u003C\u002Ftool_call>` XML), forge extracts the structured call and re-emits it in the canonical OpenAI `tool_calls` schema. Biggest practical lift for Mistral-family models.\n3. **Retry loop with error tracking** — if validation fails, forge retries inference up to `--max-retries` (default 3) with a corrective tool-result message on the canonical channel, rather than returning a malformed response. From your client's perspective the proxy looks like a single request that just took a few extra ms.\n4. **Synthetic `respond` tool injection** — when tools are present in the request, forge injects a synthetic `respond` tool the model calls instead of producing bare text. The `respond` call is stripped from the outbound response — the client sees a normal text response (`finish_reason: \"stop\"`) and never knows the tool exists. Essential for small local models (~8B) that can't be trusted to choose correctly between text and tool calls. See [ADR-013](docs\u002Fdecisions\u002F013-text-response-intent.md) for the full analysis.\n\n### What proxy mode does *not* do\n\nProxy mode is single-shot per request; some forge features need multi-turn workflow state that the OpenAI chat-completions schema doesn't carry:\n\n- **Prerequisite enforcement and step-ordering** — these need a workflow definition spanning turns. 
Available in `WorkflowRunner`.\n- **Context compaction and session memory** — proxy mode forwards the inbound message list as-is; managing the rolling window is the client's job.\n- **VRAM-aware budget detection** — opt in with `--budget-mode forge-full` or `--budget-mode forge-fast`; otherwise proxy uses the backend's reported budget.\n\nFor the full guardrail surface, use `WorkflowRunner` directly. The proxy trades depth for \"use forge with your existing setup, no rewrite.\"\n\n### Useful flags\n\n| Flag | Default | Purpose |\n|---|---|---|\n| `--max-retries N` | 3 | Retry budget per validation failure |\n| `--no-rescue` | (rescue on) | Disable rescue parsing (debugging only) |\n| `--budget-mode {backend,manual,forge-full,forge-fast}` | `backend` | Context budget source |\n| `--budget-tokens N` | — | Manual token budget (requires `--budget-mode manual`) |\n| `--serialize` \u002F `--no-serialize` | auto | Force request serialization (single-slot backends) |\n\n### Docker\n\nYou can run the forge proxy as a Docker container.\n\n**Build the image:**\n\n```bash\ndocker build -t forge-proxy .\n```\n\n**Run the container:**\n\n```bash\n# Connect to an external backend (e.g. vLLM hosted on the same machine)\ndocker run -p 8081:8081 forge-proxy --backend-url http:\u002F\u002Fhost.docker.internal:8000 --budget-mode manual --budget-tokens 8192\n```\n\nNote: If your backend is running on `localhost` of the host machine, use `http:\u002F\u002Fhost.docker.internal:PORT` (on macOS\u002FWindows) or the host's IP address to allow the container to reach it.\n\n## Backends\n\n| Backend | Best for | Native FC? 
|\n|---------|----------|------------|\n| **Ollama** | Easiest setup, model management built-in | Yes |\n| **llama-server** | Best performance, full control | Yes (with `--jinja`) |\n| **Llamafile** | Single binary, zero dependencies | No (prompt-injected) |\n| **Anthropic** | Frontier baseline, hybrid workflows | Yes |\n\nSee [Backend Setup](docs\u002FBACKEND_SETUP.md) for installation and [Model Guide](docs\u002FMODEL_GUIDE.md) for which model to pick.\n\n## Running Tests\n\n```bash\npython -m pytest tests\u002F -v --tb=short\n```\n\n```bash\npython -m pytest tests\u002F --cov=forge --cov-report=term-missing\n```\n\n## Eval Harness\n\n26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation. See [Eval Guide](docs\u002FEVAL_GUIDE.md) for full CLI reference.\n\n```bash\n# llama-server (start in another terminal first; see Eval Guide)\npython -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf \"path\u002Fto\u002FMinistral-3-8B-Instruct-2512-Q8_0.gguf\" --runs 10 --stream --verbose\n\n# Batch eval (JSONL output, automatic resume)\npython -m tests.eval.batch_eval --config all --runs 50\n\n# Reports — ASCII table by default; --html \u002F --markdown export views\npython -m tests.eval.report eval_results.jsonl\npython -m tests.eval.report eval_results.jsonl --html docs\u002Fresults\u002Fdashboard.html\npython -m tests.eval.report eval_results.jsonl --markdown docs\u002Fresults\u002Fraw\u002F\n```\n\n## Project Structure\n\n```\nsrc\u002Fforge\u002F\n  __init__.py          # Public API exports\n  errors.py            # ForgeError hierarchy\n  server.py            # setup_backend(), ServerManager, BudgetMode\n  core\u002F\n    messages.py        # Message, MessageRole, MessageType, MessageMeta\n    workflow.py        # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow\n    inference.py       # 
run_inference() — shared front half (compact, fold, validate, retry)\n    runner.py          # WorkflowRunner — the agentic loop\n    slot_worker.py     # SlotWorker — priority-queued slot access\n    steps.py           # StepTracker\n  guardrails\u002F\n    guardrails.py      # Guardrails facade — applies the full stack in foreign loops\n    nudge.py           # Nudge dataclass\n    response_validator.py  # ResponseValidator, ValidationResult\n    step_enforcer.py   # StepEnforcer, StepCheck\n    error_tracker.py   # ErrorTracker\n  clients\u002F\n    base.py            # ChunkType, StreamChunk, LLMClient protocol\n    ollama.py          # OllamaClient (native FC)\n    llamafile.py       # LlamafileClient (native FC or prompt-injected)\n    anthropic.py       # AnthropicClient (frontier baseline)\n  context\u002F\n    manager.py         # ContextManager, CompactEvent\n    strategies.py      # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact\n    hardware.py        # HardwareProfile, detect_hardware()\n  prompts\u002F\n    templates.py       # Tool prompt builders (prompt-injected path)\n    nudges.py          # Retry and step-enforcement nudge templates\n  tools\u002F\n    respond.py         # Synthetic respond tool (respond_tool(), respond_spec())\n  proxy\u002F\n    __main__.py        # CLI entry point: python -m forge.proxy\n    proxy.py           # ProxyServer — programmatic start\u002Fstop API\n    server.py          # Raw asyncio HTTP server, SSE streaming\n    handler.py         # Request handler — bridge between HTTP and run_inference\n    convert.py         # OpenAI messages ↔ forge Messages conversion\ntests\u002F\n  unit\u002F                # 865 deterministic tests — no LLM backend required\n  eval\u002F                # Eval harness — model qualification against real backends\n```\n\n## Documentation\n\n- [User Guide](docs\u002FUSER_GUIDE.md) — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running 
session advisory\n- [Model Guide](docs\u002FMODEL_GUIDE.md) — Which model and backend for your hardware\n- [Backend Setup](docs\u002FBACKEND_SETUP.md) — Backend installation and server setup\n- [Eval Guide](docs\u002FEVAL_GUIDE.md) — Eval harness CLI reference, batch eval\n- [Architecture](docs\u002FARCHITECTURE.md) — Full design document\n- [Workflow Internals](docs\u002FWORKFLOW.md) — Workflow design and runner internals\n- [Contributing](CONTRIBUTING.md) — How to set up, test, and add new backends or scenarios\n\n## Paper\n\nThe forge guardrail framework and ablation study are published as:\n\n> Zambelli, A. *Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling.*\n> [https:\u002F\u002Fdoi.org\u002F10.1145\u002F3786335.3813193](https:\u002F\u002Fdoi.org\u002F10.1145\u002F3786335.3813193)\n\nA pre-publication preprint is also available at [docs\u002Fforge_ieee_preprint.pdf](docs\u002Fforge_ieee_preprint.pdf) — kept as a historical artifact. Cite the published version above; the DOI link may not resolve immediately depending on the publisher's release timing.\n\n## License\n\n[MIT](LICENSE) — Copyright (c) 2025-2026 Antoine Zambelli\n","Forge is a Python framework for self-hosted large language model (LLM) tool-calling and multi-step agentic workflows. It provides a set of reliability mechanisms, such as rescue parsing, retry nudges, and response validation, so that an LLM can call developer-defined tools safely and reliably, and it supports opt-in workflow structure. The project is particularly suited to locally deployed setups that need more stable interaction between an LLM and external tools, for example automated task processing or conversation-based services. Forge also offers several usage modes: as a proxy server, as a directly run workflow, or as middleware integrated into an existing system, so users can choose whichever mode best fits their needs to improve application performance.","2026-06-11 03:57:06","high_star"]