[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79954":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":15,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":29,"discoverSource":30},79954,"renderers","PrimeIntellect-ai\u002Frenderers","PrimeIntellect-ai","Programmable chat templates for LLM training and inference.","",null,"Python",110,18,2,6,0,3,32,9,3.84,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:56","# renderers\n\nProgrammable chat templates for LLM training and inference. A renderer turns a model's chat template into a Python object that can render messages → token ids, parse completion ids → structured assistant messages, and extend a multi-turn rollout without re-rendering model-sampled history.\n\nStandalone on PyPI, and portable across training and inference stacks (transformers, vLLM, SGLang, Tinker). 
Initially developed for RL training with [verifiers](https:\u002F\u002Fgithub.com\u002FPrimeIntellect-ai\u002Fverifiers) and `prime-rl` at Prime Intellect.\n\n## Install\n\n```bash\nuv add renderers\n```\n\n## At a glance\n\n```python\nfrom transformers import AutoTokenizer\nfrom renderers import create_renderer\n\ntok = AutoTokenizer.from_pretrained(\"Qwen\u002FQwen3-8B\")\nr = create_renderer(tok)                            # → Qwen3Renderer (auto-resolved)\n\nprompt_ids = r.render_ids(\n    [{\"role\": \"user\", \"content\": \"hi\"}],\n    add_generation_prompt=True,\n)\n# Feed prompt_ids to a Token-In, Token-Out endpoint.\n# It returns completion_ids sampled by the model.\n\nparsed = r.parse_response(completion_ids)\n# ParsedResponse(content=..., reasoning_content=..., tool_calls=...)\n```\n\nFor the next turn, extend the previous sampled stream instead of re-rendering history:\n\n```python\nnext_prompt_ids = r.bridge_to_next_turn(\n    previous_prompt_ids=prompt_ids,\n    previous_completion_ids=completion_ids,\n    new_messages=[{\"role\": \"tool\", \"content\": \"...\"}],\n)\n```\n\nHand-coded renderers ship for `qwen3`, `qwen3-vl`, `qwen3.5`, `qwen3.6`, `glm-5`, `glm-5.1`, `glm-4.5`, `minimax-m2`, `deepseek-v3`, `kimi-k2`, `kimi-k2.5`, `nemotron-3`, `gpt-oss`. 
Anything else falls back to `DefaultRenderer`, a generic `apply_chat_template` wrapper.\n\n## API\n\n```python\nclass Renderer(Protocol):\n    def render(messages, *, tools=None, add_generation_prompt=False) -> RenderedTokens: ...\n    def render_ids(messages, *, tools=None, add_generation_prompt=False) -> list[int]: ...\n    def parse_response(token_ids) -> ParsedResponse: ...\n    def get_stop_token_ids() -> list[int]: ...\n    def bridge_to_next_turn(prev_prompt_ids, prev_completion_ids, new_messages, *, tools=None) -> list[int] | None: ...\n```\n\n- `RenderedTokens` carries `token_ids` **and** `message_indices` — one entry per token attributing each to its source message (`-1` for structural scaffolding). Lets `build_training_sample` build a per-token loss mask in one render.\n- `ParsedResponse` is `(content, reasoning_content, tool_calls)`. It scans token ids for special-token boundaries (e.g. id `151657` for `\u003Ctool_call>` on Qwen3) — a literal `\"\u003Ctool_call>\"` in user content tokenizes to ordinary text ids and never matches.\n- Round-trip: rendering `[user, assistant(content, reasoning, tool_calls)]`, slicing the assistant completion, and feeding it through `parse_response` returns an equivalent structured message. Tested per-renderer in `tests\u002Ftest_roundtrip.py`.\n\n### `bridge_to_next_turn` (the core contract)\n\nGiven `(prev_prompt_ids, prev_completion_ids)` and new environment messages, return ids for the next turn's prompt such that the result starts with `prev_prompt_ids + prev_completion_ids` byte-for-byte and continues with the new messages plus the next assistant opener. If that cannot be proven safe, return `None` and the caller falls back to a full render.\n\nEach hand-coded bridge:\n1. Anchors at the previous turn's canonical close token. On clean stops it's already in `prev_completion_ids`. On truncation, the renderer synthesizes the close as non-loss prompt context.\n2. 
Refuses assistant content in `new_messages` — re-rendering sampled tokens would replace them with canonical template bytes.\n3. Renders only the new messages in the framing the model family expects.\n\n`DefaultRenderer.bridge_to_next_turn` returns `None` unconditionally — the template's close is unknown, so the contract can't be proven.\n\n### Picking a renderer\n\n```python\nr = create_renderer(tok)                # AutoRendererConfig is the implicit default\n```\n\nAuto-detect matches `tokenizer.name_or_path` against `MODEL_RENDERER_MAP` by **exact match**. Prefix matching is intentionally off — same architecture can ship different chat templates (base vs instruct, fine-tune renames). Fine-tunes must pass an explicit typed config (e.g. `Qwen3RendererConfig()`); unknown names fall back to `DefaultRenderer`.\n\n### Pools\n\n```python\nfrom renderers import create_renderer_pool\n\npool = create_renderer_pool(\"Qwen\u002FQwen3-8B\", size=16)\nwith pool.checkout() as r:\n    ids = r.render_ids(messages)\n```\n\nEach slot owns its own tokenizer copy. Construction fans out across a thread pool so a 32-slot pool doesn't serially eat ~10–15s of `from_pretrained` calls at startup.\n\n## Why use a renderer\n\nFor RL the trainer must see the exact token ids the sampler saw. The standard alternative — let the inference engine apply the chat template, parse tool calls, parse reasoning, and re-render full history every turn — silently breaks token identity. These are the failure modes a renderer's `bridge_to_next_turn` sidesteps by never re-rendering prior turns:\n\n- **Boolean round-trip.** Engine emits `false`; client parses to Python `bool(False)`; `apply_chat_template` re-renders via `str(False)` → `\"False\"`. Capital F. Reproducible on Qwen3.5-35B-A3B + mini-swe-agent-plus at ~50% break rate per rollout.\n- **BPE retokenization drift.** The same substring tokenizes differently depending on neighbouring bytes. 
`json` + `p` + `enderer` (3 tokens) vs `jsonp` + `enderer` (2 tokens) when whitespace shifts by one character. Every subsequent token is shifted from there on.\n- **Tool-call XML drift.** The engine emits a no-arg call with a stylistic empty `\u003C\u002Fparameter>`; the Jinja re-render of the reconstructed dict drops it. Extension property broken at every such call.\n- **Thinking stripped from non-latest assistants.** Some templates strip `\u003Cthink>…\u003C\u002Fthink>` blocks from prior assistant turns when re-rendering. The recorded stream has the thinking; the next prompt does not.\n- **Max-seq-len truncation zeroing the anchor.** Client-side `max_seq_len` enforcement zeros `completion_ids` when `prompt_len > max_seq_len`. The bridge anchor is empty, falling back to full re-render — triggering every mode above.\n- **Scaffold-level history rewriting.** Some agent scaffolds (e.g. opencode's `experimental_repairToolCall`) rewrite tool calls before sending them back as history. The next turn's prompt contains a tool call the model never emitted. *A renderer cannot fix this — the drift happens before rendering.*\n\nEmpirical delta on Qwen3.5-35B-A3B + mini-swe-agent-plus, step 0:\n\n| client path                            | breaks | training samples from 64 rollouts |\n| -------------------------------------- | ------ | --------------------------------- |\n| `apply_chat_template` (full re-render) | 32     | 77                                |\n| renderers `bridge_to_next_turn`        | 0      | 64                                |\n\nEach break fragments a rollout into multiple training samples — every fragment re-encodes its prefix, inflating compute roughly linearly with the number of breaks.\n\n## Typed renderer configs\n\nEach renderer accepts a typed pydantic config that pins its template-control kwargs at construction. 
`create_renderer` and `create_renderer_pool` take one positional `config` argument:\n\n```python\nfrom renderers import (\n    create_renderer,\n    AutoRendererConfig,\n    Qwen3RendererConfig,\n    GLM5RendererConfig,\n    DefaultRendererConfig,\n)\n\n# Auto-resolve renderer from the tokenizer's model name. Carries the\n# shared preserve_* flags; template kwargs require an explicit choice.\nrenderer = create_renderer(tokenizer)\nrenderer = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True))\n\n# Explicit choice — the typed config exposes exactly the fields that\n# renderer's chat template honours.\nrenderer = create_renderer(tokenizer, Qwen3RendererConfig(enable_thinking=False))\nrenderer = create_renderer(tokenizer, GLM5RendererConfig(clear_thinking=False))\n\n# Default renderer (apply_chat_template fallback) — extra fields are\n# captured via pydantic ``extra=\"allow\"`` and forwarded to the Jinja\n# template; tool \u002F reasoning parsers are typed.\nrenderer = create_renderer(\n    tokenizer,\n    DefaultRendererConfig(tool_parser=\"qwen3\", reasoning_parser=\"think\"),\n)\n```\n\nDiscriminated union: every per-renderer config is a variant of `RendererConfig`, dispatched on the `name` field. Bogus combinations (e.g. `add_vision_id` under `name=\"qwen3\"`) error at construction with a `pydantic.ValidationError`. Downstream pydantic configs (prime-rl orchestrator, verifiers `ClientConfig`) hold a single field typed as `RendererConfig` and inherit the same strict-per-variant validation.\n\nTwo shared behaviour flags live on every variant via `_BaseRendererConfig`:\n\n- `preserve_all_thinking=True` — every past assistant's `reasoning_content` is kept, even when the chat template would drop it.\n- `preserve_thinking_between_tool_calls=True` — reasoning is kept on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). 
A new user turn closes the block and drops its thinking.\n\nThese OR-compose with template-level toggles (e.g. GLM-5 `clear_thinking`, Nemotron-3 `truncate_history_thinking`): either flag saying \"keep\" wins. preserve_* can only ever *extend* retention — never override a template kwarg into a \"drop\" decision. The canonical use case is **compaction**: injecting a `user` turn like *\"summarize the work so far\"* puts every prior assistant in a past cycle, and `preserve_all_thinking=True` keeps reasoning visible end-to-end.\n\n## `DefaultRenderer`\n\nFallback for unsupported models. Wraps `apply_chat_template` and accepts `tool_parser` \u002F `reasoning_parser` (vLLM convention) plus arbitrary Jinja kwargs via `DefaultRendererConfig`'s `extra=\"allow\"`. `bridge_to_next_turn` returns `None` because the template's close is unknown, so multi-turn rollouts fall back to full re-render. Implementing a hand-coded renderer is a few hundred lines of Python (`render_ids` + `parse_response` + `bridge_to_next_turn`) and is the only path that closes the failure modes above by construction.\n\n## Roadmap\n\n- **VLM support.** `ContentPart` is text-only today; `Qwen3VLRenderer` ships only because Qwen3-VL's text-only chat template differs from Qwen3's. Plan: add `ImagePart` \u002F `VideoPart`, multimodal bridges, validate against a Qwen3-VL RL run.\n- **Patched chat templates.** Some shipped templates re-tokenize history, normalize JSON, or auto-strip thinking — each breaks the extension property. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns.\n\n## Testing\n\n```bash\nuv sync --group dev\nuv run pytest\n```\n\nRound-trip parity (render → parse → original) and token-level parity against `apply_chat_template` are tested per renderer. 
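End-to-end validation runs against Reverse-Text, Wordle, OpenCode-Math, and RLM-SWE environments.\n\n## License\n\nLicensed under the [Apache License, Version 2.0](LICENSE).\n","PrimeIntellect-ai\u002Frenderers is a programmable chat template library for large language model (LLM) training and inference. Its core functionality is converting a model's chat template into a Python object that renders messages to token ids, parses completion ids into structured assistant messages, and extends multi-turn conversations without re-rendering history. The library supports several mainstream model families such as Qwen and GLM, and provides a generic default renderer. renderers is published standalone on PyPI and is compatible with multiple training and inference frameworks such as transformers and vLLM. It suits scenarios that need to efficiently generate text-based conversation datasets or enhance existing dialogue systems, and is particularly effective when used together with verifiers in reinforcement learning training environments.","2026-06-11 03:58:41","CREATED_QUERY"]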