[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81005":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":11,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":14,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":15,"rankGlobal":9,"rankLanguage":9,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":9,"pushedAt":9,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":14,"starSnapshotCount":14,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},81005,"slime-n","slime-n\u002Fslime-n","A Multi-Policy, Multi-Agent RL Training Framework","",null,"Python",30,1,4,0,0.9,"Apache License 2.0",false,"main",[],"2026-06-12 02:04:09","\u003Cdiv align=\"center\">\n\n# slime\u003Csup>[n](https:\u002F\u002Fgithub.com\u002Fslime-n\u002Fslime-n)\u003C\u002Fsup>\n\n### A Multi-Policy, Multi-Agent RL Framework\n\n### One config. Any mix of training, inference engines. 
From 1 policy to 100+.\n\u003C\u002Fdiv>\n\n\nslime\u003Csup>n\u003C\u002Fsup> extends [slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime) into a flexible multi-policy, multi-agent RL training framework.\n\nUnlike most RL frameworks that assume a fixed structure — such as a single trainer or a hard-coded actor–critic setup — slime\u003Csup>n\u003C\u002Fsup> takes a compositional approach: each run is defined as a list of components, freely assembled from three primitives:\n\n- **Trainable policy pair**: a Megatron training actor paired with an SGLang rollout engine.\n- **Standalone Megatron actor**: a Megatron-only component, either trainable or frozen.\n- **Standalone SGLang engine**: an inference-only engine for frozen policies, reward models, judges, or verifiers.\n\n![slime^n](.\u002Fimgs\u002Farch_2.png) \n\nWith this unified schema, the same framework can support on-policy distillation, cooperative multi-agent RL, asymmetric PPO, reward-model serving, and other multi-policy workloads — without custom plumbing for each setup.\n\n\n\n## Use Cases\n\nOnce **policy** is the unit of ownership — weights, optimizer, buffer, checkpoint — a wide class of multi-role RL systems becomes natural to express. The schema already covers three families, all composed from the same three primitives.\n\nFor each multi-agent use case below, the left figure is the **conceptual schema** (the data flow between roles) and the right figure is the **slime\u003Csup>n\u003C\u002Fsup> framework** view (the Megatron \u002F SGLang policy layout at runtime). PPO and OPD show the framework view only.\n\n### 1. Asymmetric PPO\n\nActor and critic are separate policies, each with its own architecture, optimizer, and buffer — the critic a standalone Megatron value head with no SGLang engine. 
Code: [`examples\u002Fmulti_policy_ppo`](examples\u002Fmulti_policy_ppo).\n\n| |\n|:---:|\n| ![Asymmetric PPO](.\u002Fexamples\u002Fmulti_policy_ppo\u002Fimgs\u002Farch.png) |\n| **Asymmetric PPO** (actor + critic) |\n\n### 2. On-policy Distillation\n\nA trainable student paired with a frozen teacher that returns per-token logprobs for a reverse-KL term. The teacher runs on either backend. Code: [`examples\u002Fmulti_policy_opd_megatron`](examples\u002Fmulti_policy_opd_megatron) · [`examples\u002Fmulti_policy_opd_sglang`](examples\u002Fmulti_policy_opd_sglang).\n\n| | |\n|:---:|:---:|\n| ![On-policy Distillation — Megatron teacher](.\u002Fexamples\u002Fmulti_policy_opd_megatron\u002Fimgs\u002Farch.png) | ![On-policy Distillation — SGLang teacher](.\u002Fexamples\u002Fmulti_policy_opd_sglang\u002Fimgs\u002Farch.png) |\n| **On-policy Distillation — Megatron teacher** | **On-policy Distillation — SGLang teacher** |\n\n### 3. Multi-Agent Systems (Multiple policies)\n\nMultiple trainable policies cooperating in a single run — debate, candidate generation + synthesis, cooperative swarms, generator\u002Fverifier loops, orchestrator + subagents, shared-state rounds, and staged solver pipelines.\n\n**3.1 Consensus Debate**\n\nN generator agents propose independent answers, then in later rounds each critic agent revises its own answer against a summary of the other agents' responses, with the majority-vote answer as the only training signal. 
Code: [`examples\u002Fmulti_policy_consensus_debate`](examples\u002Fmulti_policy_consensus_debate).\n\n| schema | slime\u003Csup>n\u003C\u002Fsup> |\n|:---:|:---:|\n| ![Consensus debate schema](.\u002Fexamples\u002Fmulti_policy_consensus_debate\u002Fimgs\u002Fschema.png) | ![Consensus debate framework](.\u002Fexamples\u002Fmulti_policy_consensus_debate\u002Fimgs\u002Farch.png) |\n\n**3.2 Solver + Summarizer**\n\nThe solver generates N candidate solutions per prompt and the summarizer synthesizes them into a single final answer, with both policies trained jointly on their own correctness rewards plus group reward shaping. Code: [`examples\u002Fmulti_policy_solver_summarizer`](examples\u002Fmulti_policy_solver_summarizer).\n\n| schema | slime\u003Csup>n\u003C\u002Fsup> |\n|:---:|:---:|\n| ![Solver+Summarizer schema](.\u002Fexamples\u002Fmulti_policy_solver_summarizer\u002Fimgs\u002Fschema.png) | ![Solver+Summarizer framework](.\u002Fexamples\u002Fmulti_policy_solver_summarizer\u002Fimgs\u002Farch.png) |\n\n**3.3 Generator + Verifier**\n\nThe generator answers, the verifier critiques, and the generator revises with its round-1 answer carried forward — two trainable policies looping answer → critique → revise. Code: [`examples\u002Fmulti_policy_generator_verifier`](examples\u002Fmulti_policy_generator_verifier).\n\n| schema | slime\u003Csup>n\u003C\u002Fsup> |\n|:---:|:---:|\n| ![Generator+Verifier schema](.\u002Fexamples\u002Fmulti_policy_generator_verifier\u002Fimgs\u002Fschema.png) | ![Generator+Verifier framework](.\u002Fexamples\u002Fmulti_policy_generator_verifier\u002Fimgs\u002Farch.png) |\n\n**3.4 Orchestrator + Subagents**\n\nAn orchestrator plans and dispatches the prompt to several subagents pursuing different approaches, then synthesizes their returned results into a final answer. 
Code: [`examples\u002Fmulti_policy_orchestrator_subagent`](examples\u002Fmulti_policy_orchestrator_subagent).\n\n| schema | slime\u003Csup>n\u003C\u002Fsup> |\n|:---:|:---:|\n| ![Orchestrator+Subagent schema](.\u002Fexamples\u002Fmulti_policy_orchestrator_subagent\u002Fimgs\u002Fschema.png) | ![Orchestrator+Subagent framework](.\u002Fexamples\u002Fmulti_policy_orchestrator_subagent\u002Fimgs\u002Farch.png) |\n\n**3.5 Cooperative Swarm**\n\nEight independent agents answer the same prompt in parallel, blending a per-agent reward from self-GRPO, swarm EMA pass-rate, and peer ranking into a single advantage. Code: [`examples\u002Fmulti_policy_exam_swarm`](examples\u002Fmulti_policy_exam_swarm).\n\n| schema | slime\u003Csup>n\u003C\u002Fsup> |\n|:---:|:---:|\n| ![Swarm schema](.\u002Fexamples\u002Fmulti_policy_exam_swarm\u002Fimgs\u002Fschema.png) | ![Swarm framework](.\u002Fexamples\u002Fmulti_policy_exam_swarm\u002Fimgs\u002Farch.png) |\n\n**3.6 Shared-State Peers**\n\nTwo peers alternately read from and write to a versioned shared state across rounds, each round's updated state feeding both peers in the next. Code: [`examples\u002Fmulti_policy_shared_state`](examples\u002Fmulti_policy_shared_state).\n\n| schema | slime\u003Csup>n\u003C\u002Fsup> |\n|:---:|:---:|\n| ![Shared-State schema](.\u002Fexamples\u002Fmulti_policy_shared_state\u002Fimgs\u002Fschema.png) | ![Shared-State framework](.\u002Fexamples\u002Fmulti_policy_shared_state\u002Fimgs\u002Farch.png) |\n\n**3.7 Solver-Rewriter-Selector**\n\nThe solver emits N candidates, the rewriter refines them after seeing all N, and the selector picks the single best answer out of the N. 
Code: [`examples\u002Fmulti_policy_solver_rewriter_selector`](examples\u002Fmulti_policy_solver_rewriter_selector).\n\n| schema | slime\u003Csup>n\u003C\u002Fsup> |\n|:---:|:---:|\n| ![Solver-Rewriter-Selector schema](.\u002Fexamples\u002Fmulti_policy_solver_rewriter_selector\u002Fimgs\u002Fschema.png) | ![Solver-Rewriter-Selector framework](.\u002Fexamples\u002Fmulti_policy_solver_rewriter_selector\u002Fimgs\u002Farch.png) |\n\n\n\n\n## Implementation Details\n\n- **`train_multi_policy.py`**: driver for n≥1 trainable policies. Replaces `train.py` for multi-policy runs.\n- **YAML-driven configs**: `--config \u003Cpath>.yaml`. Per-policy fields (parallelism, batching, optimizer, loss, paths, Megatron numerical and dropout settings, `log_probs_chunk_size`) live in the YAML; cluster sizing is derived from the `policies` list. See [`slime\u002Futils\u002Fpolicy_config.py`](slime\u002Futils\u002Fpolicy_config.py).\n- **Per-policy buffers (split mode)**: each policy trains on its own samples, tagged via `Sample.policy_name`.\n- **Per-policy weight sync**: serialized push from each Megatron actor to its paired SGLang engine.\n\n## Multi-Policy YAML Config\n\nMulti-policy runs are defined by a single YAML file passed with `--config`. The top-level `policies` list is the source of truth for the run composition: each entry declares one policy's identity, trainability, checkpoints, buffer routing, GPU slice, Megatron training settings, and optional SGLang engine settings. 
Policy names must be unique, and each paired policy gets a 1:1 SGLang server with the same name.\n\n```yaml\npolicies:\n  - name: solver\n    role: actor\n    trainable: true\n    hf_checkpoint: \u002Froot\u002FQwen3-0.6B\n    load: \u002Fckpt\u002Fsolver\n    buffer_mode: split\n\n    num_gpus_per_node: 1\n    megatron_num_nodes: 1\n    sglang_num_nodes: 1\n\n    megatron:\n      tensor_model_parallel_size: 1\n      global_batch_size: 64\n      lr: 1.0e-6\n      advantage_estimator: grpo\n      n_samples_per_prompt: 8\n\n    sglang:\n      num_gpus_per_engine: 1\n      mem_fraction_static: 0.85\n\n  - name: summarizer\n    role: actor\n    trainable: true\n    hf_checkpoint: \u002Froot\u002FQwen3-0.6B\n    load: \u002Fckpt\u002Fsummarizer\n    buffer_mode: split\n\n    num_gpus_per_node: 1\n    megatron_num_nodes: 1\n    sglang_num_nodes: 1\n\n    megatron:\n      tensor_model_parallel_size: 1\n      global_batch_size: 64\n      lr: 1.0e-6\n      advantage_estimator: grpo\n      n_samples_per_prompt: 8\n\n    sglang:\n      num_gpus_per_engine: 1\n      mem_fraction_static: 0.85\n```\n\nThe example above defines the solver+summarizer multi-policy run: `solver` generates 8 candidate solutions per prompt, and `summarizer` synthesizes a final answer over those candidates. Both policies use `n_samples_per_prompt: 8` so GRPO has a group of size 8 for advantage normalization on each side. Each trainable policy has its own paired Megatron actor and SGLang engine; both train on split buffers tagged via `Sample.policy_name`.\n\nThe `megatron:` block is flattened into the per-policy Megatron argument namespace, so parallelism, recompute, batching, optimizer, loss, KL, and OPD fields can differ by policy. The `sglang:` block is projected into the SGLang model\u002Fserver config; `model_path` defaults to `hf_checkpoint`, and server arguments such as `mem_fraction_static`, `cuda_graph_bs`, and `max_total_tokens` are passed through.\n\nCluster sizing is derived from the YAML. 
Without `--colocate`, total GPUs are `sum(megatron_num_nodes * num_gpus_per_node) + sum(sglang_num_nodes * num_gpus_per_node)` across active policies. With `--colocate`, slime uses the larger of the Megatron and SGLang sides. A frozen standalone Megatron teacher sets `trainable: false` and `sglang_num_nodes: 0`.\n\n## Experimental Results\n\nTwo multi-agent cooperations trained on DAPO-math-17k. In both, every policy carries its own optimizer and split buffer, and rewards rise jointly.\n\n**Consensus Debate** — generator + critic, with the ŷ majority vote over critic outputs as the only signal (gold label ignored). Code: [`examples\u002Fmulti_policy_consensus_debate`](examples\u002Fmulti_policy_consensus_debate).\n\n![consensus debate training reward](.\u002Fexamples\u002Fmulti_policy_consensus_debate\u002Fimgs\u002Freward.png)\n\n**Solver + Summarizer** — solver emits N candidates, summarizer synthesizes a final `\\boxed{...}` answer; both get RLVR correctness rewards plus summarizer-phase group shaping. Code: [`examples\u002Fmulti_policy_solver_summarizer`](examples\u002Fmulti_policy_solver_summarizer).\n\n![solver + summarizer training reward](.\u002Fexamples\u002Fmulti_policy_solver_summarizer\u002Fimgs\u002Freward.png)\n\n\n## Run\n\n```bash\nbash examples\u002Fmulti_policy_two_agent\u002Frun-qwen3-0.6B-two-policy-two-agent.sh\n```\n\nWhich boils down to:\n\n```bash\nray job submit ... 
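-- python3 train_multi_policy.py --config examples\u002Fmulti_policy_two_agent\u002Fconfig.yaml\n```\n\nSee [`train_multi_policy.py`](train_multi_policy.py) for the train-loop body and the architecture figure above (source: `..\u002Ffig_arch_2.typ`) for the runtime layout.\n","slime\u003Csup>n\u003C\u002Fsup> is a reinforcement learning training framework that supports multiple policies and multiple agents. Its core functionality is a set of freely composable components (trainable policy pairs, standalone Megatron actors, and standalone SGLang engines) that can be assembled flexibly to meet different training and inference needs. Technically, slime\u003Csup>n\u003C\u002Fsup> uses a modular design, so scaling from a single policy to many policies (even more than 100) is straightforward and needs no custom development for each setup. The framework suits workloads such as asymmetric PPO, on-policy distillation, and cooperative multi-agent systems, and is especially useful for researching and developing reinforcement learning applications in which multiple roles or policies interact.",2,"2026-06-11 04:03:10","CREATED_QUERY"]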