[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75516":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},75516,"AutoTTS","zhengkid\u002FAutoTTS","zhengkid","The official repo for \"LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling\"",null,"Python",165,15,4,1,0,9,117,3.61,false,"main",true,[],"2026-06-12 02:03:34","# AutoTTS\n\n**LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling**\n\nTong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, Rui Liu, Runpeng Dai, Ruibo Chen, Chenxi Liu, Tianyi Xiong, Xidong Wu, Hongming Zhang, Heng Huang\n*UMD · UVA · WUSTL · UNC · Google · Meta*\n\n[Project page](https:\u002F\u002Fzhengkid.github.io\u002FAutoTTS-web\u002F)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figs\u002Fauto-tts.pdf.png\" alt=\"AutoTTS system overview\" width=\"100%\">\n\u003C\u002Fp>\n\nAutoTTS reframes TTS strategy design from **hand-crafting heuristics** to **environment-driven automatic search**: humans construct only an **offline replay environment** (states, actions, feedback, objectives), and a **coding agent** iteratively proposes and refines **code-defined controllers** within it, using **code edits, no gradient updates**. 
**Cheap: 0 LLM calls during discovery evaluation; candidates are scored entirely by replay.**\n\n**Quick links:** [Install](#install) · [Reproduction](#reproduction) · [Citation](#citation)\n\n## Highlighted results\n\n- ~**69.5% tokens saved** vs SC@64 at β ≈ 0.5; held-out average accuracy matches SC@64 across four backbone scales.\n- **$39.9** estimated monetary cost for one full discovery run.\n- **160 minutes** wall-clock for the same run.\n- **0** LLM calls during discovery evaluation (replays cached segments only).\n\nThe discovered controller is the **Confidence Momentum Controller (CMC)**, characterized by trend-based stopping, coupled width–depth control, alignment-aware depth allocation, and conservative branch abandonment.\n\n---\n\n## Problem setup\n\nWe treat adaptive test-time inference as allocating a finite budget over branches in fixed-length intervals.\n\n**State** at step `t`:\n\n```text\ns_t = (q, m_t, I_t, ℓ_t, Ω_t)\n```\n\n`q`: question; `m_t`: number of instantiated branches; `I_t`: active branch set; `ℓ_t`: depth vector; `Ω_t`: revealed probe triples.\n\n**Admissible actions** `A(s_t)`:\n\n- `BRANCH`: open a new branch through the first interval.\n- `CONTINUE(i)`: advance branch `i` by one interval.\n- `PROBE(i)`: reveal `ω_{i,ℓ}` without advancing depth.\n- `PRUNE(i)`: deactivate branch `i`; depths and past probes stay recorded.\n- `ANSWER`: terminate and apply the controller's terminal aggregator.\n\n**Cost** in interval units:\n\n```text\nCost(s_t) = Σ_i ℓ_{t,i} + κ_probe · |Ω_t|        (often κ_probe = 0)\n```\n\n**Objective.** A code-defined policy `π(· | s, β)` is parameterized by a scalar meta-parameter `β` that deterministically schedules every internal hyper-parameter. Over tasks `(q, y) ~ 𝒟`:\n\n```text\nmax_{π, β}  E_{q,y}[ 1{ŷ_{π,β}(q) = y}  −  γ · C_{π,β}(q) ]\n```\n\nThe **outer loop** searches over implementations of `π`. 
Each candidate is replay-evaluated on offline caches; traces and scaling curves enter the next round's history.\n\n---\n\n## Environment construction (run once per (model, benchmark))\n\nThe MDP above is instantiated as a concrete replay environment **before** the discovery loop starts:\n\n1. **Specify the interface.** Fix `s_t`, `A(s_t)`, `Cost(s_t)`, and the accuracy–cost objective.\n2. **Offline trajectory collection.** For each query, draw `N` parallel independent reasoning traces from the backbone (full strings first), then partition each trace into fixed-length segments of `Δ` tokens and enumerate branch prefixes `z_{i,k}` with probe responses `ω_{i,k}`.\n3. **Materialize the replay store.** Every environment transition consults the archived table; e.g. `PROBE(i)` retrieves the cached `ω_{i,k}` without any new decoding.\n4. **Hand off to discovery.** Candidate controllers are simulated exclusively through `observe`\u002F`step`. Asymptotic evaluation cost is dominated by table replay.\n\nSteps 1–3 run once. Iterative coding-agent discovery starts only after the replay store is **frozen**.\n\nIn this repository:\n\n- `efficient_reasoning_controller\u002Fworkspace\u002Fcode_base\u002Fenvironment\u002F` — search-set replay store.\n- `efficient_reasoning_controller\u002Ftest_environment\u002F` — held-out replay store; never exposed to the proposer.\n\n---\n\n## Discovery: β parameterization & trace feedback\n\n- **β parameterization.** Each candidate controller exports a single scalar `β` plus a deterministic, monotonic map from `β` to every internal knob. Outer search collapses to sweeping `β`, eliminating brittle thresholds tuned only to the search set.\n- **History augmentation with execution traces.** Alongside each round's β-sweep we archive both empirical scaling curves *and* the full action-by-action trajectories reconstructed during replay. 
Traces give the explorer fine-grained behavioral evidence to localize defects before rewriting code.\n\n---\n\n## Main results\n\nAutoTTS is optimized on AIME24 replay constructions and evaluated on held-out AIME25 \u002F HMMT25 benchmarks across four Qwen3 backbone scales. The project page reports the following trends:\n\n- **Better accuracy–token trade-offs.** Discovered controllers typically shift the empirical Pareto frontier beyond handcrafted baselines such as SC@64, ASC, ESC, and Parallel-Probe.\n- **Held-out generalization.** Policies discovered on AIME24 transfer to held-out benchmarks, outperforming every handcrafted baseline on average accuracy for three of four backbone scales and remaining competitive on Qwen3-8B.\n- **β = 0.5 operating point.** Cuts aggregate token usage by roughly **69.5%** compared with SC@64 while matching mean held-out accuracy across models.\n- **β = 1.0 operating point.** Pushes peak accuracy beyond all handcrafted baselines in five of the eight tabulated comparison cells on the project page.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figs\u002Fmain_results_table.png\" alt=\"Main quantitative table across Qwen3 scales and benchmarks\" width=\"100%\">\n\u003C\u002Fp>\n\nSweeping `β` traces accuracy–token scaling curves: larger `β` generally moves toward higher-budget, accuracy-first behavior, while smaller `β` favors cheaper inference.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figs\u002Fscaling_curves.png\" alt=\"Accuracy-token scaling curves on held-out benchmarks\" width=\"100%\">\n\u003C\u002Fp>\n\n### Evolution of the discovery process\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figs\u002Fsearch_heldout_trajectory_beta1.png\" alt=\"Search and held-out trajectory under beta equals one\" width=\"100%\">\n\u003C\u002Fp>\n\nThe round-level trajectory (e.g., `t1 -> t5` in the figure above) shows a consistent move toward better objective values over the search process:\n\n- On the search benchmark, later rounds 
improve accuracy while keeping token growth controlled, indicating progressively better policy structure rather than random fluctuation.\n- On held-out benchmarks, the same trajectory remains competitive and often improves, suggesting that the discovered control logic transfers beyond the optimization split.\n- The trajectory reflects **objective-seeking code evolution without gradient updates**: the agent edits explicit controller programs, receives replay-based accuracy\u002Fcost feedback, and iteratively shifts behavior toward better empirical trade-offs.\n\nThis is a key point of AutoTTS: optimization is achieved through iterative program search in a fixed replay environment, not through backpropagation or parameter fine-tuning of the backbone model.\n\n---\n\n## Discovered controller: CMC\n\nThe discovered controller is named the **Confidence Momentum Controller (CMC)**. Its main mechanisms are:\n\n- **Trend-based stopping.** CMC maintains an exponential moving average of pool confidence and stops only when the confidence level is high and the trend is non-negative. 
This avoids stopping on transient confidence spikes.\n- **Coupled width–depth control.** Widening and deepening are linked through the EMA delta: strong confidence gains suppress new branch spawning, while stagnation or regression triggers widening.\n- **Alignment-aware depth allocation.** Branches whose latest answer matches the pool winner receive extra probe steps, concentrating compute on the emerging consensus while still advancing active branches.\n- **Conservative branch abandonment.** A branch is abandoned only after persistently deviating for multiple rounds, and at least two active branches are preserved.\n\nThese mechanisms are implemented as code-defined controller logic and evaluated through the same replay environment as the handcrafted baselines.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Show full \u003Ccode>OptimalController\u003C\u002Fcode> source (CMC, click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nclass OptimalController(LLMDesignedMethod):\n    \"\"\"\n    Confidence Momentum Controller (CMC).\n\n    Core idea\n    ---------\n    All prior proposals (IBC, SCR, DGCC) share the same fundamental stopping\n    signal: \"instantaneous\" Beta-majority confidence computed from the\n    completed-answer pool at the current step.  
This is susceptible to\n    single-step confidence spikes: a lucky early cluster of identical answers\n    can fire the gate prematurely before the distribution has stabilised.\n\n    CMC replaces the instantaneous confidence gate with a **momentum-aware**\n    gate:\n      - Track an exponential moving average (EMA) of pool confidence over\n        the last `T_ema` rounds: ema_conf = alpha * conf + (1 - alpha) * ema_conf\n      - Track the recent improvement delta: delta = ema_conf - ema_conf_prev\n      - Gate fires when BOTH of the following hold:\n          (a) ema_conf >= conf_thresh  (level requirement)\n          (b) delta >= -slack          (non-deteriorating momentum; slack is\n              a small tolerance that prevents stopping on a declining signal)\n      This means the controller cannot stop on a one-round spike; the EMA\n      must be high and not actively falling.\n\n    Adaptive depth allocation via probe-age priority\n    ------------------------------------------------\n    Each active unfinished branch tracks `probe_count` (how many probe steps\n    it has received).  In each round the controller allocates a per-round\n    probe budget of `probe_budget` steps distributed across active branches\n    using a **priority queue** sorted by probe_count descending.  
The most-\n    invested branches get served first (up to `burst_senior` extra steps\n    each), then remaining budget goes to less-invested branches.\n    This concentrates depth on branches that are closest to completion while\n    still advancing younger branches, rather than uniform or purely aligned-\n    biased allocation (SCR) or lazy sleeping (DGCC).\n\n    Three-tier branch classification\n    ---------------------------------\n    After warm_up:\n      - \"aligned\":  latest answer == pool_winner\n      - \"deviant\":  latest answer != pool_winner, disagreed for >= 1 round\n      - \"neutral\":  no pool winner yet, or first round of disagreement\n    Tier affects the per-branch probe multiplier:\n      aligned  -> multiplier = `burst_aligned`  (e.g. 2 at high beta)\n      neutral  -> multiplier = 1\n      deviant  -> multiplier = 1, but if deviant for >= `abandon_patience`\n                  rounds the branch is abandoned\n\n    Confidence-trend widening\n    -------------------------\n    Widening (spawning new branches) is driven by whether the confidence\n    *trend* (delta) is positive and large, or weak\u002Fnegative:\n      - if delta > trend_thresh: confidence is accelerating -> no widening\n        (we're on track to stop soon)\n      - if delta \u003C= trend_thresh: plateau or regression -> widen by\n        `widen_burst` new branches, up to max_branch ceiling\n    This directly couples width decision to whether deepening is yielding\n    evidence-quality gains, a feedback loop not present in prior proposals.\n\n    Beta schedule\n    -------------\n    All hyperparameters are deterministic functions of a single beta in [0,1].\n    beta=0 -> conservative (few branches, low EMA inertia, easier to stop)\n    beta=1 -> near-full budget (many branches, high inertia, harder to stop)\n\n    Novelty vs prior work\n    ---------------------\n    ASC \u002F ESC: full reads; no incremental probing.\n    Parallel_Probe: fixed cohort; instantaneous majority; 
no pool\u002Fcompletion\n      distinction; no EMA.\n    IBC (r0001): instantaneous pool confidence gate; uniform 1-step probing;\n      1-branch-per-round widening; no EMA or trend.\n    SCR (r0002): asymmetric burst (aligned gets more steps); plateau-triggered\n      widening; instantaneous gate; no EMA.\n    DGCC (r0003): dual instantaneous gate (primary + soft corroboration);\n      lazy sleeping for locked branches; vote-gap proportional widening;\n      no EMA momentum.\n    CMC: replaces ALL instantaneous gates with a single EMA momentum gate;\n      introduces probe-age priority scheduling (neither uniform nor burst-\n      aligned-only); confidence-trend widening (neither plateau nor vote-gap);\n      three-tier classification is a natural simplification vs DGCC's dual\n      gate without adding extra hyperparameters.\n    \"\"\"\n\n    NAME = \"optimal_controller\"\n\n    _MAX_BRANCH   = 64\n    _MAX_OUTER    = 500\n\n    def _schedule(self, beta: float) -> dict:\n        \"\"\"\n        All schedules are smooth analytic functions of beta in [0,1].\n        Monotonicity:\n          - Parameters controlling budget use (n_init, max_branch_use,\n            burst_aligned, widen_burst, warm_up, abandon_patience, T_ema)\n            are NON-DECREASING in beta.\n          - conf_thresh is NON-DECREASING in beta (harder to stop -> more budget).\n          - trend_thresh is NON-INCREASING in beta (easier to trigger widening\n            at high beta -> more budget via wider exploration).\n          - ema_alpha is NON-INCREASING in beta (lower alpha = slower EMA =\n            more inertia = more budget at high beta).\n        \"\"\"\n        b = max(0.0, min(1.0, float(beta)))\n\n        n_init           = max(2, round(2  + 6  * b))\n        max_branch_use   = min(self._MAX_BRANCH, round(4 + 60 * b))\n        warm_up          = max(2, round(2  + 8  * b))\n        abandon_patience = max(3, round(3  + 9  * b))\n\n        T_ema            = max(2, round(2  + 6  * 
b))\n        ema_alpha        = 0.70 - 0.40 * b\n\n        conf_thresh      = 0.85 + 0.12 * b\n        delta_slack      = 0.04 - 0.03 * b\n\n        burst_aligned    = max(1, round(1 + 2 * b))\n\n        widen_burst      = max(1, round(1 + 3 * b))\n        trend_thresh     = 0.04 - 0.03 * b\n\n        min_complete     = max(2, round(2 + 3 * b))\n\n        return {\n            \"n_init\":           n_init,\n            \"max_branch_use\":   max_branch_use,\n            \"warm_up\":          warm_up,\n            \"abandon_patience\": abandon_patience,\n            \"T_ema\":            T_ema,\n            \"ema_alpha\":        round(ema_alpha, 4),\n            \"conf_thresh\":      round(conf_thresh, 4),\n            \"delta_slack\":      round(delta_slack, 4),\n            \"burst_aligned\":    burst_aligned,\n            \"widen_burst\":      widen_burst,\n            \"trend_thresh\":     round(trend_thresh, 4),\n            \"min_complete\":     min_complete,\n        }\n\n    def __init__(self, config: Optional[Dict[str, Any]] = None):\n        super().__init__(config)\n        self._beta            = float((config or {}).get(\"beta\", 0.5))\n        sched                 = self._schedule(self._beta)\n        self.n_init           = sched[\"n_init\"]\n        self.max_branch_use   = sched[\"max_branch_use\"]\n        self.warm_up          = sched[\"warm_up\"]\n        self.abandon_patience = sched[\"abandon_patience\"]\n        self.T_ema            = sched[\"T_ema\"]\n        self.ema_alpha        = sched[\"ema_alpha\"]\n        self.conf_thresh      = sched[\"conf_thresh\"]\n        self.delta_slack      = sched[\"delta_slack\"]\n        self.burst_aligned    = sched[\"burst_aligned\"]\n        self.widen_burst      = sched[\"widen_burst\"]\n        self.trend_thresh     = sched[\"trend_thresh\"]\n        self.min_complete     = sched[\"min_complete\"]\n        self.trace_recorder   = MethodTraceRecorder()\n\n    def _reset_trace(self) -> None:\n        
self.trace_recorder = MethodTraceRecorder()\n\n    def _trace_step(\n        self,\n        *,\n        event: str,\n        goal: str,\n        step_input: Dict[str, Any],\n        step_output: Any,\n        state: Dict[str, Any],\n        decision: str,\n    ) -> None:\n        self.trace_recorder.add_step(\n            event=event,\n            goal=goal,\n            input=step_input,\n            output=step_output,\n            state=state,\n            decision=decision,\n        )\n\n    def get_last_trace(self) -> List[Dict[str, Any]]:\n        return self.trace_recorder.to_list()\n\n    def solve_with_trace(self, question) -> Dict[str, Any]:\n        answer = self.solve(question)\n        return {\"answer\": answer, \"trace\": self.get_last_trace()}\n\n    def _pool_stats(self, completed: List[str]):\n        \"\"\"(winner, top1, top2, conf) over completed-answer pool.\"\"\"\n        if not completed:\n            return None, 0, 0, 0.0\n        winner, top1, top2, _ = _vote_stats(completed)\n        conf = _beta_majority_confidence(top1, top2)\n        return winner, top1, top2, conf\n\n    def _update_ema(self, ema_prev: float, new_val: float) -> float:\n        \"\"\"EMA update: ema = (1 - alpha) * ema_prev + alpha * new_val.\"\"\"\n        return (1.0 - self.ema_alpha) * ema_prev + self.ema_alpha * new_val\n\n    def _classify_branch(\n        self,\n        br: Dict[str, Any],\n        pool_winner,\n        warm_enough: bool,\n    ) -> str:\n        if not warm_enough or pool_winner is None:\n            return \"neutral\"\n        if br[\"latest_ans\"] == pool_winner:\n            return \"aligned\"\n        return \"deviant\"\n\n    def _probe_branch(\n        self,\n        question,\n        br: Dict[str, Any],\n        completed_answers: List[str],\n        n_steps: int,\n    ) -> None:\n        \"\"\"Probe branch br for up to n_steps steps; record completions.\"\"\"\n        for _ in range(n_steps):\n            if br[\"finished\"]:\n           
     break\n            out = _safe_probe_more(question, br[\"index\"])\n            if out is None:\n                br[\"finished\"] = True\n                if br[\"latest_ans\"] is not None:\n                    completed_answers.append(br[\"latest_ans\"])\n                break\n            new_ans, is_finish = out\n            br[\"probe_count\"] += 1\n            br[\"latest_ans\"] = new_ans\n            br[\"finished\"] = is_finish\n            if is_finish:\n                completed_answers.append(new_ans)\n                break\n\n    def solve(self, question) -> Optional[str]:\n        self._reset_trace()\n        self._trace_step(\n            event=\"start\",\n            goal=\"initialize CMC run\",\n            step_input={\"beta\": self._beta},\n            step_output=\"initialized\",\n            state={\n                \"n_init\":           self.n_init,\n                \"max_branch_use\":   self.max_branch_use,\n                \"warm_up\":          self.warm_up,\n                \"abandon_patience\": self.abandon_patience,\n                \"T_ema\":            self.T_ema,\n                \"ema_alpha\":        self.ema_alpha,\n                \"conf_thresh\":      self.conf_thresh,\n                \"delta_slack\":      self.delta_slack,\n                \"burst_aligned\":    self.burst_aligned,\n                \"widen_burst\":      self.widen_burst,\n                \"trend_thresh\":     self.trend_thresh,\n                \"min_complete\":     self.min_complete,\n            },\n            decision=\"start confidence momentum controller\",\n        )\n\n        # Branch state:\n        #   index          : stable branch_index from probe_new\n        #   latest_ans     : current answer (intermediate or final)\n        #   finished       : bool — branch exhausted its full budget\n        #   abandoned      : bool — dropped due to persistent deviance\n        #   probe_count    : number of probe_more steps received\n        #   
disagree_rounds: consecutive rounds where answer != pool_winner\n        branches: List[Dict[str, Any]] = []\n        completed_answers: List[str] = []\n        total_spawned = 0\n\n        # ---- Phase 0: open n_init branches ----\n        for _ in range(self.n_init):\n            out = _safe_probe_new(question)\n            if out is None:\n                break\n            ans, idx, is_finish = out\n            total_spawned += 1\n            br: Dict[str, Any] = {\n                \"index\":           idx,\n                \"latest_ans\":      ans,\n                \"finished\":        is_finish,\n                \"abandoned\":       False,\n                \"probe_count\":     0,\n                \"disagree_rounds\": 0,\n            }\n            branches.append(br)\n            if is_finish:\n                completed_answers.append(ans)\n\n        self._trace_step(\n            event=\"init_branches\",\n            goal=\"open initial branch batch\",\n            step_input={\"n_init\": self.n_init},\n            step_output={\n                \"n_spawned\":   total_spawned,\n                \"n_completed\": len(completed_answers),\n            },\n            state={\"total_spawned\": total_spawned},\n            decision=\"proceed to main loop\",\n        )\n\n        if not branches:\n            self._trace_step(\n                event=\"finish\",\n                goal=\"return final answer\",\n                step_input={},\n                step_output={\"answer\": None, \"stop_reason\": \"no_branches\"},\n                state={\"total_spawned\": 0},\n                decision=\"no branches available\",\n            )\n            return None\n\n        # EMA state — initialised to 0 (no evidence yet)\n        ema_conf       = 0.0\n        ema_conf_prev  = 0.0\n        ema_history: List[float] = []\n\n        outer_step = 0\n\n        while outer_step \u003C self._MAX_OUTER:\n\n            # ---- Compute current pool stats ----\n            
pool_winner, top1, top2, pool_conf = self._pool_stats(completed_answers)\n            n_complete = len(completed_answers)\n            warm_enough = (outer_step >= self.warm_up)\n\n            # ---- Update EMA ----\n            ema_conf_prev = ema_conf\n            ema_conf = self._update_ema(ema_conf, pool_conf)\n            ema_history.append(ema_conf)\n            if len(ema_history) > self.T_ema:\n                ema_history.pop(0)\n\n            if len(ema_history) >= 2:\n                ema_delta = ema_history[-1] - ema_history[0]\n            else:\n                ema_delta = 0.0\n\n            # ---- Classify branches and update disagree_rounds ----\n            if warm_enough and pool_winner is not None:\n                for br in branches:\n                    if br[\"abandoned\"] or br[\"finished\"]:\n                        continue\n                    tier = self._classify_branch(br, pool_winner, warm_enough)\n                    if tier == \"deviant\":\n                        br[\"disagree_rounds\"] += 1\n                    else:\n                        br[\"disagree_rounds\"] = 0\n\n            # ---- Abandon persistently deviant branches (keep >= 2 alive) ----\n            abandoned_this: List[int] = []\n            if warm_enough and pool_winner is not None:\n                n_alive = sum(\n                    1 for br in branches\n                    if not br[\"abandoned\"] and not br[\"finished\"]\n                )\n                cands = sorted(\n                    [\n                        br for br in branches\n                        if not br[\"abandoned\"]\n                        and not br[\"finished\"]\n                        and br[\"disagree_rounds\"] >= self.abandon_patience\n                    ],\n                    key=lambda b: -b[\"disagree_rounds\"],\n                )\n                max_abandon = max(0, n_alive - 2)\n                for br in cands[:max_abandon]:\n                    br[\"abandoned\"] = True\n    
                abandoned_this.append(br[\"index\"])\n\n            # ---- Prioritised depth allocation ----\n            active_brs = [\n                br for br in branches\n                if not br[\"abandoned\"] and not br[\"finished\"]\n            ]\n            active_brs_sorted = sorted(active_brs, key=lambda b: -b[\"probe_count\"])\n\n            probed_this: int = 0\n            for br in active_brs_sorted:\n                tier = self._classify_branch(br, pool_winner, warm_enough)\n                n_steps = self.burst_aligned if tier == \"aligned\" else 1\n                self._probe_branch(question, br, completed_answers, n_steps)\n                probed_this += n_steps\n\n            pool_winner, top1, top2, pool_conf = self._pool_stats(completed_answers)\n            n_complete = len(completed_answers)\n\n            ema_conf = self._update_ema(ema_conf, pool_conf)\n            if ema_history:\n                ema_history[-1] = ema_conf\n            if len(ema_history) >= 2:\n                ema_delta = ema_history[-1] - ema_history[0]\n            else:\n                ema_delta = 0.0\n\n            n_active = sum(\n                1 for br in branches if not br[\"abandoned\"] and not br[\"finished\"]\n            )\n\n            self._trace_step(\n                event=\"forward\",\n                goal=\"probe with priority scheduling + update EMA\",\n                step_input={\n                    \"outer_step\":  outer_step,\n                    \"pool_winner\": pool_winner,\n                    \"pool_conf\":   round(pool_conf, 4),\n                },\n                step_output={\n                    \"n_complete\":    n_complete,\n                    \"n_active\":      n_active,\n                    \"probed_this\":   probed_this,\n                    \"ema_conf\":      round(ema_conf, 4),\n                    \"ema_delta\":     round(ema_delta, 4),\n                    \"abandoned_now\": abandoned_this,\n                },\n            
    state={\"total_spawned\": total_spawned},\n                decision=\"evaluate momentum gate and widening\",\n            )\n\n            # ---- EMA momentum stopping gate ----\n            gate_eligible = (\n                warm_enough\n                and n_complete >= self.min_complete\n            )\n            gate_fires = (\n                gate_eligible\n                and ema_conf >= self.conf_thresh\n                and ema_delta >= -self.delta_slack\n            )\n\n            self._trace_step(\n                event=\"terminate_check\",\n                goal=\"EMA momentum gate evaluation\",\n                step_input={\n                    \"outer_step\":   outer_step,\n                    \"conf_thresh\":  self.conf_thresh,\n                    \"delta_slack\":  self.delta_slack,\n                    \"min_complete\": self.min_complete,\n                    \"warm_up\":      self.warm_up,\n                },\n                step_output={\n                    \"ema_conf\":      round(ema_conf, 4),\n                    \"ema_delta\":     round(ema_delta, 4),\n                    \"pool_conf\":     round(pool_conf, 4),\n                    \"n_complete\":    n_complete,\n                    \"gate_eligible\": gate_eligible,\n                    \"gate_fires\":    gate_fires,\n                },\n                state={\"total_spawned\": total_spawned},\n                decision=\"stop if EMA gate fires\",\n            )\n\n            if gate_fires:\n                self._trace_step(\n                    event=\"finish\",\n                    goal=\"return final answer\",\n                    step_input={\"outer_step\": outer_step},\n                    step_output={\n                        \"answer\":      pool_winner,\n                        \"stop_reason\": \"ema_momentum_gate\",\n                        \"ema_conf\":    round(ema_conf, 4),\n                        \"ema_delta\":   round(ema_delta, 4),\n                        
\"n_complete\":  n_complete,\n                    },\n                    state={\"total_spawned\": total_spawned},\n                    decision=\"EMA level high + momentum non-negative\",\n                )\n                return pool_winner\n\n            # ---- All branches resolved? ----\n            all_resolved = all(br[\"finished\"] or br[\"abandoned\"] for br in branches)\n            if all_resolved:\n                break\n\n            # ---- Confidence-trend widening ----\n            can_widen = (\n                total_spawned \u003C self.max_branch_use\n                and total_spawned \u003C self._MAX_BRANCH\n            )\n            trend_weak = ema_delta \u003C= self.trend_thresh\n            want_widen = (\n                can_widen\n                and trend_weak\n                and outer_step >= max(1, self.warm_up \u002F\u002F 2)\n                and ema_conf \u003C self.conf_thresh\n            )\n\n            spawned_now = 0\n            if want_widen:\n                for _ in range(self.widen_burst):\n                    if total_spawned >= self.max_branch_use:\n                        break\n                    if total_spawned >= self._MAX_BRANCH:\n                        break\n                    out = _safe_probe_new(question)\n                    if out is None:\n                        break\n                    ans, idx, is_finish = out\n                    total_spawned += 1\n                    spawned_now += 1\n                    br_new: Dict[str, Any] = {\n                        \"index\":           idx,\n                        \"latest_ans\":      ans,\n                        \"finished\":        is_finish,\n                        \"abandoned\":       False,\n                        \"probe_count\":     0,\n                        \"disagree_rounds\": 0,\n                    }\n                    branches.append(br_new)\n                    if is_finish:\n                        completed_answers.append(ans)\n\n   
         self._trace_step(\n                event=\"update_states\",\n                goal=\"confidence-trend widening snapshot\",\n                step_input={\n                    \"outer_step\":   outer_step,\n                    \"want_widen\":   want_widen,\n                    \"ema_conf\":     round(ema_conf, 4),\n                    \"ema_delta\":    round(ema_delta, 4),\n                    \"trend_thresh\": self.trend_thresh,\n                },\n                step_output={\n                    \"spawned_now\":   spawned_now,\n                    \"total_spawned\": total_spawned,\n                    \"all_resolved\":  all_resolved,\n                },\n                state={\"n_active\": n_active},\n                decision=\"continue main loop\",\n            )\n\n            outer_step += 1\n\n        # ---- Final answer ----\n        final_winner, _, _, final_conf = self._pool_stats(completed_answers)\n        if final_winner is None:\n            all_latest = [\n                br[\"latest_ans\"]\n                for br in branches\n                if not br[\"abandoned\"] and br[\"latest_ans\"] is not None\n            ]\n            final_winner = _majority_answer(all_latest)\n            final_conf = 0.0\n\n        self._trace_step(\n            event=\"finish\",\n            goal=\"return final answer\",\n            step_input={\"outer_step\": outer_step},\n            step_output={\n                \"answer\":        final_winner,\n                \"stop_reason\":   \"loop_end\",\n                \"ema_conf\":      round(ema_conf, 4),\n                \"pool_conf\":     round(final_conf, 4),\n                \"n_complete\":    len(completed_answers),\n                \"total_spawned\": total_spawned,\n            },\n            state={\"total_spawned\": total_spawned},\n            decision=\"majority of completed answers at loop end\",\n        )\n        return final_winner\n```\n\nThe same source also lives in 
[`efficient_reasoning_controller\u002Fworkspace\u002Fcode_base\u002Fmethod.py`](efficient_reasoning_controller\u002Fworkspace\u002Fcode_base\u002Fmethod.py).\n\n\u003C\u002Fdetails>\n\n---\n\n## Repository structure\n\n```text\nAutoTTS\u002F\n└── efficient_reasoning_controller\u002F\n    ├── eval\u002F                         # evaluation\n    ├── logs\u002Fsearch_history\u002F          # Archived discovery rounds (optional method.py sources)\n    ├── workspace\u002F\n    │   ├── code_base\u002F\n    │   │   ├── data_loader.py          # Replay environment (Question \u002F Branch \u002F ModelandTask)\n    │   │   ├── method.py               # Active controller implementations\n    │   │   ├── method.template.py      # Template that method.py is reset from each round\n    │   │   ├── eval.py                 # Main evaluation entry point (matrix sweep)\n    │   │   ├── evaluator.py            # Helper evaluation APIs\n    │   │   ├── controller_api.py       # Controller base interface\n    │   │   ├── trace_schema.py         # Per-step \u002F per-problem trace schema\n    │   │   ├── environment\u002F            # Search-set replay data (per model)\n    │   │   └── history\u002F                # Seed baseline results + archived search rounds\n    │   └── controller_search\u002F\n    │       ├── run_workflow.sh         # Launch the multi-round controller search\n    │       ├── workflow_propose_critic.py\n    │       ├── claude_proposer.py\n    │       ├── codex_proposer.py\n    │       └── prompts\u002F\n    └── test_environment\u002F               # Held-out replay data (do not expose to proposer)\n```\n\n---\n\n## Install\n\nDepending on [how you reproduce results](#reproduction):\n\n- **Evaluate our controllers only** — create the [Conda environment](#conda-environment) and install `numpy`, `pandas`, `tqdm` (see below). 
No Node.js, Claude CLI, or API keys are required for replay evaluation.\n- **Run discovery yourself** — complete all subsections: Conda, [Claude environment setup](#claude-environment-setup), and [API environment setup](#api-environment-setup).\n\n### Conda environment\n\n```bash\nconda create -n autotts python=3.12 -y\nconda activate autotts\n```\n\n### Claude environment setup\n\n```bash\ncurl -o- https:\u002F\u002Fraw.githubusercontent.com\u002Fnvm-sh\u002Fnvm\u002Fv0.39.7\u002Finstall.sh | bash\n\nsource ~\u002F.bashrc\n\nnvm install 21\n\nnpm install -g @anthropic-ai\u002Fclaude-code\n\npip install claude-agent-sdk==0.1.58\n\npip install numpy pandas tqdm\n```\n\n### API environment setup\n\n```bash\ncat >> ~\u002F.bashrc \u003C\u003C'EOF'\nexport OPENROUTER_API_KEY=\"your_openrouter_api_key\"\n\nexport ANTHROPIC_BASE_URL=\"https:\u002F\u002Fopenrouter.ai\u002Fapi\"\nexport ANTHROPIC_AUTH_TOKEN=\"$OPENROUTER_API_KEY\"\nexport ANTHROPIC_API_KEY=\"\"\n\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"anthropic\u002Fclaude-sonnet-4.6\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"anthropic\u002Fclaude-opus-4.6\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"anthropic\u002Fclaude-haiku-4.5\"\nexport CLAUDE_CODE_SUBAGENT_MODEL=\"anthropic\u002Fclaude-opus-4.6\"\nexport CLAUDE_CODE_SKIP_FAST_MODE_ORG_CHECK=1\nEOF\n\nsource ~\u002F.bashrc\n```\n\n---\n\n## Reproduction\n\nThere are **two supported workflows**:\n\n| | **Goal** | **Needs API \u002F Claude tooling?** |\n|---|-----------|-----------------------------------|\n| **Way A** | Evaluate **released or archived** TTS controller programs (`method.py`) on our replay splits | No — replay-only |\n| **Way B** | **Run controller discovery yourself** (multi-round propose → critic → eval) | Yes — follow full [Install](#install) |\n\nComplete [Install](#install) before Way B. 
Way A only requires the Conda setup and `numpy` \u002F `pandas` \u002F `tqdm`.\n\n### Way A — Evaluate our programs (`eval\u002F`)\n\nUse this when you want tables and traces on the bundled replay data **without** launching search.\n\n1. **Controller code.** The repo ships a working [`efficient_reasoning_controller\u002Feval\u002Fmethod.py`](efficient_reasoning_controller\u002Feval\u002Fmethod.py). To evaluate a **specific snapshot** from our search logs, copy it over that file, e.g. from [`logs\u002Fsearch_history\u002F\u003Crun>\u002Fcode_base\u002Fmethod.py`](efficient_reasoning_controller\u002Flogs\u002Fsearch_history) (paths may vary by release layout).\n2. **Configure sweeps.** Edit models, datasets, and method lists at the top of [`eval\u002Feval.py`](efficient_reasoning_controller\u002Feval\u002Feval.py).\n3. **Run evaluation** from the repository root (or use `cd AutoTTS\u002Fefficient_reasoning_controller` if you are one level above the checkout):\n\n```bash\ncd efficient_reasoning_controller\npython eval\u002Feval.py\n```\n\n4. **Outputs** land under **`eval\u002Ftest_results\u002F`**, e.g. `eval\u002Ftest_results\u002Fmatrix_results_\u003CMODEL>\u002F` with `\u003CDATASET>_raw_new_api.csv` and `\u003CDATASET>_trace_new_api.jsonl`.\n\nDiscovery evaluation inside the research codebase uses the same logic under [`workspace\u002Fcode_base\u002Feval.py`](efficient_reasoning_controller\u002Fworkspace\u002Fcode_base\u002Feval.py); it writes to `code_base\u002Ftraining_results\u002F` instead. Use **`eval\u002F`** for the standalone “evaluate what we ship” layout.\n\n### Way B — Run discovery yourself (`workspace\u002F`)\n\nUse this to reproduce or extend the **automated search loop** (costs LLM calls; evaluation steps remain replay-only).\n\n1. **Environment.** Finish [Install](#install) (Conda + nvm\u002FNode + `claude-agent-sdk` + API exports). Authenticate the Claude Code CLI (`claude login`) as needed.\n2. 
**Set up history.** Download the full history from Hugging Face (the execution traces are too large to ship in the repository):\n\n```bash\nhuggingface-cli download AutoTTS\u002Fhistory --local-dir .\u002Fhistory\ncp -r .\u002Fhistory efficient_reasoning_controller\u002Fworkspace\u002Fcode_base\u002F   # replace the history directory with the full history\n```\n\n3. **Launch the workflow:**\n\n```bash\ncd efficient_reasoning_controller\u002Fworkspace\nbash controller_search\u002Frun_workflow.sh\n```\n\n4. **Optional tuning** via environment variables (defaults in [`run_workflow.sh`](efficient_reasoning_controller\u002Fworkspace\u002Fcontroller_search\u002Frun_workflow.sh)):\n\n```bash\nexport WORKFLOW_PROPOSER_BACKEND=claude   # claude or codex\nexport WORKFLOW_ROUNDS=5\nexport WORKFLOW_EVAL_CMD=\"python code_base\u002Feval.py\"\nexport WORKFLOW_RESUME=1\n```\n\nEach round writes a snapshot under:\n\n```text\ncode_base\u002Fhistory\u002FrNNNN_\u003Ctimestamp>_\u003Cuid>\u002F\n├── method.py                 # OptimalController produced this round\n└── proposal_results\u002F         # CSVs + trace JSONL for this round\n```\n\n`code_base\u002Fmethod.py` is reset from `code_base\u002Fmethod.template.py` at the start of every round; each candidate must be self-contained in `method.py`.\n\n**Evaluation during search.** `WORKFLOW_EVAL_CMD` defaults to `python code_base\u002Feval.py`; matrices appear under `code_base\u002Ftraining_results\u002F`:\n\n```text\ncode_base\u002Ftraining_results\u002F\n└── matrix_results_\u003CMODEL>\u002F\n    ├── \u003CDATASET>_raw_new_api.csv\n    └── \u003CDATASET>_trace_new_api.jsonl\n```\n\n**After discovery.** Copy any round’s `method.py` into [`eval\u002Fmethod.py`](efficient_reasoning_controller\u002Feval\u002Fmethod.py) and follow **Way A** for a standalone rerun under `eval\u002Ftest_results\u002F`.\n\n---\n\n## Built-in baselines\n\n`code_base\u002Fmethod.py` ships:\n\n- `ASCMethod` — adaptive self-consistency with Beta-confidence early stopping.\n- `ESCMethod` — early stopping by 
sliding-window answer consistency.\n- `Parallel_Probe` — parallel chains with warm-up, off-track pruning, and stable-majority termination.\n- `OptimalController` — the target class rewritten by the search workflow (e.g. CMC).\n\nPre-computed seed baseline results are stored under:\n\n```text\nefficient_reasoning_controller\u002Fworkspace\u002Fcode_base\u002Fhistory\u002Fseed_algorithms\u002F\n```\n\n---\n\n## Citation\n\n```bibtex\n@article{zheng2026llms,\n  title={LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling},\n  author={Zheng, Tong and Liu, Haolin and Huang, Chengsong and Bao, Huiwen and Zhang, Sheng and Liu, Rui and Dai, Runpeng and Chen, Ruibo and Liu, Chenxi and Xiong, Tianyi and others},\n  journal={arXiv preprint arXiv:2605.08083},\n  year={2026}\n}\n\n@article{zheng2026parallel,\n  title={Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing},\n  author={Zheng, Tong and Huang, Chengsong and Dai, Runpeng and He, Yun and Liu, Rui and Ni, Xin and Bao, Huiwen and Wang, Kaishen and Zhu, Hongtu and Huang, Jiaxin and others},\n  journal={arXiv preprint arXiv:2602.03845},\n  year={2026}\n}\n```\n","AutoTTS is an automated search system for improving language model performance at test time. It builds an offline replay environment in which a coding agent iteratively proposes and refines code-defined controllers, turning traditional hand-designed strategies into an environment-driven automated search process. Its technical features include no gradient updates, operating purely through code edits, and requiring no large language model calls when evaluating candidates during discovery, which greatly reduces cost. AutoTTS suits scenarios that need to improve language model inference efficiency while maintaining accuracy, such as large-scale text generation or analysis tasks, where it can significantly reduce compute consumption and shorten processing time.",2,"2026-06-11 03:52:58","CREATED_QUERY"]