[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82832":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":14,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":12,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},82832,"nanoRL","ethanhe42\u002FnanoRL","ethanhe42",null,"Python",117,6,1,0,12,26,2.54,"MIT License",false,"main",[],"2026-06-12 02:04:28","# nanoRL\n\nMinimal, single-file implementations of the four most common ways to fine-tune a language model after pretraining:\n\n- **SFT** — supervised next-token prediction on demonstrations\n- **DPO** — direct preference optimization on (chosen, rejected) pairs\n- **GRPO** — group relative policy optimization (PPO without a critic, used in DeepSeek-R1)\n- **PPO** — proximal policy optimization with a separate critic transformer (the InstructGPT setup)\n\nEach file is ~100-180 lines, self-contained, and converges on a toy arithmetic task in 30 steps on a single GPU (or an M-series Mac via MPS). The goal is to make these algorithms readable end-to-end, in the spirit of [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT). 
Three companion files then scale GRPO from the toy task to [GSM8K](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgrade-school-math) — including the result of an autonomous *autoresearch* loop that ran 82 tuning experiments overnight ([more below](#scaling-up-grpo-on-gsm8k)).\n\n## Why four files?\n\nThese four algorithms occupy different points on the same axis: **what kind of supervision do you have?**\n\n| | Supervision needed | Models loaded | Loss |\n|---|---|---|---|\n| [SFT](minimal_sft.py)   | Demonstrations `(prompt, target)` | 1 (policy) | Cross-entropy on target tokens |\n| [DPO](minimal_dpo.py)   | Preferences `(prompt, chosen, rejected)` | 2 (policy, ref) | `−log σ(β · (Δ log ratio))` |\n| [GRPO](minimal_grpo.py) | Reward function `R(completion)` | 2 (policy, ref) | Clipped surrogate, group-mean baseline |\n| [PPO](minimal_ppo.py)   | Reward function `R(completion)` | 3 (policy, ref, critic) | Clipped surrogate, learned V baseline + GAE |\n\nReading them in this order shows how each algorithm adds machinery to handle weaker supervision: SFT needs full demonstrations, DPO needs only preferences, GRPO\u002FPPO need only a reward signal.\n\n## Install\n\n```bash\nuv sync\n```\n\n(or `pip install -e .` if you don't have [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) — it's a fast Python package manager and recommended.)\n\n## Run\n\nEach file is independent. Just run it:\n\n```bash\nuv run minimal_sft.py\nuv run minimal_dpo.py\nuv run minimal_grpo.py\nuv run minimal_ppo.py\n```\n\nThe toy task is binary-reward 1-digit arithmetic: given a prompt like *\"What is 3 + 8?\"*, the model should output `\u003Canswer>11\u003C\u002Fanswer>`. Each script:\n1. Loads `Qwen\u002FQwen2.5-0.5B-Instruct` (or one + a copy for ref\u002Fcritic).\n2. Trains for 30 steps on the toy dataset.\n3. 
Prints loss\u002Freward\u002Fgrad-norm per step.\n\nAll four converge to the correct format\u002Fanswer within 30 steps.\n\n### Expected output\n\nSampling is unseeded, so exact numbers vary between runs, but the shape is consistent (Qwen2.5-0.5B-Instruct on an M-series Mac):\n\n**SFT** — the demonstration's NLL collapses to ~0 within a couple of steps:\n\n```\n---------- step 0 ----------\nprompt: What is 5 + 7?  completion: \u003Canswer>12\u003C\u002Fanswer>\nloss=0.02 seq_logp=-0.16 grad_norm=18.75\n---------- step 1 ----------\nprompt: What is 3 + 8?  completion: \u003Canswer>11\u003C\u002Fanswer>\nloss=0.00 seq_logp=-0.01 grad_norm=1.45\n...\n```\n\n**DPO** — `accuracy` is 1.0 from the start; the chosen log-prob holds while the rejected one is driven toward −∞ (the over-optimization quirk noted below):\n\n```\n---------- step 0 ----------\nprompt: What is 3 + 8?  chosen: \u003Canswer>11\u003C\u002Fanswer>  rejected: \u003Canswer>10\u003C\u002Fanswer>\nloss=0.69 chosen_logp=-0.29 rejected_logp=-8.54 reward_gap=0.01 accuracy=1.00 grad_norm=73.00\n...\n---------- step 29 ----------\nprompt: What is 1 + 1?  chosen: \u003Canswer>2\u003C\u002Fanswer>  rejected: \u003Canswer>3\u003C\u002Fanswer>\nloss=0.04 chosen_logp=-0.01 rejected_logp=-41.00 reward_gap=3.25 accuracy=1.00 grad_norm=3.98\n```\n\n**GRPO** — group reward climbs to 1.0 as the rollouts settle on the `\u003Canswer>…\u003C\u002Fanswer>` format:\n\n```\n---------- step 0 ----------\nwrite answer inside \u003Canswer>\u003C\u002Fanswer>\nWhat is 1 + 1? The answer to 1 + 1 is two. ... \u003Canswer>two\nloss=0.00 kl=-0.08 reward=0.00 ratio=0.95 clipped_frac=0.28 advantage_abs=0.00 grad_norm=1.01\n...\n---------- step 29 ----------\nWhat is 1 + 1? 
\u003Canswer>2\u003C\u002Fanswer>\nloss=0.00 kl=0.35 reward=1.00 ratio=0.99 clipped_frac=0.00 advantage_abs=0.00 grad_norm=0.03\n```\n\n**PPO** — noisier at batch size 1 (the critic learns V(s) from scratch), but reward reaches 1.0 on most steps:\n\n```\n---------- step 0 ----------\nwrite answer inside \u003Canswer>\u003C\u002Fanswer>\nWhat is 4 + 4? 8 ... \u003Canswer>8\u003C\u002Fanswer>\npolicy_loss=-0.19 value_loss=0.49 reward=1.00 ratio=1.02 clipped_frac=0.06 advantage_abs=0.18 grad_norm=43.00\n...\n---------- step 20 ----------\nWhat is 4 + 4? \u003Canswer>8\u003C\u002Fanswer> ...\npolicy_loss=-0.25 value_loss=0.30 reward=1.00 ratio=1.00 clipped_frac=0.00 advantage_abs=0.25 grad_norm=41.00\n```\n\n## Algorithmic walk-through\n\n### SFT — the baseline\n\nThe simplest fine-tuning loss: maximize the log probability of the demonstration completion under the policy. In practice this is masked cross-entropy on the completion tokens.\n\n```python\n# token_logps (illustrative name): per-token log π_θ of the target given the prompt, shape (batch, seq)\nloss = -(token_logps * completion_mask).sum() \u002F completion_mask.sum()\n```\n\nNo reference model, no reward, no preferences. You need full demonstrations (someone has to *write* the desired output). The other three algorithms exist precisely because demonstrations are expensive and we'd rather use weaker signals.\n\n### DPO — learning from preferences\n\nGiven pairs of responses where a human (or a heuristic) says \"A is better than B\", DPO trains the policy to assign higher probability to A than B, *relative to a frozen reference model*. The loss is:\n\n```\nL = −log σ( β · ( log π_θ(y_w|x)\u002Fπ_ref(y_w|x) − log π_θ(y_l|x)\u002Fπ_ref(y_l|x) ) )\n```\n\nThis is mathematically equivalent to fitting a Bradley-Terry reward model and then doing KL-regularized RL against it — but without ever instantiating the reward model. One forward pass per example through each of two models; no rollouts. 
The simplest entry point into \"learning from feedback.\"\n\n**Caveat:** the loss approaches its lower bound of 0 but never reaches it, so the gradient keeps pushing rejected log-probs toward −∞ even when chosen is already saturated. The strongly negative `rejected_logp` you'll see in the run (around −40 by the last step) is real and a known DPO over-optimization quirk. IPO and KTO are follow-up algorithms that aim to mitigate it.\n\n### GRPO — RL without a critic\n\nGRPO is what DeepSeek-R1 used. For each prompt, generate a *group* of G rollouts, score them with a (verifiable, rule-based, or learned) reward function, and compute the advantage as the group-relative reward:\n\n```\nadvantage_i = (R_i − mean(R)) \u002F std(R)    # within each group of G rollouts\n```\n\nThat's it for the baseline — no value model, no GAE. The advantage is broadcast to every token of trajectory i, then plugged into PPO's clipped surrogate. GRPO is essentially \"PPO with the value model replaced by a group mean.\"\n\nThis works beautifully when:\n- Rewards are clean (verifiable: math, code) — group variance is small relative to between-prompt variance.\n- You can afford ≥4 rollouts per prompt.\n\nIt struggles when all rollouts in a group agree (variance → 0, so every advantage is zero and there is no learning signal; implementations typically add a small ε to `std(R)` to avoid dividing by zero) and when reward is noisy\u002Fcontinuous (a learned V baseline would help).\n\n### PPO — the canonical RLHF setup\n\nThe big-machine version. 
Four models in production (here we omit the reward model since the toy uses a hard-coded `reward_fn`):\n\n- **Policy** — what we're training.\n- **Reference** — frozen pre-RL snapshot, used for KL regularization (omitted in our minimal version since `β=0`).\n- **Critic** — a *separate transformer* that learns V(s) end-to-end, initialized fresh here (in production, initialize from the reward model).\n- **Reward model** — produces the terminal reward; we use ground-truth `pred == answer` instead.\n\nThe advantage uses GAE (Generalized Advantage Estimation, λ=0.95):\n\n```\nδ_t = r_t + γ V(s_{t+1}) − V(s_t)\nA_t = δ_t + γλ A_{t+1}    # backward recursion\n```\n\nGAE is a tunable bias-variance dial: λ=1 collapses to Monte Carlo (unbiased, high variance), λ=0 is one-step TD (biased, low variance). 0.95 is the empirical default from Schulman 2016.\n\n**Trade-offs vs GRPO:** PPO uses a per-state V baseline that accumulates knowledge across batches — useful for noisy rewards or sparse signals. The cost is an extra full transformer in memory and a more fragile training loop (the critic can over-fit or interfere with the policy). For toy verifiable-reward tasks, GRPO is the right tool. For production RLHF with continuous reward models, PPO is the historical workhorse.\n\n## Hyperparameters (where they're load-bearing)\n\nAll four files use:\n- `Qwen\u002FQwen2.5-0.5B-Instruct` (smallest practical Qwen; can swap for `google\u002Fgemma-3-270m-it` for even smaller).\n- `max_seqlen = 64`, `max_new_tokens = 32`.\n- `batch_size = 1` (one prompt per step).\n- 30 outer steps.\n\nAlgorithm-specific knobs that matter:\n\n| | knob | value | why |\n|---|---|---|---|\n| SFT  | lr | 5e-6 | Direct supervision is forgiving; bigger lrs work too. |\n| DPO  | β | 0.1 | KL strength on the implicit reward. Higher = sharper preference. |\n| GRPO | G (rollouts) | 4 | Need ≥2 for within-group variance; 4 is the typical default. |\n| GRPO | β (KL) | 0.04 | Mild pull toward reference. 
|\n| PPO  | lr (policy) | 1e-6 | LLM softmax-over-vocab amplifies small weight changes; must stay small. |\n| PPO  | grad_clip | 1.0 | Tight clip is load-bearing — first-step gradients can explode without it. |\n| PPO  | V init | zeros | V = 0 means failures produce zero gradient (no destructive drift at batch=1). |\n| PPO  | γ, λ | 1.0, 0.95 | Standard GAE settings. |\n\nThe trickiest one is PPO at small batch size — see the design notes inside [minimal_ppo.py](minimal_ppo.py) for why each choice is what it is.\n\n## Scaling up: GRPO on GSM8K\n\nThe toy task fits in 64 tokens. [GSM8K](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgrade-school-math) is the real thing — grade-school math word problems with free-form chain-of-thought, scored by a verifiable final-answer reward. Three files apply the same GRPO machinery to it with `Qwen\u002FQwen2.5-0.5B-Instruct`:\n\n- **[`gsm8k_grpo.py`](gsm8k_grpo.py)** — the textbook setup: reference model + KL penalty + PPO clip, with batched generation and eval. Evaluates the base model, trains, evaluates again. The direct scale-up of [`minimal_grpo.py`](minimal_grpo.py).\n- **[`gsm8k_sft_grpo.py`](gsm8k_sft_grpo.py)** — the standard RLVR pipeline: a short SFT warm-up on gold solutions, *then* GRPO, with three eval checkpoints (base → after SFT → after GRPO) so you can see what each phase contributes.\n- **[`gsm8k_grpo_autoresearch.py`](gsm8k_grpo_autoresearch.py)** — the output of the autoresearch loop below.\n\n```bash\nuv run gsm8k_grpo.py\nuv run gsm8k_sft_grpo.py\n```\n\nFirst run downloads the GSM8K dataset (~10MB) and the 0.5B model (~1GB); training + eval takes minutes, not the seconds the toy scripts take.\n\n### Autoresearch: what 82 experiments converged to\n\n[`gsm8k_grpo_autoresearch.py`](gsm8k_grpo_autoresearch.py) is the result of an overnight experiment. 
An autonomous agent, following the protocol in [`program.md`](program.md) (inspired by [karpathy\u002Fautoresearch](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fautoresearch)), edited the GRPO file one change at a time, ran a fixed 20-minute training budget, evaluated `val_acc` on `TEST[:200]`, then kept or reverted the change — and repeated, unattended. 82 experiments; the full log is in [`results.tsv`](results.tsv).\n\n![GSM8K GRPO autoresearch progression: val_acc over 82 experiments](autoresearch.png)\n\nTwo things came out of it:\n\n1. **The noise floor is wider than it looks.** The best run hit 0.39 against a ~0.34 baseline, but re-running the *same* config gave 0.32 and 0.35 — the real spread at N=200 is about ±7pp, not the ±2pp a sampling estimate suggests. Most apparent wins (cosine LR decay, temperature annealing, freezing embeddings) sat inside that band and were noise.\n2. **The durable result was deleting code, not adding it.** The changes that survived all *removed* machinery without hurting accuracy: drop the reference model and KL term (β=0), drop the PPO clip (at minibatch=1 the importance ratio is identically 1, so the clip never fires), drop temperature annealing. 
What's left is plain REINFORCE with a group-relative baseline — simpler than the textbook `gsm8k_grpo.py`, and statistically tied with it.\n\nThat second point is the nanoRL thesis in miniature: for a clean verifiable-reward task at this scale, most of the PPO\u002FGRPO apparatus is optional.\n\n## What's *not* in here (intentionally)\n\nThese minimal files skip a lot of machinery you'd find in production RLHF stacks like [TRL](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl), [OpenRLHF](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF), [veRL](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl), or [DeepSpeed-Chat](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeedExamples\u002Ftree\u002Fmaster\u002Fapplications\u002FDeepSpeed-Chat):\n\n- **No reward model.** The toy uses a hard-coded `reward_fn`. In real RLHF, the reward model is a separately trained transformer.\n- **No per-token KL penalty** folded into the reward stream. PPO production setups subtract `β · KL_t` at every token; we use `β=0`.\n- **No distributed training.** One process, one device.\n- **No vLLM rollouts.** Generation is sequential `torch.multinomial`, slow but readable.\n- **No advantage whitening (in PPO).** Production PPO normalizes advantages across the batch; we skip it because raw `R − V` is fine at our scale.\n- **No reward model–based critic init.** PPO's critic starts from a fresh Qwen instead of a trained reward model, so it has to learn V from scratch (slow but works for the toy).\n\nEach of these omissions is documented inline where it matters. Adding them back is a tractable exercise.\n\n## Acknowledgements\n\nThe structure and didactic ambition are inspired by [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT) (Karpathy). The RL formulations follow [Schulman et al. 2017](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347) (PPO), [Schulman et al. 2016](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.02438) (GAE), [Rafailov et al. 
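2023](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18290) (DPO), and [Shao et al. 2024](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03300) (GRPO).\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","nanoRL is a minimal collection of single-file implementations of four common language-model fine-tuning methods: supervised next-token prediction (SFT), direct preference optimization (DPO), group relative policy optimization (GRPO), and proximal policy optimization (PPO). Each algorithm lives in a standalone file of roughly 100-180 lines and converges within 30 steps on a simple arithmetic task, on a single GPU or on an M-series Mac via MPS. The algorithms differ in the kind of supervision they need, from SFT, which requires full demonstrations, to PPO, which needs only a reward signal. The project is especially suited to researchers and developers who want to understand these fine-tuning techniques and how each handles a different level of supervision.",2,"2026-06-11 04:09:21","CREATED_QUERY"]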