[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80158":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":15,"compositeScore":13,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":10,"pushedAt":10,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},80158,"RL-Projects","poetrywanderer\u002FRL-Projects","poetrywanderer","This include experimental RL Projects on LLM, VLM & Generative tasks","",null,"Python",102,0,52,1,39,false,"main",[],"2026-06-12 02:03:58","# On-Policy Post-Training Across Modalities: RL & Distillation\n\n**[中文版](README_CN.md)**\n\nThree projects organized by task modality, systematically exploring on-policy post-training methods (RL + distillation) and their effectiveness boundaries.\n\n---\n\n## Overview\n\n| | Text-RL | Multimodal-PostTrain | Diffusion-RL |\n|---|---|---|---|\n| **Directory** | [R1-RL\u002F](R1-RL\u002F) | [Geo3K-VL-RL\u002F](Geo3K-VL-RL\u002F) + [Geo3K-VL-OPD\u002F](Geo3K-VL-OPD\u002F) | [Diffusion-Flow-RL\u002F](Diffusion-Flow-RL\u002F) |\n| **Modality** | text → text | image+text → text | text → image |\n| **Model** | Qwen2.5-3B | Qwen2.5-VL-7B | SD3.5-Medium (2.24B) |\n| **Task** | Countdown (arithmetic) | Geometry3K (geometry reasoning) | OCR text rendering |\n| **Methods** | GRPO | GRPO → OPD → **OPD→GRPO two-phase** | Flow-GRPO |\n| **Key Result** | 0.2% → **73%** | +**16.3 pp** (two-phase optimal) | 11% → **89%** |\n| **Hardware** | 4×L40S, 6h | 4-8×L40S, ~20h all experiments | 4×L40S, 1.5h |\n\n---\n\n## Project Details\n\n### Text-RL: Text 
Reasoning → [R1-RL\u002F](R1-RL\u002F)\n\n**Reproducing R1-Zero's emergent reasoning on Countdown (3B scale)**, with systematic cross-task transfer evaluation:\n\n| Benchmark | Base | After RL (step 350) | Δ |\n|-----------|------|---------------------|---|\n| Countdown val acc | 0.2% | **73.2%** | +295× |\n| GSM8K (zero-shot) | 70.0% | **77.6%** | +7.6 pp |\n| MATH-500 (zero-shot) | 33.4% | **42.4%** | +9.0 pp |\n| BBH multistep arithmetic | 10.4% | **23.2%** | +12.8 pp |\n| BBH causal judgement | 54.5% | 24.1% | **−30.5 pp** |\n\nKey findings:\n1. **Scale threshold between 1.5B and 3B** — 1.5B collapses to short templates (no emergence); 3B shows classic U-shaped length curve\n2. **Transfer is task-structure dependent** — numeric planning tasks gain (+12.8 pp); language judgment tasks suffer (−30.5 pp)\n3. **RL trains search planning** — decompose → compute → verify → retry; this algorithmic skill generalizes within its structural domain\n4. **Negative transfer is real** — the same RL step gives +12.8 pp on arithmetic and −30.5 pp on causal stories\n\n### Multimodal-PostTrain: Multimodal Geometry → [Geo3K-VL-RL\u002F](Geo3K-VL-RL\u002F) + [Geo3K-VL-OPD\u002F](Geo3K-VL-OPD\u002F)\n\n**Systematic comparison of three post-training paradigms on the same task (Geometry3K)**, validating the OPD→RL two-phase hypothesis:\n\n| Phase | Method | Accuracy | Notes |\n|-------|--------|----------|-------|\n| Baseline | No training | 37.8% | Qwen2.5-VL-7B-Instruct |\n| Exp A | GRPO only (120 steps) | 47.8% | Sparse reward, 76K seqs |\n| Exp B | OPD only (600 steps) | ~50.0% | Dense teacher KL, 6K seqs |\n| **Exp C** | **OPD→GRPO (600+786 steps)** | **54.2%** | **Two-phase optimal** |\n\nKey findings:\n1. \"Format fast, accuracy slow\" — GRPO's dual-timescale learning\n2. OPD is 16× more data-efficient — dense token-level signal eliminates credit assignment\n3. RL teaches \"visual grounding\" — model learns to read diagram values before computing\n4. 
**Two-phase breaks both ceilings** — OPD aligns to teacher level, RL explores beyond\n\n### Diffusion-RL: Image Generation → [Diffusion-Flow-RL\u002F](Diffusion-Flow-RL\u002F)\n\n**Flow-GRPO on SD3.5-Medium for OCR text rendering**, comparing verifiable vs learned rewards under identical setup:\n\n| Experiment | Reward Type | Result | Training Time |\n|------------|-------------|--------|---------------|\n| PickScore run | Learned (CLIP-based) | +2.3% | 11 hours |\n| **OCR run** | **Verifiable (EasyOCR + Levenshtein)** | **11%→89%** | **1.5 hours** |\n| Sliding window ablation | Verifiable (same) | Peak **91.3%** | 2.5× more epochs |\n\n| Signal Metric | PickScore (learned) | OCR (verifiable) | Ratio |\n|---------------|---------------------|-------------------|-------|\n| Per-group reward std | 0.016 | 0.130 | **8.1×** |\n| Mean advantage | 0.19 | 0.56 | 2.9× |\n\nKey findings:\n1. **Intra-group reward variance predicts GRPO effectiveness** — check per-group std before deploying; low std → near-zero advantage → no learning\n2. **RL achieves implicit CFG distillation** — no-CFG after RL (89%) surpasses base+CFG (55%) while halving inference cost\n3. **LoRA capacity trade-off** — OCR +735% improvement \"borrows\" from aesthetics (−12%), not catastrophic forgetting but capacity competition\n4. **Group saturation is the terminal failure mode** — after epoch 100, most groups become all-correct → zero advantage → learning halts at ~89%\n5. **Sliding window (MixGRPO-lite) has limited benefit on 10-step models** — OCR signal concentrates in early denoising steps; fixed window [0,5] converges faster than full-range [0,9] within same epoch budget. MixGRPO's advantage is step-count-dependent (critical at 50 steps, marginal at 10)\n\n---\n\n## Cross-Project Key Takeaways\n\n### 1. 
Signal Density Is the First Principle of Training Efficiency\n\n> \"OPD beats GRPO with 1\u002F12 the data because every token carries gradient signal, rather than an entire sequence sharing one scalar.\"\n\n| Signal Type | Density | Example | Effect |\n|-------------|---------|---------|--------|\n| **Token-level teacher distribution** | Very high (~64 floats\u002Ftoken) | OPD (Multimodal) | 6,288 seqs → +12pp |\n| **Sequence-level verifiable reward** | Low (1 bit\u002Fseq) | GRPO (all three projects) | 76,800 seqs → +10pp |\n| **Sequence-level learned reward** | Very low (noisy) | PickScore (Diffusion) | 11h for only +2.3% |\n\n### 2. PG-OPD: RL Framework for Distillation Objective\n\n> \"Swap the reference model for a teacher model in the RL trainer — one-line change, matching Qwen3 results at 1\u002F10 the cost.\"\n\nSource: [Thinking Machines Lab](https:\u002F\u002Fthinkingmachines.ai\u002Fblog\u002Fon-policy-distillation\u002F) (2025), adopted by Qwen3\u002Fverl. Details in [Geo3K-VL-OPD\u002F](Geo3K-VL-OPD\u002F).\n\n### 3. OPD and RL Have Different Ceilings — Two-Phase Training Is Optimal\n\n> \"OPD is limited by what the teacher knows; RL is limited by what the model can perceive. Combining them breaks both.\"\n\n| Method | Ceiling | Geo3K Result |\n|--------|---------|-------------|\n| OPD only | Teacher capability (49.3%) | ~50% (matches teacher) |\n| GRPO only | Visual encoder perception | 47.8% (saturates at 120 steps) |\n| **OPD→GRPO** | Upper bound of both | **54.2%** (breaks both!) |\n\n### 4. GRPO Generalizes Across Discrete and Continuous Action Spaces\n\n> \"Group-relative advantage normalization without a critic transfers from language tokens to diffusion denoising trajectories.\"\n\n### 5. 
RL Reshapes How Models Use Information — It Doesn't Inject New Knowledge\n\n| Project | What RL teaches | What it can't teach |\n|---------|----------------|---------------------|\n| Text-RL | Search & verification discipline | New math knowledge |\n| Multimodal | Visual grounding (read diagram first) | What ViT can't perceive |\n| Diffusion | Text rendering (denoise toward readable) | New fonts or layouts |\n\n### 6. Training Window Strategy Must Match Task-Specific Timestep Sensitivity\n\n> \"Fixed window [0,5] beats sliding window within the same epoch budget on 10-step OCR — because text structure is decided in early denoising. The optimal training window is not 'all steps' but 'the steps that carry task-relevant information'.\"\n\n| Step count | Task-critical steps | Optimal window strategy |\n|---|---|---|\n| 10 steps (our SD3.5) | Steps 0–4 (structure) | Fixed early window — concentrate signal |\n| 50 steps (FLUX\u002FMixGRPO) | Steps 0–49 (all contribute) | Sliding window — cover all |\n\n---\n\n## RL Failure Modes Quick Reference (from Diffusion-RL)\n\n| Failure Mode | Manifestation | Mitigation |\n|--------------|---------------|------------|\n| Reward Hacking | proxy↑ gold↓ | Use verifiable reward |\n| Catastrophic Forgetting | other capabilities lost | LoRA (freeze base) |\n| Group Saturation | all-correct groups → zero advantage | Dynamic prompt scheduling |\n| Ratio Bias | importance ratio \u003C1 at low-noise steps | Ultra-conservative clip (1e-5) |\n\n---\n\n## Hardware\n\n4–8×NVIDIA L40S (48 GB each), single node. 
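Total training across all three projects: ~28 hours.\n\n## Repository Structure\n\n```\nRL-Projects\u002F\n├── README.md \u002F README_CN.md       # Overview (this file)\n├── R1-RL\u002F                         # Text-RL: text reasoning\n├── Geo3K-VL-RL\u002F                   # Multimodal-PostTrain: GRPO experiments\n├── Geo3K-VL-OPD\u002F                  # Multimodal-PostTrain: OPD + two-phase experiments\n└── Diffusion-Flow-RL\u002F             # Diffusion-RL: image generation\n```\n","This project explores on-policy post-training methods that use reinforcement learning (RL) and distillation across text, multimodal, and generative tasks, along with the boundaries of their effectiveness. Its core work covers tasks in different modalities, such as text reasoning, multimodal geometry reasoning, and OCR text rendering, where GRPO, OPD, and Flow-GRPO substantially improve model performance. The project shows how two-phase training (OPD first, then RL) achieves the best results, and it reports key findings such as a scale threshold, task-structure dependence, and the generalization of algorithmic skills. It is suited to researchers and developers who need to strengthen the task-specific capabilities of large language models or vision-language models.",2,"2026-06-11 03:59:28","CREATED_QUERY"]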