[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84125":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":8,"language":10,"languages":8,"totalLinesOfCode":8,"stars":11,"forks":12,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":16,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":8,"trendingCount":13,"starSnapshotCount":13,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},84125,"inference_aware_grpo_training","naomili0924\u002Finference_aware_grpo_training","naomili0924",null,"","Python",96,4,0,17,32,66,75.3,false,"main",true,[],"2026-06-12 04:01:42","# Inference-Aware Reinforcement Learning Training Framework\n\nLLMs are trained to produce correct outputs — but correctness alone doesn't determine serving cost. A model that generates verbose, unpredictable, or cache-unfriendly responses is expensive to deploy, even if its answers are right. **This project makes inference efficiency a first-class training objective.**\n\nThe key insight is that speculative decoding gives us a live, per-request signal of inference cost: the **draft token acceptance rate**. When a small draft model's predictions are frequently accepted by the larger target model, decoding is fast and cheap. When they are rejected, the target must do more work. 
By feeding this acceptance rate — alongside latency, token length, and KV memory — back into the GRPO training reward, the model learns to generate outputs that are simultaneously high-quality and inference-efficient.\n\nOn top of this, an **adaptive curriculum scheduler** uses KDE-based reward bucketing to detect which training samples the model finds hardest and automatically allocates more rollout iterations to them — improving sample efficiency across GRPO epochs.\n\n---\n\n## What this repo provides\n\n- A custom vLLM engine that exposes per-request speculative decoding telemetry\n- Drop-in replacement for `vllm.LLM` with `get_spec_decode_stats()` after `generate()`\n- A GRPO training loop with a composite reward: task correctness + spec accept rate − latency − token length − KV memory\n- GSM8K math training reaching **83% accuracy** (+15pp over a 68% baseline) using exact answer correctness as the task score\n- A `DatasetScheduler` with KDE valley-finding that buckets samples by EMA reward and assigns dynamic rollout counts per difficulty cluster\n\n---\n\n## Spec Decode Accept Rate — `playground\u002Fmain.py`\n\nTarget: `Qwen\u002FQwen2.5-1.5B-Instruct` · Draft: `Qwen\u002FQwen2.5-0.5B-Instruct` · 5 speculative tokens\n\n```\n=== Spec Decode Accept Rates ===\n\n[0]  accept_rate=100.0%  (95\u002F95 draft tokens accepted)\n  Output: The draft model is a model that can generate a sequence of tokens...\n\n[1]  accept_rate=93.3%   (42\u002F45 draft tokens accepted)\n  Output: It is a variant of the GRU algorithm, which is a type of recurrent neural net...\n\n[2]  accept_rate=100.0%  (50\u002F50 draft tokens accepted)\n  Output: The transformer model is a type of recurrent neural network (RNN)...\n\n[3]  accept_rate=33.3%   (5\u002F15 draft tokens accepted)\n  Output: Reinforcement learning is a type of machine learning that involves the use of feedback...\n\n[4]  accept_rate=88.0%   (66\u002F75 draft tokens accepted)\n  Output: This is a key part of the KV 
caching mechanism in the TensorFlow framework...\n```\n\nAccess the stats after `generate()`:\n\n```python\nfrom inference_aware_grpo_training import VLLM\nfrom vllm import SamplingParams\n\nllm = VLLM(\n    model=\"Qwen\u002FQwen2.5-1.5B-Instruct\",\n    speculative_config={\n        \"model\": \"Qwen\u002FQwen2.5-0.5B-Instruct\",\n        \"num_speculative_tokens\": 5,\n    },\n)\n\n# Placeholder prompts; substitute your own.\nprompts = [\"What is speculative decoding?\", \"Explain KV caching.\"]\n\noutputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))\nspec_stats = llm.get_spec_decode_stats()  # keyed by request_id\n\nfor output in outputs:\n    stats = spec_stats.get(output.request_id)\n    if stats is None:\n        # No telemetry recorded for this request; skip it rather than raise KeyError.\n        continue\n    print(f\"accept_rate={stats['accept_rate']:.1%}  ({stats['num_accepted']}\u002F{stats['num_draft']})\")\n```\n\n---\n\n## GRPO Training Loop — `playground\u002Ftrain_grpo.py`\n\n### Reward Function\n\n```\nreward = task_score\n         - α × latency_ms          (queued → last token)\n         - β × generated_tokens\n         - γ × kv_memory_mb        (estimated from model dims)\n         + δ × speculative_accept_rate\n         + ε × cache_reuse_ratio   (prefix cache hits \u002F prompt len)\n```\n\nDefault weights: `α=0.001, β=0.001, γ=0.01, δ=1.0, ε=0.5`\n\n### Training Run — 50 Steps\n\nTarget: `Qwen\u002FQwen2.5-1.5B-Instruct` · Draft: `Qwen\u002FQwen2.5-0.5B-Instruct`  \nBatch size: 4 prompts · G=4 rollouts per prompt · lr=1e-6\n\n```\nReward weights: alpha=0.001 (latency)  beta=0.001 (tokens)  gamma=0.01 (kv_mb)  delta=1.0 (accept_rate)  eps=0.5 (cache_reuse)\nStarting GRPO — 50 steps, batch=4, G=4\n\nstep    1 | loss=-0.2730 | reward=-2.346 | accept=0.234 | latency=3415ms | kv=3.92MB\nstep    2 | loss=-0.4511 | reward=-2.444 | accept=0.230 | latency=3507ms | kv=3.95MB\nstep    3 | loss=-0.3120 | reward=-2.169 | accept=0.257 | latency=3262ms | kv=3.86MB\nstep    4 | loss=-0.2744 | reward=-2.477 | accept=0.225 | latency=3535ms | kv=3.96MB\nstep    5 | loss=-0.2022 | reward=-2.492 | accept=0.225 | latency=3550ms | kv=3.95MB\n...\nstep   47 | loss=-0.4153 | reward=-2.451 | 
accept=0.229 | latency=3513ms | kv=3.93MB\nstep   48 | loss=-0.3515 | reward=-2.427 | accept=0.231 | latency=3491ms | kv=3.92MB\nstep   49 | loss=-0.2924 | reward=-2.527 | accept=0.222 | latency=3581ms | kv=3.95MB\nstep   50 | loss=-0.2614 | reward=-2.849 | accept=0.188 | latency=3869ms | kv=3.94MB\nTraining complete.\n```\n\n### Architecture\n\n- **HF model** loaded on CUDA for gradient updates (AdamW)\n- **vLLM engine** (separate GPU allocation) for fast rollout generation with spec decode\n- `sync_weights_to_vllm()` calls `model_executor.collective_rpc(\"reload_weights\", ...)` after every optimizer step to push updated weights into the live vLLM engine\n\nThe `task_score` defaults to `1.0` — plug in your own quality scorer via the `task_score_fn` argument to `compute_rewards()`.\n\n---\n\n## GSM8K Math Training — `playground\u002Ftrain_grpo_math.py`\n\nTrains on [openai\u002Fgsm8k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fgsm8k) (7,473 train \u002F 1,319 test problems).  \n`task_score` is replaced with **exact answer correctness** (1.0 correct \u002F 0.0 wrong).\n\n### Adaptive Curriculum with `DatasetScheduler`\n\nThe training loop uses a reward-based data scheduler (`inference_aware_grpo_training\u002Fdata_scheduler.py`) that groups samples into difficulty buckets and assigns more rollout iterations to harder samples each epoch.\n\n**Algorithm:**\n\n1. **EMA reward tracking** — each sample's per-step reward is smoothed with an exponential moving average (`ema_decay=0.7`) across epochs, giving a stable difficulty signal that updates as the model learns.\n\n2. **KDE valley-finding** — after each epoch, the EMA reward distribution is min-max normalized to `[0, 1]` (to handle negative rewards), then a Gaussian KDE is fit and local minima (valleys) in the density are found. These valley positions are back-transformed to the original reward scale as bucket boundaries. Up to `n_buckets=3` adaptive clusters are created this way.\n\n3. 
**Dynamic rollout steps** — each bucket's difficulty is measured by its normalized mean reward, and harder buckets (lower reward) receive proportionally more rollout iterations:\n   ```\n   rollout_steps = base_rollout_steps × (1 + α × (1 − normalized_mean_reward))\n   ```\n   With `base_rollout_steps=4` and `α=1.5`, the hardest bucket gets up to 10 rollouts while the easiest gets 4.\n\n4. **Gradient accumulation** — rollouts within a batch are accumulated one at a time (`loss \u002F rollout_n` per step) before a single optimizer update, keeping peak memory at `O(batch_size)` regardless of rollout count.\n\n5. **Cluster report** — printed after every epoch with per-bucket stats and health indicators (spread + balance):\n\n```\n╔══ Epoch 1 — Cluster Report ══════════════════════════════════════════════════════════════════╗\n│ global  mean=-2.290  std=1.244  min=-6.041  max=0.114\n│ boundaries: -3.956  |  -1.923\n│ bucket 0  [-5.99–-3.96]  n=   42 ( 8.4%)  rollout=9   ← hard\n│ bucket 1  [-3.96–-1.92]  n=  231 (46.2%)  rollout=7   ← medium\n│ bucket 2  [-1.92– 0.11]  n=  227 (45.4%)  rollout=5   ← easy\n╚══════════════════════════════════════════════════════════════════════════════════════════════╝\n```\n\n### Full Training Run Results\n\nTarget: `Qwen\u002FQwen2.5-1.5B-Instruct` · Draft: `Qwen\u002FQwen2.5-0.5B-Instruct`  \n500 training samples · batch=4 · lr=1e-6 · up to 20 epochs · early stop patience=3\n\n| Metric | Value |\n|--------|-------|\n| Baseline accuracy (GSM8K test, 100 samples) | 68% |\n| Best accuracy | **83%** (epoch 6) |\n| Improvement | **+15pp** |\n| Epochs until early stop | 9 (patience=3, no improvement after epoch 6) |\n| Spec accept rate (avg) | **60–70%** |\n\n> Math reasoning produces significantly higher spec accept rates than general text (60–70% vs 20–30%), because chain-of-thought arithmetic follows structured, predictable patterns that the 0.5B draft model can anticipate.\n\n**Epoch-by-epoch accuracy:**\n\n| Epoch | Accuracy | Notes 
|\n|-------|----------|-------|\n| — | 68% | baseline |\n| 1 | 72% | ★ new best |\n| 2 | 62% | no improvement 1\u002F3 |\n| 3 | 76% | ★ new best |\n| 4 | 66% | no improvement 1\u002F3 |\n| 5 | 76% | no improvement 2\u002F3 |\n| 6 | **83%** | ★ new best |\n| 7 | 65% | no improvement 1\u002F3 |\n| 8 | 71% | no improvement 2\u002F3 |\n| 9 | 60% | no improvement 3\u002F3 → early stop |\n\n### Bucketing Findings\n\n**Epoch 1** produced a genuinely multimodal reward distribution — the KDE found two valleys and split samples into three meaningful difficulty buckets (8% hard, 46% medium, 46% easy) with rollouts ranging from 5 to 9. This is the curriculum working as designed.\n\n**From epoch 2 onward**, the distribution converged to a single sharp peak (mean reward rising from -2.29 to -1.30 by epoch 6) with only ~1.5% of samples remaining as extreme outliers in a left tail. The KDE found no meaningful valley in the main mass, collapsing to a single real boundary. Effectively, 98.5% of samples received uniform rollout=6 for the remaining epochs.\n\n**Why the distribution collapsed fast:** the GSM8K task score is binary (1.0 correct, 0.0 wrong). Once the model learns the majority of problems, rewards concentrate: there is no intermediate partial-credit signal to keep distinct difficulty clusters apart.\n\n**Did bucketing help?** The curriculum differentiation was active for only the first epoch. The +15pp peak gain (reached at epoch 6 of the 9 epochs run) is primarily attributable to GRPO training itself. 
The bucketing likely contributed to the early jump (68%→72% in epoch 1) and ensured the ~7 persistently hardest samples always received maximum rollouts (10×), but its marginal contribution beyond uniform sampling is hard to isolate without a control run.\n\n**When bucketing would matter more:** datasets with graded\u002Fpartial rewards, harder tasks where the reward distribution stays multimodal across epochs, or a slower EMA decay that preserves early difficulty signal longer.\n\n### Usage\n\n```bash\nPYTORCH_ALLOC_CONF=expandable_segments:True python playground\u002Ftrain_grpo_math.py\n```\n\nKey config knobs in `GRPOConfig`:\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `max_train_samples` | 500 | Set to `None` for all 7,473 examples |\n| `num_epochs` | 20 | Max training epochs |\n| `early_stop_patience` | 3 | Stop after N epochs without improvement |\n| `batch_size` | 4 | Problems per batch |\n| `base_rollout_steps` | 4 | Rollouts for the easiest (highest-reward) bucket |\n| `rollout_alpha` | 1.5 | Controls rollout scaling for harder buckets |\n| `n_buckets` | 3 | Number of adaptive difficulty buckets |\n| `ema_decay` | 0.7 | EMA smoothing for per-sample reward history |\n| `gpu_memory_utilization` | 0.2 | vLLM memory fraction (leave headroom for HF model) |\n| `eval_every` | 1 | Eval on test set every N epochs |\n| `eval_samples` | 100 | Test problems per eval |\n| `max_new_tokens` | 512 | Tokens for chain-of-thought |\n\n---\n\n## Repository Structure\n\n```\ninference_aware_grpo_training\u002F\n├── data_scheduler.py           # Adaptive curriculum: RewardTracker, AdaptiveBucketizer, DatasetScheduler\n├── entrypoints\u002F\n│   └── llm.py                  # VLLM class (drop-in for vllm.LLM) + get_spec_decode_stats()\n├── v1\u002F\n│   ├── core\u002Fsched\u002F\n│   │   └── scheduler.py        # VLLMScheduler — accumulates spec decode stats per request\n│   └── engine\u002F\n│       ├── core.py             # VLLMEngineCore 
— replaces default scheduler with VLLMScheduler\n│       ├── core_client.py      # VLLMInprocClient — single engine-core creation\n│       └── llm_engine.py       # VLLMEngine — patches make_client to avoid double model load\nplayground\u002F\n├── main.py                     # Spec decode accept rate demo (5 requests)\n├── train_grpo.py               # Generic GRPO training loop\n└── train_grpo_math.py          # GRPO on GSM8K with adaptive curriculum + dynamic rollouts\n```\n\n---\n\n## Setup\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fnaomili0924\u002Finference_aware_grpo_training.git\ncd inference_aware_grpo_training\npip install -e .\n```\n\nRequires vLLM 0.19.0, PyTorch 2.10, and a CUDA GPU.\n",2,"2026-06-11 04:12:20","CREATED_QUERY"]