[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81044":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":11,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":14,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":15,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":9,"pushedAt":9,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":23,"discoverSource":24},81044,"PopRiskMinimization","elonlit\u002FPopRiskMinimization","elonlit","Operationalization of Population Risk Minimization algorithm from \"A Theory of Generalization in Deep Learning.\"",null,"Python",28,3,2,0,38.81,false,"main",true,[],"2026-06-12 04:01:31","# Population Risk Minimization for Neural Networks\n\n`popriskmin` gives you `PRM`, a small modification of AdamW that trains on population risk instead of raw empirical risk. It keeps AdamW's usual one forward pass and one backward pass per step, then preconditions each parameter update with the population-risk mask from Litman & Guo (2026), A Theory of Generalization in Deep Learning.\n\nPRM tracks one extra tensor per parameter tensor: an exponential moving average of centered minibatch gradient variance. For each scalar parameter $k$, it asks whether the batch-mean gradient is larger than the leave-one-out noise estimate:\n\n$$\n\\mu_k^2 > \\alpha \\sigma_k^2\n$$\n\nParameters that pass get the Adam update. Parameters that fail get shrunk or zeroed, depending on the mask. 
On the fresh-batch boundary, $\\alpha = 1$ and the streaming variance estimates the leave-one-out penalty $\\Sigma_B \u002F (b - 1)$.\n\n## Install\n\n```sh\nuv pip install -e .\nuv sync --extra test\n```\n\n`PRM` requires `torch>=2.0`.\n\n## Quick start\n\n```python\nfrom popriskmin import PRM\n\noptimizer = PRM(\n    model.parameters(),\n    lr=3e-4,\n    weight_decay=0.01,\n    softness=1.0,\n    batch_size=32,\n)\n\nfor batch in loader:\n    optimizer.zero_grad()\n    loss = loss_fn(model(batch.x), batch.y)\n    loss.backward()\n    optimizer.step()\n```\n\nUseful options:\n\n```python\nPRM(model.parameters(), mask=\"snr\")         # default, smooth SNR mask\nPRM(model.parameters(), mask=\"soft\")        # strict Algorithm 1 cutoff\nPRM(model.parameters(), mask=\"hard\")        # 0\u002F1 theorem mask\nPRM(model.parameters(), reduction=\"per_tensor\")\n```\n\nUse `reduction=\"per_tensor\"` when each scalar parameter is noisy but the whole parameter tensor has a clear signal. This is often the more useful setting for large generative models, diffusion, and CFM-style training.\n\n## Masks\n\n| `mask` | Formula | Behavior |\n|--------|---------|----------|\n| `snr` | $\\hat{m}^2 \u002F (\\hat{m}^2 + \\lambda_p \\alpha \\hat{s} + \\varepsilon)$ | Default. Smooth, never fully shuts off. With `softness=1`, it gives $q = 1\u002F2$ on the boundary. |\n| `soft` | $\\max(\\hat{m}^2 - \\alpha \\hat{s}, 0) \u002F (\\max(\\hat{m}^2 - \\alpha \\hat{s}, 0) + \\lambda_p \\hat{s} + \\varepsilon)$ | Strict Algorithm 1 mask. Zero below the boundary. |\n| `hard` | $\\mathbf{1}[\\hat{m}^2 > \\alpha \\hat{s}]$ | Binary indicator from Theorem 6.5. Mostly useful for ablations. |\n\n`softness` is $\\lambda_p$. Larger values make the mask more conservative. A reasonable first sweep is `0.3, 1, 3, 10`.\n\n## Arguments\n\n| Argument | Default | Notes |\n|----------|---------|-------|\n| `lr` | `1e-3` | Tune as you would for AdamW. |\n| `betas` | `(0.9, 0.999)` | Adam moment decay rates. 
|\n| `rho` | `0.99` | Decay for the centered gradient variance. Usually shorter than `beta2`. |\n| `eps` | `1e-8` | Stabilizer for Adam and the mask denominator. |\n| `weight_decay` | `0.01` | Decoupled AdamW-style weight decay by default. |\n| `softness` | `1.0` | Population-risk mask regularizer. |\n| `batch_size` | `None` | Optional for `boundary=\"batch\"`. Required for `boundary=\"empirical\"`. |\n| `boundary` | `\"batch\"` | Use `\"batch\"` for online or fresh-batch training. Use `\"empirical\"` for finite-dataset leave-one-out. |\n| `n_dataset` | `None` | Required when `boundary=\"empirical\"`. |\n| `mask` | `\"snr\"` | One of `\"snr\"`, `\"soft\"`, or `\"hard\"`. |\n| `reduction` | `\"per_param\"` | Use one mask per scalar parameter. Set `\"per_tensor\"` to pool each parameter tensor first. |\n| `warmup_steps` | `0` | Force the mask to 1 for the first N optimizer steps. |\n| `amsgrad` | `False` | Use AMSGrad for the Adam denominator. |\n| `bias_correction` | `True` | Apply Adam-style bias correction to `m`, `v`, and `s`. |\n| `decoupled_weight_decay` | `True` | Set to `False` for coupled L2. |\n\n## Diagnostics\n\n```python\nstats = optimizer.get_mask_stats()\nprint(stats)\n```\n\nExample output:\n\n```python\n{\n    \"mean_q\": 0.62,\n    \"active_fraction\": 0.71,\n    \"min_q\": 0.00,\n    \"max_q\": 0.99,\n    \"parameter_count\": 1_245_184,\n    \"noise_scale\": 12.4,\n    \"signal_sq\": 18.7,\n    \"snr\": 1.51,\n}\n```\n\n`mean_q` tells you how open the mask is on average. If `active_fraction` falls\nnear zero, almost every parameter is below the leave-one-out boundary. Try a lower `softness`, a higher learning rate, or `reduction=\"per_tensor\"` before assuming the method is broken.\n\n## When it helps\n\nPRM is meant for settings where plain empirical-risk training fits structured noise or memorizes before it generalizes. 
The paper reports improvements in the following settings:\n\n| Setting | AdamW | PRM |\n|---------|-------|-----|\n| Modular division, 25% train fraction | Groks at step 29,450 | Groks at step 5,950 |\n| PINN with noisy initial condition, $\\beta = 5$ | Best LR-tuned run: 3,300 iterations | 1,400 iterations to relative $\\ell_2 \\le 0.40$ |\n| Qwen2.5-0.5B-Instruct with 30% noisy DPO | Reward accuracy 0.566, drift 0.41 | Reward accuracy 0.641, drift 0.14 |\n\nMore experiments are reported in the appendix. PRM is not magic, however: if AdamW already reaches the solution without overfitting, PRM may not help.\n\n## Example\n\nRun the smoke test:\n\n```sh\nuv run python examples\u002Fsynthetic_regression.py\n```\n\nThe script trains AdamW and several PRM variants on a noisy regression problem. You should see the optimizer learning the underlying signal instead of the corrupted training labels.\n\n## Layout\n\n```text\npopriskmin\u002F\n|-- popriskmin\u002F\n|   |-- __init__.py\n|   |-- mask.py\n|   `-- optimizer.py\n|-- examples\u002F\n|   `-- synthetic_regression.py\n|-- tests\u002F\n|   `-- test_optimizer.py\n`-- pyproject.toml\n```\n\n## Citation\n\n```bibtex\n@misc{litman2026theory,\n  title         = {A Theory of Generalization in Deep Learning},\n  author        = {Litman, Elon and Guo, Gabe},\n  year          = {2026},\n  eprint        = {2605.01172},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.LG},\n  doi           = {10.48550\u002FarXiv.2605.01172}\n}\n```\n","The PopRiskMinimization project implements the Population Risk Minimization algorithm from the paper \"A Theory of Generalization in Deep Learning\". Its core feature is a modification of the AdamW optimizer that bases parameter updates on population risk rather than raw empirical risk: it keeps a single forward and backward pass per step and uses a population-risk mask to adjust gradient updates, which helps improve model generalization. Technically, the project supports several mask options (SNR, soft, and hard), so users can choose the update strategy that fits their needs. It is suited to deep learning settings, especially large generative models, diffusion models, and other cases that call for better training stability and generalization.","2026-06-11 04:03:17","CREATED_QUERY"]