[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80854":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":8,"language":10,"languages":8,"totalLinesOfCode":8,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":12,"lastSyncTime":27,"discoverSource":28},80854,"MLS-Bench","Imbernoulli\u002FMLS-Bench","Imbernoulli",null,"https:\u002F\u002Fmls-bench.com","Python",43,2,35,1,0,4,8,12,1.43,false,"main",true,[],"2026-06-12 02:04:07","\u003Cdiv align=\"center\">\n\n\u003Ch1 align=\"center\">MLS-Bench\u003C\u002Fh1>\n\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-mls--bench.com-10A37F)](https:\u002F\u002Fmls-bench.com)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.08678-b31b1b)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.08678)\n[![Hugging Face Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Dataset%20%26%20SIFs-FFD21E)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBohan22\u002FMLS-Bench-Tasks)\n[![Docker Hub](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocker%20Hub-bohanlyu2022-2496ED?logo=docker&logoColor=white)](https:\u002F\u002Fhub.docker.com\u002Fu\u002Fbohanlyu2022)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join-5865F2)](https:\u002F\u002Fdiscord.gg\u002FEsxaCZpSAu)\n\n**MLS-Bench** is a benchmark for **machine learning science**. 
Where most agent benchmarks reward engineering one fixed instance — clean the data, tune the pipeline, climb a leaderboard — MLS-Bench asks the harder question: can an AI agent propose a new component, loss, optimizer, or training procedure whose gain transfers across settings, seeds, datasets, and scales?\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fmls-main.png\" alt=\"MLS-Bench overview\" width=\"900\">\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\nThe benchmark contains **140 tasks across 12 ML research domains**. Each task fixes a research scaffold, gives the agent the relevant source code and strong baseline implementations, then asks for one algorithmic change inside a constrained edit surface.\n\n## News\n\n- **2026.5** — **Harbor support**: official Harbor-compatible runtime and pre-rendered task images on Docker Hub under `bohanlyu2022\u002Fmlsbench-harbor-*`. See [`harbor\u002FREADME.md`](harbor\u002FREADME.md).\n- **2026.5** — **Stronger _Sparse L0 Adversarial Attack_ task**: upgraded to the canonical Sparse-RS L0 threat model (k=24, untargeted) against three adversarially-robust RobustBench L2 CIFAR-10 targets (Rebuffi-R18 \u002F Augustin \u002F Engstrom). Strong attacks no longer trivially saturate, leaving real headroom to measure genuine attack improvements.\n- **2026.5** — **Scoring**: the main results table in the [arXiv paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.08678) previously aggregated tasks within each area by geometric mean; switched to arithmetic mean for easier comparison with the per-task numbers. Rankings are unchanged and no conclusions are affected.\n\n## Installation\n\n```bash\npip install -e \".[agent]\"\n```\n\nPython 3.10+ is required. 
MLS-Bench separates the choice of **runtime\nbackend** from the choice of **job scheduler**, and any combination of the\ntwo is supported:\n\n- **Runtime backends**: Docker, Apptainer, or local Conda — selected in\n  your config file via `container_runtime`.\n- **Job schedulers**: SLURM (when a `slurm:` section is present in the\n  config) or the built-in single-node GPU scheduler.\n\n```yaml\ncontainer_runtime: docker      # docker, apptainer, or local\n```\n\n**Recommended setup**: Docker or Apptainer for the runtime, with SLURM as\nthe job scheduler. If SLURM is unavailable, the built-in scheduler can be\ncombined with any of the three runtimes. If neither a container runtime\nnor SLURM is available, the local Conda backend together with the built-in\nscheduler provides a complete fallback (see the section below).\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Running with local Conda environments and the built-in scheduler\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nWhen neither Docker nor Apptainer is available, MLS-Bench can build a\ndedicated Conda environment per package and dispatch jobs through a\nsingle-node GPU queue (`src\u002Fmlsbench\u002Fscheduler.py`). This backend is\nintended for development and small-scale experimentation; for full-scale\nbenchmarking on a cluster we recommend SLURM with one of the container\nruntimes instead. The Conda backend should not be combined with SLURM,\nsince both attempt to schedule GPU jobs.\n\n1. Use a config with `container_runtime: local` and no `slurm:` section.\n   Throughout this section we refer to it as `configs\u002Flocal.yaml`.\n\n2. Build the environment for each package:\n\n   ```bash\n   mlsbench build \u003Cpackage> --config configs\u002Flocal.yaml\n   ```\n\n3. Start the GPU scheduler:\n\n   ```bash\n   nohup python -m mlsbench.scheduler start \\\n     --gpus 0,1,2,3 \\\n     --config configs\u002Flocal.yaml \\\n     > .scheduler\u002Fscheduler.log 2>&1 &\n   ```\n\n4. Launch agents or baselines. 
They enqueue jobs to the scheduler and\n   return immediately:\n\n   ```bash\n   PYTHONPATH=src nohup python3 -m mlsbench agent \u003Ctask> --model \u003Cmodel> \\\n     --config configs\u002Flocal.yaml \\\n     > .scheduler\u002Flogs\u002Fagent_\u003Ctask>.log 2>&1 &\n   ```\n\n5. Inspect or manage the queue:\n\n   ```bash\n   python -m mlsbench.scheduler status\n   python -m mlsbench.scheduler list\n   python -m mlsbench.scheduler cancel \u003Cjob_id>\n   python -m mlsbench.scheduler clear\n   ```\n\nTo rebuild a package's environment from scratch, remove it with\n`conda env remove -n mlsbench-\u003Cpackage>` and re-run `mlsbench build`.\n\n\u003C\u002Fdetails>\n\n## API Keys\n\nRunning an agent requires an API key for the model provider you choose. If\nyou enable the optional web-search tool, a Tavily key is also required.\nConfigure keys in either of two equivalent ways:\n\n**1. Inline in your config file** under the `providers:` block — useful when\nyou want to keep separate configs per environment or per project:\n\n```yaml\nproviders:\n  openai:\n    api_key: \"sk-...\"\n  anthropic:\n    api_key: \"sk-ant-...\"\n  openrouter:\n    api_key: \"sk-or-...\"\n    base_url: \"https:\u002F\u002Fopenrouter.ai\u002Fapi\u002Fv1\"\n  deepseek:\n    api_key: \"sk-...\"\n    base_url: \"https:\u002F\u002Fapi.deepseek.com\u002Fv1\"\n  tavily:\n    api_key: \"tvly-...\"     # only needed if the web_search tool is enabled\n```\n\n**2. 
Environment variables** — leave the `api_key` field empty (or omit the\nprovider entirely) and the CLI falls back to the standard env var for that\nprovider:\n\n| Provider | Env var |\n| --- | --- |\n| OpenAI | `OPENAI_API_KEY` |\n| Anthropic | `ANTHROPIC_API_KEY` |\n| OpenRouter | `OPENROUTER_API_KEY_NEW` |\n| DeepSeek | `DEEPSEEK_API_KEY` |\n| Qwen \u002F DashScope | `QWEN_API_KEY` \u002F `DASHSCOPE_API_KEY` |\n| Gemini \u002F Google | `GEMINI_API_KEY` \u002F `GOOGLE_API_KEY` |\n| Kimi \u002F Moonshot | `KIMI_API_KEY` \u002F `MOONSHOT_API_KEY` |\n| GLM | `GLM_API_KEY` |\n| MiniMax | `MINIMAX_API_KEY` |\n| Tavily (web search) | `TAVILY_API_KEY` |\n\nYou can also use `${ENV_VAR}` interpolation inside the YAML\n(`api_key: \"${OPENAI_API_KEY}\"`) when you want a tracked config file that\nstill resolves the secret from the environment at runtime.\n\nThe model string passed to `mlsbench agent --model \u003Cname>` selects the\nprovider automatically:\n\n- **Bare names** are dispatched by their well-known prefix:\n  `claude-*` → `providers.anthropic`,\n  `gpt-* \u002F o1 \u002F o3 \u002F o4` → `providers.openai`,\n  `deepseek-*` → `providers.deepseek`,\n  `qwen-*` → `providers.qwen`,\n  `gemini-*` → `providers.gemini`,\n  `kimi-* \u002F moonshot-*` → `providers.kimi`,\n  `glm-*` → `providers.glm`,\n  `minimax-*` → `providers.minimax`.\n- **Prefixed names** (`\u003Cprovider>\u002F\u003Cmodel>`, e.g. `openai\u002Fgpt-5.4`,\n  `vertex_ai\u002F...`, `openrouter\u002Fanthropic\u002Fclaude-opus-4.6`) dispatch\n  generically to the matching `providers.\u003Cprovider>` entry. Point that\n  entry's `base_url` at whichever upstream you want — direct API,\n  OpenRouter, a LiteLLM proxy, etc. 
— and the same key is reused.\n\n## Quick Start\n\nFetch external packages and build the runtime (data dependencies are\nprepared automatically as part of the build):\n\n```bash\nmlsbench fetch --name \u003Cpackage>\nmlsbench build \u003Cpackage> --config configs\u002Freact.yaml\n```\n\nRun an agent and compute its task score:\n\n```bash\nmlsbench agent \u003Ctask> --model \u003Cmodel> --config configs\u002Freact.yaml\nmlsbench score task \u003Ctask>\n```\n\nBaseline scores are already populated in each task's `leaderboard.csv`, so\nrunning an agent alone is sufficient to obtain its normalized score under\nthe MLS-Bench evaluation framework. Before launching the agent, however, we\nrecommend running one baseline first to confirm that your environment is\nset up correctly:\n\n```bash\nmlsbench baseline \u003Ctask> --name \u003Cbaseline> --config configs\u002Freact.yaml\n```\n\nBaselines and agents share the same task scripts, parsers, seeds, resource\nlimits, and leaderboard code; only the source of the edits differs.\n\n## Prebuilt Container Images\n\nTo avoid building each package from source, prebuilt images are published\nfor every supported package:\n\n- **Docker Hub**: `bohanlyu2022\u002Fmlsbench-\u003Cpkg>:latest` —\n  \u003Chttps:\u002F\u002Fhub.docker.com\u002Fu\u002Fbohanlyu2022>\n- **Hugging Face (Apptainer SIFs)**: `sif\u002F\u003CPkg>.sif` inside the\n  [Bohan22\u002FMLS-Bench-Tasks](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBohan22\u002FMLS-Bench-Tasks)\n  dataset\n\n`mlsbench agent`, `mlsbench baseline`, and `mlsbench build` automatically\npull the prebuilt image when the local image is missing, and fall back to\nbuilding from source on failure. 
`mlsbench run` performs the same lookup\nbut does not build from source; run `mlsbench build \u003Cpkg>` first if a\nlocal build is required.\n\nTwo mutually-exclusive flags force a specific source for `mlsbench build`:\n\n```bash\nmlsbench build \u003Cpackage> --pull          # use only the prebuilt image\nmlsbench build \u003Cpackage> --local-build   # build locally from the Dockerfile \u002F .def\n```\n\nFor Apptainer, the SIF can be obtained either via `apptainer pull\ndocker:\u002F\u002F...` (default) or from the Hugging Face mirror — a direct HTTPS\ndownload of `sif\u002F\u003CPkg>.sif`, which can be faster in networks where Docker\nregistries are slow. Select the source with `--sif-source {docker,hf,auto}`\non `mlsbench build`.\n\n## Running under Harbor\n\nMLS-Bench's 140 tasks are also available as a [Harbor](https:\u002F\u002Fgithub.com\u002Fharbor-framework\u002Fharbor)\ndataset so any Harbor-supported agent (`claude-code`, `codex`, `openhands`,\n`terminus-2`, …) can be evaluated on the suite without going through this\nrepository's own runner:\n\n```bash\nPYTHONPATH=. harbor run -c run.yaml -a claude-code -m anthropic\u002Fclaude-opus-4-7\n```\n\nThe pre-rendered dataset, GPU-capable environment plugin, and reference\nHarbor config live under [`harbor\u002F`](harbor\u002F). 
See\n[`harbor\u002FREADME.md`](harbor\u002FREADME.md) for usage details and the\nself-contained per-task layout.\n\n## Repository Map\n\n```text\nsrc\u002Fmlsbench\u002F                  CLI, agent loop, execution backends, scoring\ntasks\u002F\u003Ctask>\u002F                  140 task definitions, parsers, scores, baselines\nvendor\u002Fpackages.yaml           External package registry\nvendor\u002Fpkg_configs\u002F\u003Cpackage>\u002F  Package runtime configs and pre-edit patches\nvendor\u002Fdata_scripts\u002F           Dataset and model-cache preparation scripts\nconfigs\u002Freact.yaml             Runtime and provider configuration\nconfigs\u002Fopenevolve.yaml        OpenEvolve defaults\nconfigs\u002Fdiscover.yaml          Discover defaults\nharbor\u002F                        Pre-rendered Harbor dataset (140 tasks) + run config\n```\n\nFetched upstream repositories, built images, downloaded datasets, run workspaces, logs, and scheduler state are intentionally not versioned.\n\n## Full Task Catalog\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Show the 140-task appendix table\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Area | Directory shorthand | Task | Research question | External package(s) | Baselines | Evaluation settings |\n| --- | --- | --- | --- | --- | --- | --- |\n| LM | [agent-tool-reasoning](tasks\u002Fagent-tool-reasoning) | LLM Agent Tool-Use Reasoning Strategy | Studies how tool-use search, backtracking, and stopping policies affect answer validity and query efficiency. 
| [zhichengg\u002FStableToolBench](https:\u002F\u002Fgithub.com\u002Fzhichengg\u002FStableToolBench) | Greedy Chain (CoT)\u003Cbr>DFS with LLM Ranking\u003Cbr>DFSDT | StableToolBench I1-instruction 50q \u002F deepseek-chat\u003Cbr>StableToolBench I1-instruction 50q \u002F qwen2.5-72b-instruct\u003Cbr>StableToolBench I1-instruction 50q \u002F qwen2.5-7b-instruct |\n| LM | [llm-dllm-demask-strategy](tasks\u002Fllm-dllm-demask-strategy) | Masked Diffusion LM: Demasking Strategy | Studies how demasking schedules, position selection, and token assignment affect diffusion language-model quality and decoding efficiency. | [ML-GSAI\u002FLLaDA](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FLLaDA) | Top-K Margin\u003Cbr>Confidence Greedy\u003Cbr>KLASS | LLaDA \u002F MATH-500\u003Cbr>LLaDA \u002F HumanEval\u003Cbr>Dream \u002F C4 prefix continuation |\n| LM | [llm-pretrain-attention](tasks\u002Fllm-pretrain-attention) | Autoregressive Attention Mechanism | Studies how self-attention computation and positional handling affect autoregressive pretraining loss and downstream accuracy. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | QK-Norm\u003Cbr>RoPE\u003Cbr>RoPE + QK-Norm | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-bitlinear](tasks\u002Fllm-pretrain-bitlinear) | Low-Bit Linear Pretraining Layer | Studies how low-bit linear layers and quantization functions affect pretraining loss under discrete weight constraints. 
| [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | Binary Sign (BitNet)\u003Cbr>Ternary 1.58-bit (BitNet b1.58)\u003Cbr>INT2 Uniform | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-embedding](tasks\u002Fllm-pretrain-embedding) | Autoregressive Embedding Strategy | Studies how token embeddings, position embeddings, value embeddings, and weight tying affect autoregressive pretraining loss and downstream accuracy. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | Untied Embeddings\u003Cbr>Value Embeddings\u003Cbr>Bigram Hash Embeddings | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-linear-attention](tasks\u002Fllm-pretrain-linear-attention) | Subquadratic Attention Mechanism | Studies whether linear or subquadratic attention can reduce autoregressive validation loss while preserving downstream performance. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | RetNet\u003Cbr>DeltaNet\u003Cbr>GLA | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-loss](tasks\u002Fllm-pretrain-loss) | Autoregressive Pretraining Loss | Studies how alternative next-token training losses affect autoregressive validation cross-entropy. 
| [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | Label Smoothing\u003Cbr>Softcap Cross-Entropy\u003Cbr>Z-Loss | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-lr-schedule](tasks\u002Fllm-pretrain-lr-schedule) | Pretraining Learning-Rate Schedule | Studies how warmup, decay shape, and schedule horizon affect autoregressive pretraining validation loss. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | WSD (Warmup-Stable-Decay)\u003Cbr>Trapezoidal\u003Cbr>WSD with Inverse-Sqrt Decay | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-mlp](tasks\u002Fllm-pretrain-mlp) | Transformer Feed-Forward Block | Studies how activation, gating, and expansion choices in the feed-forward sublayer affect language-model validation loss. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | ReLU-Squared\u003Cbr>SwiGLU\u003Cbr>GeGLU | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-normalization](tasks\u002Fllm-pretrain-normalization) | Normalization and Block Layout | Studies how normalization placement, affine behavior, and transformer block layout affect pretraining stability and validation loss. 
| [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | RMSNorm\u003Cbr>RMSNorm + Sandwich-Norm\u003Cbr>RMSNorm (Parallel Block) | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-optimizer](tasks\u002Fllm-pretrain-optimizer) | Pretraining Optimizer Design | Studies how optimizer choice, parameter grouping, and schedule coupling affect autoregressive pretraining validation loss. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | AdamW + Nesterov\u003Cbr>Lion\u003Cbr>Muon | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-pretrain-residual](tasks\u002Fllm-pretrain-residual) | Transformer Residual Stream Strategy | Studies how residual connections and information flow across transformer layers affect validation loss, perplexity, and accuracy metrics. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | Vanilla (Pre-LN)\u003Cbr>ProRes\u003Cbr>Learned Scaling\u003Cbr>Block Attention Residuals | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| LM | [llm-rl-advantage](tasks\u002Fllm-rl-advantage) | Reasoning RL Advantage Estimation | Studies how advantage estimates for online language-model reinforcement learning affect mathematical reasoning accuracy. | [volcengine\u002Fverl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) | GRPO\u003Cbr>Dr. 
GRPO\u003Cbr>Reinforce++ Baseline | GSM8K\u003Cbr>MATH-500\u003Cbr>AMC |\n| LM | [llm-rl-importance-sampling](tasks\u002Fllm-rl-importance-sampling) | Reasoning RL Importance-Sampling Granularity | Studies how importance-sampling ratio granularity and clipping affect online language-model reinforcement learning for reasoning. | [volcengine\u002Fverl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) | Token-Level (Vanilla PPO)\u003Cbr>Sequence-Level (GSPO)\u003Cbr>First-K Tokens | GSM8K\u003Cbr>MATH-500\u003Cbr>AMC |\n| LM | [llm-rl-kl-estimator](tasks\u002Fllm-rl-kl-estimator) | Actor Divergence Estimator for Reasoning RL | Studies how per-token actor KL estimation controls reference-policy drift while preserving reasoning accuracy during online RL. | [volcengine\u002Fverl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) | K1 (Unbiased Log-Ratio)\u003Cbr>K2 (Squared Log-Ratio)\u003Cbr>K3 (Low-Variance KL)\u003Cbr>Absolute Log-Ratio | GSM8K\u003Cbr>MATH-500\u003Cbr>AMC |\n| LM | [llm-rl-reward-normalization](tasks\u002Fllm-rl-reward-normalization) | Pre-Advantage Reward Normalization | Studies how reward normalization before advantage estimation affects reasoning accuracy in online language-model RL. | [volcengine\u002Fverl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) | Outcome-Only (Raw)\u003Cbr>Group-Std Normalization\u003Cbr>Batch-Std Whitening\u003Cbr>Length-Aware Normalization | GSM8K\u003Cbr>MATH-500\u003Cbr>AMC |\n| LM | [llm-scaling-law-discovery](tasks\u002Fllm-scaling-law-discovery) | Symbolic Scaling-Law Discovery | Studies how symbolic functional forms and group-specific coefficients capture held-out scaling behavior. 
| [trevorstephens\u002Fgplearn](https:\u002F\u002Fgithub.com\u002Ftrevorstephens\u002Fgplearn) | Human Exact Form\u003Cbr>SLDAgent-Style\u003Cbr>Kernel Ridge Regression\u003Cbr>XGBoost | SLDBench Vocabulary Scaling\u003Cbr>SLDBench LR x Batch-Size Scaling\u003Cbr>SLDBench Data-Constrained Scaling |\n| LM | [mas-topology](tasks\u002Fmas-topology) | Language-Agent Collaboration Topology | Studies how deterministic collaboration topology affects multi-agent code-generation quality and execution success. | [OpenBMB\u002FChatDev](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FChatDev) | Chain\u003Cbr>Star\u003Cbr>Layered | HumanEval-33 (deepseek-chat, 4 agents)\u003Cbr>HumanEval-33 (qwen2.5-72b-instruct, 4 agents)\u003Cbr>SRDD-20 (deepseek-chat, 4 agents) |\n| Rob | [jepa-planning](tasks\u002Fjepa-planning) | Latent World-Model Planner | Studies how goal-conditioned planning should exploit a fixed latent world model to improve navigation success. | [facebookresearch\u002Feb_jepa](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Feb_jepa) | Random\u003Cbr>CEM\u003Cbr>MPPI\u003Cbr>iCEM | Two Rooms (Horizon 30)\u003Cbr>Two Rooms (Horizon 60)\u003Cbr>Two Rooms (Horizon 90) |\n| Rob | [jepa-prediction-loss](tasks\u002Fjepa-prediction-loss) | Temporal Latent Prediction Loss | Studies how latent prediction objectives affect multi-step video representation quality. | [facebookresearch\u002Feb_jepa](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Feb_jepa) | MSE\u003Cbr>Smooth L1\u003Cbr>Cosine | Moving MNIST AP (small: henc=16, dstc=8, hpre=16)\u003Cbr>Moving MNIST AP (base: henc=32, dstc=16, hpre=32)\u003Cbr>Moving MNIST AP (large: henc=64, dstc=32, hpre=64) |\n| Rob | [jepa-regularizer](tasks\u002Fjepa-regularizer) | Anti-Collapse Representation Regularizer | Studies how self-supervised regularization prevents representation collapse and improves linear-probe accuracy. 
| [facebookresearch\u002Feb_jepa](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Feb_jepa) | Naive\u003Cbr>VICReg\u003Cbr>SigReg\u003Cbr>Barlow Twins | ResNet-18 Probe\u003Cbr>ResNet-34 Probe\u003Cbr>ResNet-50 Probe |\n| Rob | [robo-diffusion-guidance](tasks\u002Frobo-diffusion-guidance) | Diffusion Guidance for Robot Trajectory Planning | Studies guidance mechanisms for a fixed trajectory-level diffusion planner on D4RL MuJoCo, optimizing normalized score across hopper-medium-v2, walker2d-medium-v2, and halfcheetah-medium-v2. | [CleanDiffuserTeam\u002FCleanDiffuser](https:\u002F\u002Fgithub.com\u002FCleanDiffuserTeam\u002FCleanDiffuser) | Diffuser (Classifier Guidance)\u003Cbr>Classifier-Free Guidance\u003Cbr>No Guidance\u003Cbr>Decision Diffuser | D4RL Hopper-Medium-v2\u003Cbr>D4RL Walker2d-Medium-v2\u003Cbr>D4RL HalfCheetah-Medium-v2 |\n| Rob | [robo-diffusion-policy](tasks\u002Frobo-diffusion-policy) | Diffusion Policy Learning for Robot Control | Studies how diffusion policy training, value guidance, and action generation affect robot-control episode reward. | [CleanDiffuserTeam\u002FCleanDiffuser](https:\u002F\u002Fgithub.com\u002FCleanDiffuserTeam\u002FCleanDiffuser) | DQL (Diffusion Q-Learning)\u003Cbr>IDQL\u003Cbr>Diffusion Policy | D4RL Hopper-Medium-v2\u003Cbr>D4RL Walker2d-Medium-v2\u003Cbr>D4RL HalfCheetah-Medium-v2 |\n| Rob | [robo-diffusion-sampling-method](tasks\u002Frobo-diffusion-sampling-method) | Efficient Diffusion Sampling for Robot Actions | Studies how solver choice and sampling_steps affect DQL-style diffusion-policy normalized score at low NFE on D4RL MuJoCo. 
| [CleanDiffuserTeam\u002FCleanDiffuser](https:\u002F\u002Fgithub.com\u002FCleanDiffuserTeam\u002FCleanDiffuser) | DDPM (100-Step Ancestral Sampling)\u003Cbr>DDIM (20-Step Deterministic Sampling)\u003Cbr>DPM-Solver++ 2M (10-Step) | D4RL Hopper-Medium-v2\u003Cbr>D4RL Walker2d-Medium-v2\u003Cbr>D4RL HalfCheetah-Medium-v2 |\n| Rob | [robo-humanoid-sim2real-algo](tasks\u002Frobo-humanoid-sim2real-algo) | Humanoid Transfer Policy Learning | Studies how actor-critic architecture, policy optimization, and rollout processing affect humanoid command-following transfer. | [roboterax\u002Fhumanoid-gym](https:\u002F\u002Fgithub.com\u002Froboterax\u002Fhumanoid-gym) | Default PPO\u003Cbr>PPO with Adaptive KL\u003Cbr>PPO with LayerNorm | RobotEra XBot-L Training\u003Cbr>RobotEra XBot-L \u002F Diverse Commands\u003Cbr>RobotEra XBot-L \u002F Forward-Only\u003Cbr>RobotEra XBot-L \u002F High Speed |\n| Rob | [robomimic-bc-loss](tasks\u002Frobomimic-bc-loss) | Behavioral Cloning Loss for Manipulation | Studies how imitation-learning loss design affects rollout success for low-dimensional robot manipulation tasks. | [ARISE-Initiative\u002Frobomimic](https:\u002F\u002Fgithub.com\u002FARISE-Initiative\u002Frobomimic) | NLL with Entropy\u003Cbr>Weighted NLL\u003Cbr>Default (NLL) | Tool Hang (PH)\u003Cbr>Can (PH)\u003Cbr>Square (PH) |\n| Rob | [robomimic-iql-vf](tasks\u002Frobomimic-iql-vf) | Offline Value Loss for Manipulation | Studies how asymmetric value regression loss design affects offline robot manipulation policy success. 
| [ARISE-Initiative\u002Frobomimic](https:\u002F\u002Fgithub.com\u002FARISE-Initiative\u002Frobomimic) | Quantile Regression\u003Cbr>Huber Pinball\u003Cbr>Default (Expectile) | Tool Hang (PH)\u003Cbr>Can (PH)\u003Cbr>Square (PH) |\n| Rob | [robomimic-obs-encoder](tasks\u002Frobomimic-obs-encoder) | Observation Fusion Encoder for Imitation Learning | Designs a multimodal robot state encoder for behavioral cloning to improve rollout success rate on manipulation tasks. | [ARISE-Initiative\u002Frobomimic](https:\u002F\u002Fgithub.com\u002FARISE-Initiative\u002Frobomimic) | Attention Fusion\u003Cbr>Gated Fusion\u003Cbr>Default (Concatenation) | Tool Hang (PH)\u003Cbr>Can (PH)\u003Cbr>Square (PH) |\n| Rob | [tdmpc2-planning](tasks\u002Ftdmpc2-planning) | Trajectory Optimization for Model-Based Planning | An online planning algorithm selects actions through learned-world-model trajectory optimization to improve episode reward. | [nicklashansen\u002Ftdmpc2](https:\u002F\u002Fgithub.com\u002Fnicklashansen\u002Ftdmpc2) | CEM\u003Cbr>iCEM\u003Cbr>MPPI | Walker Walk\u003Cbr>Cheetah Run\u003Cbr>Cartpole Swingup |\n| Rob | [tdmpc2-simnorm](tasks\u002Ftdmpc2-simnorm) | Latent Representation Normalization for Model-Based RL | Designs latent-state normalization for the TD-MPC2 encoder and dynamics world-model networks, evaluated by DMControl episode reward. | [nicklashansen\u002Ftdmpc2](https:\u002F\u002Fgithub.com\u002Fnicklashansen\u002Ftdmpc2) | SimNorm\u003Cbr>L2 normalization\u003Cbr>RMSNorm\u003Cbr>Identity (no normalization) | DMControl walker-walk\u003Cbr>DMControl cheetah-run\u003Cbr>DMControl cartpole-swingup |\n| V&G | [cv-3dgs-densification](tasks\u002Fcv-3dgs-densification) | 3D Gaussian Splatting Densification Strategy Design | Designs a 3D Gaussian Splatting densification strategy controlling clone, split, prune, reset, relocation, and sample-add behavior to improve held-out novel-view quality on Mip-NeRF 360 scenes. 
| [nerfstudio-project\u002Fgsplat](https:\u002F\u002Fgithub.com\u002Fnerfstudio-project\u002Fgsplat) | Original 3DGS densification\u003Cbr>AbsGS + Taming-3DGS + New Split\u003Cbr>EDC-TamingGS-Abs | Mip-NeRF 360 garden (8x, best PSNR)\u003Cbr>Mip-NeRF 360 bicycle (8x, best PSNR)\u003Cbr>Mip-NeRF 360 bonsai (8x, best PSNR)\u003Cbr>Mip-NeRF 360 stump (8x, best PSNR) |\n| V&G | [cv-3dgs-regularizer](tasks\u002Fcv-3dgs-regularizer) | 3D Gaussian Splatting Regularizer Design | Designs a scalar regularizer added to the 3DGS photometric loss during 30k-step Mip-NeRF 360 reconstruction, evaluated on held-out novel views and scored by best PSNR. | [nerfstudio-project\u002Fgsplat](https:\u002F\u002Fgithub.com\u002Fnerfstudio-project\u002Fgsplat) | No regularization\u003Cbr>Scale + opacity L1\u003Cbr>Effective-rank + scale\u002Fopacity L1 | Mip-NeRF 360 garden (8x, best PSNR)\u003Cbr>Mip-NeRF 360 bicycle (8x, best PSNR)\u003Cbr>Mip-NeRF 360 bonsai (8x, best PSNR)\u003Cbr>Mip-NeRF 360 stump (8x, best PSNR) |\n| V&G | [cv-dbm-sampler](tasks\u002Fcv-dbm-sampler) | Custom Sampler for Diffusion Bridge Models | Designs a low-NFE sampler for Diffusion Bridge Models on image-to-image translation, ImageNet center-inpainting, and DIODE depth, evaluated by FID at NFE=5. | [thu-ml\u002FDiffusionBridge](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FDiffusionBridge) | DBIM\u003Cbr>DBIM-HO (high-order)\u003Cbr>DDBM (50 NFE reference)\u003Cbr>ECSI | Edges2Handbags \u002F e2h (FID, NFE=5)\u003Cbr>ImageNet center-inpaint (FID, NFE=5)\u003Cbr>DIODE depth (FID, NFE=5) |\n| V&G | [cv-dbm-scheduler](tasks\u002Fcv-dbm-scheduler) | Time Scheduler for Diffusion Bridge Models (NFE=5) | Designs a monotone low-step time schedule for Diffusion Bridge Models, evaluated by FID on Edges2Handbags, ImageNet center-inpainting, and DIODE depth at NFE=5. 
| [thu-ml\u002FDiffusionBridge](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FDiffusionBridge) | Karras EDM (rho=7)\u003Cbr>Uniform (linear)\u003Cbr>Cosine (Nichol-Dhariwal)\u003Cbr>Log-linear (geometric) | Edges2Handbags \u002F e2h (FID, NFE=5)\u003Cbr>ImageNet center-inpaint (FID, NFE=5)\u003Cbr>DIODE depth (FID, NFE=5) |\n| V&G | [cv-diffusion-architecture](tasks\u002Fcv-diffusion-architecture) | Diffusion Model Architecture Design | Design a denoising UNet backbone for unconditional CIFAR-10 DDPM training, optimizing best FID with fixed epsilon prediction and 50-step DDIM sampling. | [huggingface\u002Fdiffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) | Standard DDPM U-Net\u003Cbr>Full-Attention U-Net\u003Cbr>No-Attention U-Net | CIFAR-10 DDPM Small\u003Cbr>CIFAR-10 DDPM Medium\u003Cbr>CIFAR-10 DDPM Large |\n| V&G | [cv-diffusion-cfg](tasks\u002Fcv-diffusion-cfg) | Diffusion Model: Classifier-Free Guidance Optimization | Design a classifier-free guidance method for Stable Diffusion text-to-image generation across SD v1.5, Stable Diffusion 2 Base, and Stable Diffusion XL; evaluation generates COCO-caption images and official scoring uses per-model FID. | [CFGpp-diffusion\u002FCFGpp](https:\u002F\u002Fgithub.com\u002FCFGpp-diffusion\u002FCFGpp) | Standard CFG\u003Cbr>CFG++\u003Cbr>Zero-Init CFG++ | Stable Diffusion v1.5 \u002F COCO captions \u002F NFE=10\u003Cbr>Stable Diffusion 2 Base \u002F COCO captions \u002F NFE=10\u003Cbr>Stable Diffusion XL Base 1.0 \u002F COCO captions \u002F NFE=10 |\n| V&G | [cv-diffusion-conditioning](tasks\u002Fcv-diffusion-conditioning) | Class-Conditional Diffusion: Conditioning Injection Methods | Design class-conditioning injection for a CIFAR-10 class-conditional UNet2DModel\u002FDDPM, optimizing best FID with 50-step DDIM sampling. 
| [huggingface\u002Fdiffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) | Concat-FiLM\u003Cbr>Cross-Attention\u003Cbr>AdaLN-Zero | CIFAR-10 Class-Conditional Small UNet2DModel\u003Cbr>CIFAR-10 Class-Conditional Medium UNet2DModel\u003Cbr>CIFAR-10 Class-Conditional Large UNet2DModel |\n| V&G | [cv-diffusion-efficiency](tasks\u002Fcv-diffusion-efficiency) | Diffusion Model: Sampler Efficiency Optimization | Design a Stable Diffusion sampler update rule for COCO-caption text-to-image generation at a fixed NFE=20 budget; official scoring uses per-model FID. | [CFGpp-diffusion\u002FCFGpp](https:\u002F\u002Fgithub.com\u002FCFGpp-diffusion\u002FCFGpp) | DDIM\u003Cbr>DPM++ 3M\u003Cbr>DPM++ 2S | Stable Diffusion v1.5 \u002F COCO captions \u002F NFE=20\u003Cbr>Stable Diffusion 2 Base \u002F COCO captions \u002F NFE=20\u003Cbr>Stable Diffusion XL Base 1.0 \u002F COCO captions \u002F NFE=20 |\n| V&G | [cv-diffusion-prediction](tasks\u002Fcv-diffusion-prediction) | Diffusion Prediction Parameterization | Design a prediction target and consistent x0 inversion for unconditional CIFAR-10 UNet2DModel diffusion, optimizing best FID with 50-step DDIM sampling. | [huggingface\u002Fdiffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) | Epsilon Prediction\u003Cbr>V-Prediction\u003Cbr>X0 Prediction | CIFAR-10 Unconditional Small UNet2DModel\u003Cbr>CIFAR-10 Unconditional Medium UNet2DModel\u003Cbr>CIFAR-10 Unconditional Large UNet2DModel |\n| V&G | [cv-meanflow-perceptual-loss](tasks\u002Fcv-meanflow-perceptual-loss) | Flow Map with Perceptual Loss | Studies whether auxiliary perceptual losses on denoised images improve CIFAR-10 FID for MeanFlow flow-map training with DiT backbones. 
| [snap-research\u002Falphaflow](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002Falphaflow) | Pure MSE Velocity\u003Cbr>MSE + Charbonnier + LPIPS + Gradient + Multiscale\u003Cbr>MSE + LPIPS + Gradient + Multiscale + FFT | CIFAR-10 Small DiT\u003Cbr>CIFAR-10 Medium DiT\u003Cbr>CIFAR-10 Large DiT |\n| V&G | [cv-vae-loss](tasks\u002Fcv-vae-loss) | VAE Loss Function Design for Image Reconstruction | Studies how VAE loss components affect CIFAR-10 AutoencoderKL reconstruction quality, scored primarily by rFID on the full test set. | [huggingface\u002Fdiffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) | L1 + KL\u003Cbr>L1 + LPIPS + KL\u003Cbr>L1 + LPIPS + KL + PatchGAN | CIFAR-10 AutoencoderKL Small\u003Cbr>CIFAR-10 AutoencoderKL Medium\u003Cbr>CIFAR-10 AutoencoderKL Large |\n| RL | [marl-centralized-critic](tasks\u002Fmarl-centralized-critic) | Cooperative MARL Centralized Critic Architecture for MAPPO | Studies centralized critic architectures for MAPPO on SMACLite cooperative MARL maps, scored by greedy-policy test win rate and return. | [uoe-agents\u002Fepymarl](https:\u002F\u002Fgithub.com\u002Fuoe-agents\u002Fepymarl) | IPPO Decentralized Critic\u003Cbr>MAPPO Centralized Critic\u003Cbr>MAT-Style Attention Critic | SMACLite MMM (10-agent heterogeneous)\u003Cbr>SMACLite 2s3z (5-agent heterogeneous)\u003Cbr>SMACLite 3s5z (8-agent heterogeneous) |\n| RL | [meta-rl](tasks\u002Fmeta-rl) | Meta-RL: Context Encoder for PEARL Task Inference | Studies PEARL context encoders that map transition tuples to latent task representations for fast adaptation, evaluated by meta_test_return after 20 meta-training iterations. 
| [katerakelly\u002Foyster](https:\u002F\u002Fgithub.com\u002Fkaterakelly\u002Foyster) | PEARL MLP Context Encoder\u003Cbr>PEARL Recurrent Context Encoder\u003Cbr>PEARL Attention Context Encoder | Half-Cheetah Velocity (30 train\u002F10 test tasks)\u003Cbr>Sparse Point Robot (40 train\u002F10 test tasks)\u003Cbr>Point Robot (40 train\u002F10 test tasks) |\n| RL | [meta-rl-algorithm](tasks\u002Fmeta-rl-algorithm) | Meta-RL Algorithm Design | Studies complete meta-RL algorithm design across task inference, policy conditioning, and meta-training, scored by meta_test_return on held-out tasks after the fixed short-budget protocol. | [katerakelly\u002Foyster](https:\u002F\u002Fgithub.com\u002Fkaterakelly\u002Foyster) | PEARL\u003Cbr>FOCAL\u003Cbr>VariBAD | Half-Cheetah Velocity (30 train\u002F10 test tasks)\u003Cbr>Sparse Point Robot (40 train\u002F10 test tasks)\u003Cbr>Point Robot (40 train\u002F10 test tasks) |\n| RL | [rl-intrinsic-exploration](tasks\u002Frl-intrinsic-exploration) | Intrinsic Exploration for Sparse Rewards | Studies how intrinsic rewards and advantage mixing affect exploration and return in sparse-reward Atari environments. | [vwxyzjn\u002Fcleanrl](https:\u002F\u002Fgithub.com\u002Fvwxyzjn\u002Fcleanrl) | PPO\u003Cbr>RND\u003Cbr>ICM | Tutankham-v5\u003Cbr>Frostbite-v5\u003Cbr>PrivateEye-v5 |\n| RL | [rl-offline-adroit](tasks\u002Frl-offline-adroit) | Offline Dexterous Manipulation from Narrow Demonstrations | Studies how offline RL algorithms learn dexterous manipulation from narrow human demonstration datasets. | [corl-team\u002FCORL](https:\u002F\u002Fgithub.com\u002Fcorl-team\u002FCORL) | IQL\u003Cbr>AWAC\u003Cbr>ReBRAC | Pen-Human-v1\u003Cbr>Hammer-Human-v1\u003Cbr>Door-Cloned-v1 |\n| RL | [rl-offline-continuous](tasks\u002Frl-offline-continuous) | Q-Overestimation Suppression for Offline Continuous Control | Studies how offline continuous-control algorithms suppress out-of-distribution Q-value overestimation. 
| [corl-team\u002FCORL](https:\u002F\u002Fgithub.com\u002Fcorl-team\u002FCORL) | ReBRAC\u003Cbr>TD3-BC\u003Cbr>IQL | HalfCheetah-Medium-v2\u003Cbr>Maze2D-Medium-v1\u003Cbr>Walker2d-Medium-v2 |\n| RL | [rl-offline-off2on](tasks\u002Frl-offline-off2on) | Offline-to-Online Fine-Tuning Without Forgetting | Studies how offline-to-online reinforcement learning prevents forgetting and value collapse during continued interaction. | [corl-team\u002FCORL](https:\u002F\u002Fgithub.com\u002Fcorl-team\u002FCORL) | IQL\u003Cbr>AWAC\u003Cbr>SPOT | Pen-Cloned-v1\u003Cbr>Hammer-Cloned-v1\u003Cbr>Hammer-Expert-v1 |\n| RL | [rl-offpolicy-continuous](tasks\u002Frl-offpolicy-continuous) | Off-Policy Actor-Critic for Continuous Control | Changes off-policy actor-critic update rules, losses, or exploration strategies to improve mean episodic return on continuous-control tasks. | [vwxyzjn\u002Fcleanrl](https:\u002F\u002Fgithub.com\u002Fvwxyzjn\u002Fcleanrl) | DDPG\u003Cbr>TD3\u003Cbr>SAC | HalfCheetah-v4\u003Cbr>Reacher-v4\u003Cbr>Ant-v4 |\n| RL | [rl-onpolicy-continuous](tasks\u002Frl-onpolicy-continuous) | On-Policy Actor-Critic for Continuous Control | Changes on-policy actor-critic objectives, update rules, or exploration mechanisms to improve mean episodic return on continuous-control tasks. | [vwxyzjn\u002Fcleanrl](https:\u002F\u002Fgithub.com\u002Fvwxyzjn\u002Fcleanrl) | PPO\u003Cbr>AWR\u003Cbr>PPO (KL Penalty) | HalfCheetah-v4\u003Cbr>Swimmer-v4\u003Cbr>InvertedDoublePendulum-v4 |\n| RL | [rl-reward-learning](tasks\u002Frl-reward-learning) | Inverse RL Reward Learning from Demonstrations | Studies how reward models learned from expert demonstrations affect downstream policy return in continuous-control locomotion. 
| [HumanCompatibleAI\u002Fimitation](https:\u002F\u002Fgithub.com\u002FHumanCompatibleAI\u002Fimitation) | GAIL\u003Cbr>AIRL\u003Cbr>BC | HalfCheetah-v4\u003Cbr>Hopper-v4\u003Cbr>Walker2d-v4 |\n| RL | [rl-value-atari](tasks\u002Frl-value-atari) | Value-Based Visual Control | Studies how value-based RL losses, update rules, and exploration strategies affect visual-control episodic return. | [vwxyzjn\u002Fcleanrl](https:\u002F\u002Fgithub.com\u002Fvwxyzjn\u002Fcleanrl) | QR-DQN\u003Cbr>C51\u003Cbr>Double-DQN | BreakoutNoFrameskip-v4\u003Cbr>SeaquestNoFrameskip-v4\u003Cbr>PongNoFrameskip-v4 |\n| RL | [rl-value-discrete](tasks\u002Frl-value-discrete) | Value-Based Discrete Control | Changes value estimation, uncertainty handling, or replay-based update rules to improve episodic return on discrete-action control tasks. | [vwxyzjn\u002Fcleanrl](https:\u002F\u002Fgithub.com\u002Fvwxyzjn\u002Fcleanrl) | QR-DQN\u003Cbr>Dueling-DQN\u003Cbr>C51 | CartPole-v1\u003Cbr>LunarLander-v2\u003Cbr>Acrobot-v1 |\n| RL | [safe-rl](tasks\u002Fsafe-rl) | Constraint Handling for Safe RL | Changes Lagrangian or controller-style multiplier updates and cost-reward advantage mixing to improve reward while keeping episode cost below target. | [PKU-Alignment\u002Fomnisafe](https:\u002F\u002Fgithub.com\u002FPKU-Alignment\u002Fomnisafe) | Naive PPO\u003Cbr>Lagrangian PPO\u003Cbr>PID Lagrangian | SafetyPointGoal1-v0\u003Cbr>SafetyCarGoal1-v0\u003Cbr>SafetyPointButton1-v0 |\n| Sys | [dlm-dkv-policy](tasks\u002Fdlm-dkv-policy) | Diffusion LM KV Cache Policy | Studies how token-state refresh intervals, masks, transfer ratios, and fallbacks affect denoising quality and cache reuse. 
| [maomaocun\u002FdLLM-Cache](https:\u002F\u002Fgithub.com\u002Fmaomaocun\u002FdLLM-Cache) | Vanilla (Uncached)\u003Cbr>dLLM-Cache\u003Cbr>d2Cache\u003Cbr>Elastic-Cache | MATH-500\u003Cbr>HumanEval\u003Cbr>ARC-Challenge |\n| Sys | [llm-kv-adaptive-quantization](tasks\u002Fllm-kv-adaptive-quantization) | LLM KV Cache: Adaptive Quantization Policy | Studies adaptive 4-bit KV-cache quantization for instruction-tuned long-context inference, trading benchmark final-score quality against effective KV bits and compression. | [huggingface\u002Ftransformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) | KIVI Overlap (4-bit)\u003Cbr>KVTuner-4 Per-Token\u003Cbr>KVTuner-4 KIVI\u003Cbr>SQuat Subspace (4-bit) | LongBench-E hotpotqa_e QA F1\u003Cbr>LongBench-E passage_retrieval_en_e retrieval score\u003Cbr>LongBench-E repobench-p_e code-similarity score\u003Cbr>NeedleBench NIAH exact phrase retrieval\u003Cbr>GSM8K exact final-answer accuracy |\n| Sys | [llm-kv-selection-budgeting](tasks\u002Fllm-kv-selection-budgeting) | LLM KV Cache Selection Budgeting | Studies how selection and eviction controllers allocate layer budgets and recent windows for quality, latency, and memory tradeoffs. | [huggingface\u002Ftransformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) | Full Attention\u003Cbr>StreamingLLM\u003Cbr>Expected Attention\u003Cbr>LagKV | LongBench-E hotpotqa_e QA F1\u003Cbr>LongBench-E passage_retrieval_en_e retrieval score\u003Cbr>LongBench-E repobench-p_e code-similarity score\u003Cbr>LongBench v2 train split multiple-choice accuracy\u003Cbr>GSM8K exact final-answer accuracy |\n| Sys | [llm-kv-structural-reduction](tasks\u002Fllm-kv-structural-reduction) | LLM Pretraining: KV-Structural Reduction | Studies GPT-style KV-state structural reduction through MHA, MQA, GQA, and MLA-style latent KV compression under fixed nanoGPT pretraining. 
| [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | MHA\u003Cbr>MQA\u003Cbr>GQA\u003Cbr>MLA | ClimbMix val loss + KV bytes\u002Ftoken + WikiText-2\u002FWikiText-103\u002FLAMBADA heldout loss\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| Sys | [llm-pretrain-kernel](tasks\u002Fllm-pretrain-kernel) | LLM Pretraining: Custom GPU Kernel Optimization | Studies custom\u002Ffused MLP kernels for nanoGPT pretraining while preserving ClimbMix validation, held-out perplexity, and downstream lm-eval quality. | [karpathy\u002FnanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)\u003Cbr>[EleutherAI\u002Flm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) | ReLU-Squared (Torch)\u003Cbr>Triton GELU\u003Cbr>Triton ReLU-Squared (Fused) | ClimbMix val loss + WikiText-2\u002FLAMBADA PPL\u003Cbr>HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |\n| Sys | [llm-ptq-algorithm](tasks\u002Fllm-ptq-algorithm) | LLM Post-Training Quantization (PTQ) Algorithm | Design a post-training quantization algorithm for a pretrained LLM that minimizes WikiText-2 perplexity degradation under INT4\u002FINT3 group quantization without retraining. | [IST-DASLab\u002Fgptq](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fgptq) | Round-to-Nearest (RTN)\u003Cbr>GPTQ\u003Cbr>AWQ | PTQ INT4\u003Cbr>PTQ INT3\u003Cbr>PTQ INT4 (g64) |\n| Sys | [llm-qat-algorithm](tasks\u002Fllm-qat-algorithm) | LLM Quantization-Aware Training (QAT) Algorithm | Design a quantization-aware training algorithm for a pretrained LLM that minimizes WikiText-2 perplexity after INT4\u002FINT3\u002FINT2 quantization at inference time. 
| custom | No QAT\u003Cbr>STE\u003Cbr>LSQ\u003Cbr>Finetune + PTQ | QAT INT4\u003Cbr>QAT INT3\u003Cbr>QAT INT2 |\n| Sys | [mlsys-fused-attention](tasks\u002Fmlsys-fused-attention) | Fused Attention Kernel Design for H100 GPUs | Design an OpenAI Triton fused self-attention forward kernel for H100 GPUs that maximizes throughput (TFLOPs\u002Fs) while preserving numerical correctness. | [Dao-AILab\u002Fflash-attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) | FlashAttention\u003Cbr>FlashAttention-2\u003Cbr>FlashAttention-3 | Head Dim 64 \u002F Seq 4K\u003Cbr>Head Dim 128 \u002F Seq 8K\u003Cbr>Head Dim 256 \u002F Seq 16K |\n| Sys | [mlsys-moe-load-balance](tasks\u002Fmlsys-moe-load-balance) | MoE Expert Parallelism Load Balancing | Design an efficient MoE expert-replica placement algorithm that minimizes GPU\u002Fnode load imbalance while preserving inter-node locality and low runtime. | [deepseek-ai\u002Feplb](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002Feplb) | Greedy\u003Cbr>Zigzag\u003Cbr>Flat Zigzag | DeepSeek-V3\u003Cbr>Qwen3-MoE\u003Cbr>DeepSeek-V2\u003Cbr>Stress-Skew |\n| Sys | [mlsys-sparse-attention-inference](tasks\u002Fmlsys-sparse-attention-inference) | Long-Context Inference-Time Sparse Attention | Design an inference-time sparse attention module for a pretrained instruction-tuned causal LLM that preserves NIAH and LongBench quality under a 25% density budget without retraining. | custom | Dense\u003Cbr>StreamingLLM\u003Cbr>BigBird\u003Cbr>Block Top-K | NIAH (8K)\u003Cbr>LongBench Qasper\u003Cbr>LongBench MultiFieldQA-EN |\n| Sci | [ai4bio-mutation-effect-prediction](tasks\u002Fai4bio-mutation-effect-prediction) | Mutation Fitness Predictor | Studies how mutant and wild-type protein representations can predict functional effects of sequence mutations. 
| [OATML-Markslab\u002FProteinGym](https:\u002F\u002Fgithub.com\u002FOATML-Markslab\u002FProteinGym) | Ridge Regression\u003Cbr>MLP\u003Cbr>Reshape CNN | BLAT_ECOLX\u003Cbr>ESTA_BACSU\u003Cbr>RASH_HUMAN |\n| Sci | [ai4bio-protein-inverse-folding](tasks\u002Fai4bio-protein-inverse-folding) | Backbone-to-Sequence Inverse Folding | Studies how geometric structure encoding and sequence decoding recover amino-acid sequences from protein backbones. | [A4Bio\u002FProteinInvBench](https:\u002F\u002Fgithub.com\u002FA4Bio\u002FProteinInvBench) | ProteinMPNN\u003Cbr>PiFold\u003Cbr>GVP | CATH 4.2\u003Cbr>CATH 4.3\u003Cbr>TS50 |\n| Sci | [ai4bio-protein-structure-repr](tasks\u002Fai4bio-protein-structure-repr) | Geometric Protein Structure Encoder | Studies how local and global geometric protein representations transfer to structure-aware function prediction. | [a-r-j\u002FProteinWorkshop](https:\u002F\u002Fgithub.com\u002Fa-r-j\u002FProteinWorkshop) | SchNet\u003Cbr>EGNN\u003Cbr>GearNet | EC\u003Cbr>GO-BP\u003Cbr>Fold |\n| Sci | [ai4sci-climate-emulation](tasks\u002Fai4sci-climate-emulation) | Atmospheric Column Emulator Architecture | Studies how neural emulator architecture maps vertical atmospheric states to sub-grid physics tendencies across training budgets. | [leap-stc\u002FClimSim](https:\u002F\u002Fgithub.com\u002Fleap-stc\u002FClimSim) | CNN\u003Cbr>Encoder-Decoder\u003Cbr>U-Net\u003Cbr>HSR | Short Budget\u003Cbr>Medium Budget\u003Cbr>Long Budget |\n| Sci | [ai4sci-inverse-diffusion-algo](tasks\u002Fai4sci-inverse-diffusion-algo) | Diffusion-Prior Inverse Solver | Studies how diffusion priors and measurement guidance can be combined for inverse-problem reconstruction. 
| [devzhk\u002FInverseBench](https:\u002F\u002Fgithub.com\u002Fdevzhk\u002FInverseBench) | DPS\u003Cbr>REDDiff\u003Cbr>LGD | Inverse Scattering\u003Cbr>Black Hole Imaging\u003Cbr>Inpainting |\n| Sci | [ai4sci-mol-property-prediction](tasks\u002Fai4sci-mol-property-prediction) | Molecular Representation Predictor | Studies how molecular graph and geometric representations improve property prediction under scaffold-based generalization. | [deepmodeling\u002FUni-Mol](https:\u002F\u002Fgithub.com\u002Fdeepmodeling\u002FUni-Mol) | D-MPNN\u003Cbr>Uni-Mol\u003Cbr>GIN | BBBP\u003Cbr>BACE\u003Cbr>Tox21 |\n| Sci | [ai4sci-pla-binding-affinity](tasks\u002Fai4sci-pla-binding-affinity) | Protein-Ligand Interaction Model | Studies how intra- and inter-molecular geometric interactions should be represented to predict binding affinity. | [guaguabujianle\u002FEHIGN_PLA](https:\u002F\u002Fgithub.com\u002Fguaguabujianle\u002FEHIGN_PLA) | EHIGN\u003Cbr>GIGN\u003Cbr>SchNet\u003Cbr>EGNN | PDBbind 2013\u003Cbr>PDBbind 2016\u003Cbr>PDBbind 2019 |\n| Sci | [ai4sci-vs-contrastive-scoring](tasks\u002Fai4sci-vs-contrastive-scoring) | Contrastive Virtual-Screening Objective | Studies how projection geometry and contrastive losses affect zero-shot protein-ligand screening quality. | [jianhuiwemi\u002FHypSeek](https:\u002F\u002Fgithub.com\u002Fjianhuiwemi\u002FHypSeek) | Vanilla CLIP\u003Cbr>HCC\u003Cbr>HCC + Hyperbolic Cone | HypSeek Training\u003Cbr>DUD-E\u003Cbr>LIT-PCBA\u003Cbr>DEKOIS 2.0 |\n| Sci | [ai4sci-weather-forecast-aggregation](tasks\u002Fai4sci-weather-forecast-aggregation) | Weather Forecast Variable Aggregation | Studies how weather forecasting models aggregate information across heterogeneous meteorological variables for optimal prediction. 
| [microsoft\u002FClimaX](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FClimaX) | Cross-Attention\u003Cbr>Mean Pooling\u003Cbr>Learned Weighted Sum | Z500 3-Day\u003Cbr>T850 5-Day\u003Cbr>10m-Wind 7-Day |\n| Sci | [pde-design-solver](tasks\u002Fpde-design-solver) | Industrial CFD Design: Custom Neural Operator Design | Designs and implements a custom neural operator for industrial aerodynamic design prediction on 3D unstructured point clouds. | [thuml\u002FNeural-Solver-Library](https:\u002F\u002Fgithub.com\u002Fthuml\u002FNeural-Solver-Library) | PointNet\u003Cbr>GraphSAGE\u003Cbr>Graph U-Net\u003Cbr>Transolver | Car Design\u003Cbr>AirfRANS\u003Cbr>Aircraft Design |\n| Opt | [optimization-bilevel](tasks\u002Foptimization-bilevel) | Optimization Bilevel | Studies a fixed bilevel-optimization benchmark based on Shen and Chen's penalty-based bilevel gradient descent experiments, selecting supported methods and tuning paper-style strategy hyperparameters. | [hanshen95\u002Fpenalized-bilevel-gradient-descent](https:\u002F\u002Fgithub.com\u002Fhanshen95\u002Fpenalized-bilevel-gradient-descent) | V-PBGD\u003Cbr>G-PBGD\u003Cbr>RHG\u003Cbr>T-RHG | Toy Convergence\u003Cbr>HyperClean (Linear)\u003Cbr>HyperClean (MLP) |\n| Opt | [optimization-convex-concave](tasks\u002Foptimization-convex-concave) | RAIN Convex-Concave | Studies gradient-norm convergence on the exact convex-concave benchmark instances used by the official RAIN bilinear and delta-function scripts. | [TrueNobility303\u002FRAIN](https:\u002F\u002Fgithub.com\u002FTrueNobility303\u002FRAIN) | SEG\u003Cbr>R-SEG\u003Cbr>SEAG\u003Cbr>RAIN | Default Noise\u003Cbr>Low Noise\u003Cbr>High Noise |\n| Opt | [optimization-diagonal-net](tasks\u002Foptimization-diagonal-net) | Optimizer Design for Diagonal-Net Sparse Recovery | Designs an optimizer that recovers a sparse linear predictor from fewer training samples under a diagonal-net parameterization with noisy labels. 
| [TrueNobility303\u002FRAIN](https:\u002F\u002Fgithub.com\u002FTrueNobility303\u002FRAIN) | SGD\u003Cbr>AdaGrad\u003Cbr>Adam\u003Cbr>Adam (Alt.) | d=200, k=5, s=0.1\u003Cbr>d=500, k=10, s=0.1\u003Cbr>d=500, k=10, s=0.2\u003Cbr>d=10000, k=50 |\n| Opt | [optimization-dp-sgd](tasks\u002Foptimization-dp-sgd) | Differentially Private SGD: Privacy-Utility Optimization | Design an improved DP-SGD variant that achieves higher test accuracy under the same (epsilon, delta)-differential privacy budget. | custom | Standard DP-SGD\u003Cbr>Automatic Clipping (AUTO-S)\u003Cbr>Adaptive Quantile Clipping\u003Cbr>Step-Decay Noise Schedule | MNIST\u003Cbr>Fashion-MNIST\u003Cbr>CIFAR-10 |\n| Opt | [optimization-evolution-strategy](tasks\u002Foptimization-evolution-strategy) | Evolutionary Optimization Strategy Design | Design a novel combination of selection, crossover, mutation operators and\u002For evolutionary loop for continuous black-box optimization across multiple benchmark functions. | [DEAP\u002Fdeap](https:\u002F\u002Fgithub.com\u002FDEAP\u002Fdeap) | GA (SBX)\u003Cbr>CMA-ES\u003Cbr>Differential Evolution\u003Cbr>L-SHADE | Rastrigin (30D)\u003Cbr>Rosenbrock (30D)\u003Cbr>Ackley (30D)\u003Cbr>Rastrigin (100D) |\n| Opt | [optimization-gradient-compression](tasks\u002Foptimization-gradient-compression) | Gradient Compression for Communication-Efficient Distributed Training | Design a gradient compression operator that reduces communication cost in distributed training while maintaining convergence quality. 
| custom | TopK Sparsification with Error Feedback\u003Cbr>QSGD (Quantized SGD)\u003Cbr>SignSGD | ResNet-20 \u002F CIFAR-10\u003Cbr>VGG-11-BN \u002F CIFAR-100\u003Cbr>ResNet-56 \u002F CIFAR-10 |\n| Opt | [optimization-hyperparameter-search](tasks\u002Foptimization-hyperparameter-search) | Hyperparameter Optimization: Custom Search Strategy Design | Design a custom HPO strategy that improves final validation score and convergence under limited multi-fidelity evaluation budgets. | custom | Random Search\u003Cbr>TPE\u003Cbr>Hyperband\u003Cbr>DEHB\u003Cbr>BOHB\u003Cbr>Optuna CMA-ES | XGBoost\u003Cbr>SVM\u003Cbr>Neural Net |\n| Opt | [optimization-multi-objective](tasks\u002Foptimization-multi-objective) | Multi-Objective Optimization: Custom Evolutionary Strategy Design | Design a custom multi-objective evolutionary strategy that improves convergence, diversity, and spread on standard benchmark problems. | [DEAP\u002Fdeap](https:\u002F\u002Fgithub.com\u002FDEAP\u002Fdeap) | NSGA-II\u003Cbr>MOEA\u002FD\u003Cbr>SPEA2\u003Cbr>NSGA-III\u003Cbr>RVEA\u003Cbr>AGE-MOEA | ZDT1\u003Cbr>ZDT3\u003Cbr>DTLZ2\u003Cbr>DTLZ1 |\n| Opt | [optimization-nas](tasks\u002Foptimization-nas) | Sample-Efficient Neural Architecture Search | Design and implement a sample-efficient NAS optimizer that discovers high-performing architectures in the NAS-Bench-201 search space under a strict query budget. | [automl\u002Fnaslib](https:\u002F\u002Fgithub.com\u002Fautoml\u002Fnaslib) | Random Search\u003Cbr>REA\u003Cbr>BANANAS | CIFAR-10\u003Cbr>CIFAR-100\u003Cbr>ImageNet16-120 |\n| Opt | [optimization-online-bandit](tasks\u002Foptimization-online-bandit) | Online Bandits: Exploration-Exploitation Strategy Design | Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings. 
| [SMPyBandits\u002FSMPyBandits](https:\u002F\u002Fgithub.com\u002FSMPyBandits\u002FSMPyBandits) | UCB1\u003Cbr>Thompson Sampling\u003Cbr>KL-UCB | Stochastic MAB\u003Cbr>Contextual Bandit\u003Cbr>Non-Stationary Bandit |\n| Opt | [optimization-pac-bayes-bound](tasks\u002Foptimization-pac-bayes-bound) | PAC-Bayes Generalization Bound Optimization | Design a tighter PAC-Bayes generalization bound by optimizing the bound formulation, prior\u002Fposterior parameterization, and KL divergence estimation for stochastic neural networks. | [mperezortiz\u002FPBB](https:\u002F\u002Fgithub.com\u002Fmperezortiz\u002FPBB) | McAllester\u003Cbr>Catoni\u003Cbr>Quadratic | MNIST (FCN)\u003Cbr>MNIST (CNN)\u003Cbr>FashionMNIST (CNN) |\n| Opt | [optimization-parity](tasks\u002Foptimization-parity) | Optimization Parity | Improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters. | [pytorch\u002Fexamples](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fexamples) | Default\u003Cbr>Multi-Epoch\u003Cbr>No Weight Decay | n=32, k=8\u003Cbr>n=50, k=8\u003Cbr>n=64, k=8 |\n| Opt | [optimization-variance-reduction](tasks\u002Foptimization-variance-reduction) | Variance Reduction for Stochastic Optimization | Design an improved variance reduction strategy for stochastic gradient descent on finite-sum optimization problems. | custom | SVRG\u003Cbr>STORM\u003Cbr>STORM+ | Logistic Regression\u003Cbr>MLP\u003Cbr>Ill-Conditioned |\n| CAL | [meta-fewshot-classification](tasks\u002Fmeta-fewshot-classification) | Few-Shot Image Classification Method | Studies how support encoding, query comparison, and loss design affect episodic few-shot image-classification accuracy. 
| [sicara\u002Feasy-few-shot-learning](https:\u002F\u002Fgithub.com\u002Fsicara\u002Feasy-few-shot-learning) | ProtoNet\u003Cbr>MatchingNet\u003Cbr>RelationNet | Mini-ImageNet 5w-5s\u003Cbr>CIFAR-FS\u003Cbr>CUB |\n| CAL | [meta-inner-loop-optimizer](tasks\u002Fmeta-inner-loop-optimizer) | Meta-Learning Inner-Loop Optimizer | Studies how differentiable inner-loop adaptation rules affect few-shot classification accuracy in gradient-based meta-learning. | [learnables\u002Flearn2learn](https:\u002F\u002Fgithub.com\u002Flearnables\u002Flearn2learn) | MAML\u003Cbr>Meta-SGD\u003Cbr>ANIL | Mini-ImageNet 5w-1s\u003Cbr>Mini-ImageNet 5w-5s\u003Cbr>CIFAR-FS 5w-5s |\n| CAL | [ml-active-learning](tasks\u002Fml-active-learning) | Pool-Based Active Learning Query Strategy | Studies how unlabeled-sample query rules affect accuracy under a fixed labeling budget. | [JordanAsh\u002Fbadge](https:\u002F\u002Fgithub.com\u002FJordanAsh\u002Fbadge) | BADGE\u003Cbr>BAIT\u003Cbr>BALD\u003Cbr>Least Confidence\u003Cbr>Random | Letter\u003Cbr>Spambase\u003Cbr>Splice |\n| CAL | [ml-anomaly-detection](tasks\u002Fml-anomaly-detection) | Unsupervised Tabular Anomaly Detector | Studies how unlabeled anomaly scoring algorithms identify outliers across tabular data distributions. | custom | IF (Isolation Forest)\u003Cbr>LOF\u003Cbr>OCSVM\u003Cbr>ECOD\u003Cbr>COPOD | Cardio\u003Cbr>Thyroid\u003Cbr>Satellite\u003Cbr>Shuttle |\n| CAL | [ml-calibration](tasks\u002Fml-calibration) | Post-Hoc Probability Calibration Mapping | Studies how post-hoc probability transforms improve classifier confidence calibration. 
| custom | Platt\u003Cbr>Temperature Scaling\u003Cbr>Isotonic Regression | RF \u002F MNIST\u003Cbr>MLP \u002F Fashion-MNIST\u003Cbr>GBM \u002F Madelon\u003Cbr>SVM \u002F Breast Cancer |\n| CAL | [ml-clustering-algorithm](tasks\u002Fml-clustering-algorithm) | Geometry-Robust Clustering Algorithm | Studies how clustering objectives and distance metrics handle convex blobs, non-convex moons, and high-dimensional digit data. | custom | K-Means\u003Cbr>DBSCAN\u003Cbr>HDBSCAN | Blobs\u003Cbr>Moons\u003Cbr>Digits |\n| CAL | [ml-continual-regularization](tasks\u002Fml-continual-regularization) | Continual Learning Importance Regularizer | Changes parameter-importance estimation and regularization loss to reduce catastrophic forgetting and improve final average accuracy across contexts. | [GMvandeVen\u002Fcontinual-learning](https:\u002F\u002Fgithub.com\u002FGMvandeVen\u002Fcontinual-learning) | EWC\u003Cbr>SI\u003Cbr>Online EWC | Split-MNIST\u003Cbr>Permuted-MNIST\u003Cbr>Split-CIFAR100 |\n| CAL | [ml-dimensionality-reduction](tasks\u002Fml-dimensionality-reduction) | Nonlinear 2D Structure-Preserving Embedding | Studies how nonlinear dimensionality reduction preserves neighborhood structure in low-dimensional embeddings. | custom | PCA\u003Cbr>t-SNE\u003Cbr>UMAP\u003Cbr>TriMap\u003Cbr>PaCMAP | MNIST\u003Cbr>Fashion-MNIST\u003Cbr>20 Newsgroups |\n| CAL | [ml-ensemble-boosting](tasks\u002Fml-ensemble-boosting) | Adaptive Boosting Weight and Target Strategy | Studies how pseudo-targets, learner weights, and sample reweighting affect boosted ensemble performance. | custom | AdaBoost\u003Cbr>Gradient Boosting\u003Cbr>XGBoost-style | Breast Cancer\u003Cbr>Diabetes\u003Cbr>California Housing |\n| CAL | [ml-federated-aggregation](tasks\u002Fml-federated-aggregation) | Heterogeneous Federated Server Aggregation | Changes server-side client selection and model aggregation to improve federated test accuracy under heterogeneous client data. 
| [adap\u002Fflower](https:\u002F\u002Fgithub.com\u002Fadap\u002Fflower) | FedAvg\u003Cbr>FedProx\u003Cbr>SCAFFOLD | CIFAR-10 (Non-IID alpha=0.1)\u003Cbr>FEMNIST\u003Cbr>Shakespeare |\n| CAL | [ml-missing-data-imputation](tasks\u002Fml-missing-data-imputation) | Correlation-Aware Tabular Imputation | Studies how feature correlations and predictive structure guide missing-value imputation in tabular data. | custom | Mean Imputation\u003Cbr>KNN Imputation\u003Cbr>MICE\u003Cbr>MissForest\u003Cbr>GAIN | Breast Cancer Wisconsin\u003Cbr>Wine\u003Cbr>California Housing |\n| CAL | [ml-selective-deferral](tasks\u002Fml-selective-deferral) | Selective Deferral Under Subgroup Shift | Studies how acceptance and deferral rules trade off selective risk, subgroup robustness, and coverage on AIF360 tabular datasets. | custom | Confidence Thresholding\u003Cbr>Conformal Abstention\u003Cbr>Learned Deferral\u003Cbr>Group-wise Thresholding | Adult\u003Cbr>COMPAS\u003Cbr>Law School GPA |\n| CAL | [ml-subgroup-calibration-shift](tasks\u002Fml-subgroup-calibration-shift) | Shift-Robust Subgroup Calibration | Studies how post-hoc calibration behaves under subgroup distribution shift and worst-group reliability constraints on AIF360 tabular datasets. | custom | Temperature Scaling\u003Cbr>Isotonic Regression\u003Cbr>Beta Calibration\u003Cbr>Group-wise Temperature Scaling | Adult\u003Cbr>COMPAS\u003Cbr>Law School GPA |\n| CAL | [ml-symbolic-regression](tasks\u002Fml-symbolic-regression) | Genetic Programming Search for Symbolic Regression | Studies how symbolic-regression search strategies recover generalizable analytical expressions. 
| [trevorstephens\u002Fgplearn](https:\u002F\u002Fgithub.com\u002Ftrevorstephens\u002Fgplearn) | Standard GP\u003Cbr>Parsimony GP\u003Cbr>Lexicase GP | Nguyen-7\u003Cbr>Nguyen-10\u003Cbr>Koza-3 |\n| DL | [cv-classification-loss](tasks\u002Fcv-classification-loss) | Adaptive Classification Loss | Modify the training loss over logits and labels to improve classification accuracy across image-model families. | custom | Label Smoothing\u003Cbr>Focal Loss\u003Cbr>PolyLoss | ResNet-56 \u002F CIFAR-100\u003Cbr>VGG-16-BN \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| DL | [cv-data-augmentation](tasks\u002Fcv-data-augmentation) | Image Augmentation Policy | Design the training transform pipeline combining geometric, photometric, and erasing operations to improve image-classification generalization. | custom | Cutout\u003Cbr>RandAugment\u003Cbr>TrivialAugmentWide | ResNet-20 \u002F CIFAR-10\u003Cbr>ResNet-56 \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| DL | [cv-multitask-loss](tasks\u002Fcv-multitask-loss) | Hierarchical Classification Loss Weighting | Studies how fine-label and coarse-label objectives should be combined to improve hierarchical image classification. | custom | Uncertainty Weighting\u003Cbr>DWA\u003Cbr>PCGrad | ResNet-20 \u002F CIFAR-100-MT\u003Cbr>ResNet-56 \u002F CIFAR-100-MT\u003Cbr>VGG-16-BN \u002F CIFAR-100-MT |\n| DL | [cv-pooling-aggregation](tasks\u002Fcv-pooling-aggregation) | Spatial Feature Aggregation | Studies how global spatial features should be aggregated to improve image-classification accuracy across convolutional architectures. 
| custom | Global Max\u003Cbr>GeM\u003Cbr>Avg + Max | ResNet-56 \u002F CIFAR-100\u003Cbr>VGG-16-BN \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| DL | [cv-sample-weighting](tasks\u002Fcv-sample-weighting) | Long-Tail Class Reweighting | Studies how class-count statistics should be mapped to loss weights to improve test accuracy on balanced test sets for long-tailed image classification. | custom | Inverse Frequency\u003Cbr>Class-Balanced (Effective Number)\u003Cbr>Balanced Softmax | ResNet-32 \u002F CIFAR-10-LT\u003Cbr>ResNet-32 \u002F CIFAR-100-LT\u003Cbr>VGG-16-BN \u002F CIFAR-100-LT |\n| DL | [dl-activation-function](tasks\u002Fdl-activation-function) | Convolutional Activation Nonlinearity | Studies how drop-in activation functions affect accuracy across convolutional image classifiers. | custom | GELU\u003Cbr>SiLU\u003Cbr>Mish | ResNet-20 \u002F CIFAR-10\u003Cbr>VGG-16-BN \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| DL | [dl-lr-schedule](tasks\u002Fdl-lr-schedule) | Architecture-Aware Learning-Rate Scheduling | Designs an epoch-level learning-rate curve conditioned on architecture and dataset to improve convergence and final classification accuracy. | custom | Cosine\u003Cbr>WarmupCosine\u003Cbr>OneCycle | ResNet-20 \u002F CIFAR-10\u003Cbr>ResNet-56 \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| DL | [dl-normalization](tasks\u002Fdl-normalization) | Normalization Statistics and Affine Design | Studies how normalization statistics and affine behavior affect convolutional training stability and test accuracy. 
| custom | GroupNorm\u003Cbr>Batch-Instance Norm\u003Cbr>Switchable Norm | ResNet-56 \u002F CIFAR-100\u003Cbr>ResNet-110 \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| DL | [dl-regularization](tasks\u002Fdl-regularization) | Adaptive Regularization Loss | Adds a model-, output-, input-, or epoch-dependent regularization term to improve classification generalization beyond standard weight decay. | custom | DropBlock\u003Cbr>Confidence Penalty\u003Cbr>Orthogonal Regularization | ResNet-56 \u002F CIFAR-100\u003Cbr>VGG-16-BN \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| DL | [dl-residual-connection](tasks\u002Fdl-residual-connection) | Residual Block Skip Design | Studies how shortcut transformations and residual branch computation affect optimization and generalization across network depths. | custom | Pre-Activation\u003Cbr>Gated Residual\u003Cbr>Stochastic Depth | ResNet-20 \u002F CIFAR-10\u003Cbr>ResNet-56 \u002F CIFAR-100\u003Cbr>ResNet-110 \u002F CIFAR-100 |\n| DL | [dl-weight-initialization](tasks\u002Fdl-weight-initialization) | DL Weight Initialization Strategy Design | Designs data-independent initialization for convolutional, normalization, and classifier layers to improve convergence and final accuracy. | custom | Kaiming Normal\u003Cbr>Fixup\u003Cbr>Orthogonal | ResNet-56 \u002F CIFAR-100\u003Cbr>VGG-16-BN \u002F CIFAR-100\u003Cbr>MobileNet-V2 \u002F Fashion-MNIST |\n| TS | [quant-concept-drift](tasks\u002Fquant-concept-drift) | Concept-Drift-Aware Quantitative Forecasting | The stock prediction model and data pipeline are redesigned to handle temporal distribution shift and improve signal quality and portfolio metrics. 
| [microsoft\u002Fqlib](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fqlib) | TRA\u003Cbr>AdaRNN\u003Cbr>LightGBM | CSI 300\u003Cbr>CSI 300 (Shifted)\u003Cbr>CSI 300 (Recent) |\n| TS | [quant-graph-stock](tasks\u002Fquant-graph-stock) | Graph-Based Quantitative Forecasting | Studies how inter-asset graph relationships affect return signal quality and portfolio performance. | [microsoft\u002Fqlib](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fqlib) | HIST\u003Cbr>GATs\u003Cbr>LightGBM | CSI 300\u003Cbr>CSI 100\u003Cbr>CSI 300 (Recent) |\n| TS | [quant-stock-prediction](tasks\u002Fquant-stock-prediction) | Quantitative Return Forecasting | Studies how predictive models and input processing affect next-period return signals and portfolio performance. | [microsoft\u002Fqlib](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fqlib) | LightGBM\u003Cbr>LSTM\u003Cbr>Transformer | CSI 300\u003Cbr>CSI 100\u003Cbr>CSI 300 (Recent) |\n| TS | [stf-traffic-forecast](tasks\u002Fstf-traffic-forecast) | Spatial-Temporal Traffic Forecasting Model | Studies how spatial-temporal models capture sensor-network dependencies for traffic forecasting. | [GestaltCogTeam\u002FBasicTS](https:\u002F\u002Fgithub.com\u002FGestaltCogTeam\u002FBasicTS) | STID\u003Cbr>DLinear\u003Cbr>StemGNN\u003Cbr>iTransformer\u003Cbr>TimesNet\u003Cbr>SOFTS\u003Cbr>TimeMixer | METR-LA\u003Cbr>PEMS-BAY\u003Cbr>PEMS04 |\n| TS | [ts-anomaly-detection](tasks\u002Fts-anomaly-detection) | Reconstruction Model for Time-Series Anomaly Detection | An unsupervised reconstruction model detects anomalous multivariate time-series segments to improve F-score. 
| [thuml\u002FTime-Series-Library](https:\u002F\u002Fgithub.com\u002Fthuml\u002FTime-Series-Library) | DLinear\u003Cbr>TimesNet\u003Cbr>PatchTST | PSM\u003Cbr>MSL\u003Cbr>SMAP |\n| TS | [ts-classification](tasks\u002Fts-classification) | Multivariate Time-Series Classification Model | Studies how representation learning improves classification of multivariate time-series signals. | [thuml\u002FTime-Series-Library](https:\u002F\u002Fgithub.com\u002Fthuml\u002FTime-Series-Library) | DLinear\u003Cbr>TimesNet\u003Cbr>PatchTST | EthanolConcentration\u003Cbr>FaceDetection\u003Cbr>Handwriting |\n| TS | [ts-exogenous-forecast](tasks\u002Fts-exogenous-forecast) | Exogenous-Variable Target Forecasting Model | Studies how exogenous variables improve target-channel forecasting. | [thuml\u002FTime-Series-Library](https:\u002F\u002Fgithub.com\u002Fthuml\u002FTime-Series-Library) | DLinear\u003Cbr>PatchTST\u003Cbr>iTransformer\u003Cbr>TimeXer | ETTh1\u003Cbr>Weather\u003Cbr>ECL |\n| TS | [ts-imputation](tasks\u002Fts-imputation) | Masked Multivariate Time-Series Imputation | Studies how imputation models reconstruct missing regions in multivariate time series. | [thuml\u002FTime-Series-Library](https:\u002F\u002Fgithub.com\u002Fthuml\u002FTime-Series-Library) | DLinear\u003Cbr>TimesNet\u003Cbr>PatchTST | ETTh1 (25% missing)\u003Cbr>Weather (25% missing)\u003Cbr>ECL (25% missing) |\n| TS | [ts-long-term-forecast](tasks\u002Fts-long-term-forecast) | Multivariate Long-Horizon Forecasting Model | Studies how long-horizon forecasting models predict future multivariate sequences. 
| [thuml\u002FTime-Series-Library](https:\u002F\u002Fgithub.com\u002Fthuml\u002FTime-Series-Library) | DLinear\u003Cbr>PatchTST\u003Cbr>iTransformer\u003Cbr>TimeMixer\u003Cbr>TimeXer | ETTh1\u003Cbr>Weather\u003Cbr>ECL |\n| TS | [ts-short-term-forecast](tasks\u002Fts-short-term-forecast) | Univariate Short-Horizon Forecasting Model | Studies how short-horizon forecasting models predict seasonal univariate series. | [thuml\u002FTime-Series-Library](https:\u002F\u002Fgithub.com\u002Fthuml\u002FTime-Series-Library) | DLinear\u003Cbr>TimesNet\u003Cbr>PatchTST\u003Cbr>TimeMixer | M4 Monthly\u003Cbr>M4 Quarterly\u003Cbr>M4 Yearly |\n| SCR | [causal-discovery-discrete](tasks\u002Fcausal-discovery-discrete) | Discrete Causal Graph Discovery | Studies how causal discovery algorithms recover equivalence-class graph structure from discrete observational data. | [py-why\u002Fcausal-learn](https:\u002F\u002Fgithub.com\u002Fpy-why\u002Fcausal-learn) | PC\u003Cbr>GES\u003Cbr>GRaSP\u003Cbr>BOSS\u003Cbr>Hill Climbing | Cancer\u003Cbr>Child\u003Cbr>ALARM\u003Cbr>HAILFINDER\u003Cbr>Win95pts |\n| SCR | [causal-observational-linear-gaussian](tasks\u002Fcausal-observational-linear-gaussian) | Linear Gaussian Causal Discovery | Studies how observational algorithms recover causal graph structure under linear Gaussian assumptions. | [py-why\u002Fcausal-learn](https:\u002F\u002Fgithub.com\u002Fpy-why\u002Fcausal-learn) | PC\u003Cbr>GRaSP\u003Cbr>BOSS | ER (n=10)\u003Cbr>ER (n=20)\u003Cbr>SF (n=50)\u003Cbr>SF (n=50, Hard)\u003Cbr>ER (n=20, Noisy) |\n| SCR | [causal-observational-linear-non-gaussian](tasks\u002Fcausal-observational-linear-non-gaussian) | Non-Gaussian Causal Discovery | Studies how non-Gaussian structure can identify directed causal relationships from observational data. 
| [py-why\u002Fcausal-learn](https:\u002F\u002Fgithub.com\u002Fpy-why\u002Fcausal-learn) | ICA-LiNGAM\u003Cbr>DirectLiNGAM\u003Cbr>NOTEARS | ER (n=30)\u003Cbr>ER (n=50)\u003Cbr>SF (n=100) |\n| SCR | [causal-observational-nonlinear](tasks\u002Fcausal-observational-nonlinear) | Nonlinear Causal Discovery | Studies how nonlinear additive-noise assumptions support directed causal graph recovery from observations. | [py-why\u002Fcausal-learn](https:\u002F\u002Fgithub.com\u002Fpy-why\u002Fcausal-learn) | CAM\u003Cbr>NOTEARS-MLP\u003Cbr>DirectLiNGAM\u003Cbr>GraN-DAG | SF (n=20, GP)\u003Cbr>ER (n=20, Gauss)\u003Cbr>ER (n=12, Low-Sample) |\n| SCR | [causal-treatment-effect](tasks\u002Fcausal-treatment-effect) | Heterogeneous Treatment Effect Estimation | Studies how observational estimators recover individual and average treatment effects on synthetic CATE benchmark families. | custom | S-Learner\u003Cbr>T-Learner\u003Cbr>IPW\u003Cbr>Causal Forest\u003Cbr>DR-Learner\u003Cbr>R-Learner | IHDP-inspired Synth\u003Cbr>Jobs\u002FLaLonde-inspired Synth\u003Cbr>ACIC-inspired Synth |\n| SCR | [graph-generation](tasks\u002Fgraph-generation) | Unconditional Graph Generator Architecture | Studies how graph generator architecture affects distributional match to target graph statistics. | [pyg-team\u002Fpytorch_geometric](https:\u002F\u002Fgithub.com\u002Fpyg-team\u002Fpytorch_geometric) | GraphVAE\u003Cbr>GRAN\u003Cbr>DiGress | Community-Small\u003Cbr>Ego-Small\u003Cbr>ENZYMES |\n| SCR | [graph-graph-classification](tasks\u002Fgraph-graph-classification) | Structure-Aware Graph Readout Pooling | Studies how graph-level readout mechanisms affect graph classification accuracy and macro F1 under a fixed message-passing backbone. 
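| [pyg-team\u002Fpytorch_geometric](https:\u002F\u002Fgithub.com\u002Fpyg-team\u002Fpytorch_geometric) | GIN + Sum\u003Cbr>SAGPool\u003Cbr>DiffPool | MUTAG\u003Cbr>PROTEINS\u003Cbr>NCI1 |\n| SCR | [graph-lin","MLS-Bench is a benchmark for machine learning science research. Beyond evaluating engineering ability on fixed instances, it challenges AI agents to propose new components, loss functions, optimizers, or training procedures that remain effective across settings, seeds, datasets, and scales. The project covers 140 tasks across 12 ML research domains; each task provides a research scaffold, the relevant source code, and strong baseline implementations, and asks for an algorithmic improvement within a constrained scope. MLS-Bench supports multiple runtime backends (Docker, Apptainer, or local Conda) and job schedulers (SLURM or a built-in single-node GPU scheduler), and suits research scenarios that call for novel improvements to machine learning models.","2026-06-11 04:02:33","CREATED_QUERY"]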