[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72651":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":10,"languages":10,"totalLinesOfCode":10,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},72651,"tuning_playbook","google-research\u002Ftuning_playbook","google-research","A playbook for systematically maximizing the performance of deep learning models.","",null,30183,2423,302,11,0,10,27,80,30,45,"Other",false,"main",[],"2026-06-12 02:03:06","# Deep Learning Tuning Playbook\n\n*This is not an officially supported Google product.*\n\n**Varun Godbole\u003Csup>&dagger;\u003C\u002Fsup>, George E. Dahl\u003Csup>&dagger;\u003C\u002Fsup>, Justin Gilmer\u003Csup>&dagger;\u003C\u002Fsup>, Christopher J. 
Shallue\u003Csup>&Dagger;\u003C\u002Fsup>, Zachary Nado\u003Csup>&dagger;\u003C\u002Fsup>**\n\n\n&dagger; Google Research, Brain Team\n\n&Dagger; Harvard University\n\n## Table of Contents\n\n-   [Who is this document for?](#who-is-this-document-for)\n-   [Why a tuning playbook?](#why-a-tuning-playbook)\n-   [Guide for starting a new project](#guide-for-starting-a-new-project)\n    -   [Choosing the model architecture](#choosing-the-model-architecture)\n    -   [Choosing the optimizer](#choosing-the-optimizer)\n    -   [Choosing the batch size](#choosing-the-batch-size)\n    -   [Choosing the initial configuration](#choosing-the-initial-configuration)\n-   [A scientific approach to improving model performance](#a-scientific-approach-to-improving-model-performance)\n    -   [The incremental tuning strategy](#the-incremental-tuning-strategy)\n    -   [Exploration vs exploitation](#exploration-vs-exploitation)\n    -   [Choosing the goal for the next round of experiments](#choosing-the-goal-for-the-next-round-of-experiments)\n    -   [Designing the next round of experiments](#Designing-the-next-round-of-experiments)\n    -   [Determining whether to adopt a training pipeline change or\n        hyperparameter\n        configuration](#Determining-whether-to-adopt-a-training-pipeline-change-or-hyperparameter-configuration)\n    -   [After exploration concludes](#After-exploration-concludes)\n-   [Determining the number of steps for each training run](#Determining-the-number-of-steps-for-each-training-run)\n    -   [Deciding how long to train when training is not compute-bound](#Deciding-how-long-to-train-when-training-is-not-compute-bound)\n    -   [Deciding how long to train when training is compute-bound](#Deciding-how-long-to-train-when-training-is-compute-bound)\n-   [Additional guidance for the training pipeline](#Additional-guidance-for-the-training-pipeline)\n    -   [Optimizing the input pipeline](#Optimizing-the-input-pipeline)\n    -   [Evaluating model 
performance](#evaluating-model-performance)\n    -   [Saving checkpoints and retrospectively selecting the best checkpoint](#Saving-checkpoints-and-retrospectively-selecting-the-best-checkpoint)\n    -   [Setting up experiment tracking](#Setting-up-experiment-tracking)\n    -   [Batch normalization implementation details](#Batch-normalization-implementation-details)\n    -   [Considerations for multi-host pipelines](#Considerations-for-multi-host-pipelines)\n-   [FAQs](#faqs)\n-   [Acknowledgments](#acknowledgments)\n-   [Citing](#citing)\n-   [Contributing](#contributing)\n\n## Who is this document for?\n\nThis document is for engineers and researchers (both individuals and teams)\ninterested in **maximizing the performance of deep learning models**. We assume\nbasic knowledge of machine learning and deep learning concepts.\n\nOur emphasis is on the **process of hyperparameter tuning**. We touch on other\naspects of deep learning training, such as pipeline implementation and\noptimization, but our treatment of those aspects is not intended to be complete.\n\nWe assume the machine learning problem is a supervised learning problem or\nsomething that looks a lot like one (e.g. self-supervised). That said, some of\nthe prescriptions in this document may also apply to other types of problems.\n\n## Why a tuning playbook?\n\nCurrently, there is an astonishing amount of toil and guesswork involved in\nactually getting deep neural networks to work well in practice. Even worse, the\nactual recipes people use to get good results with deep learning are rarely\ndocumented. Papers gloss over the process that led to their final results in\norder to present a cleaner story, and machine learning engineers working on\ncommercial problems rarely have time to take a step back and generalize their\nprocess. Textbooks tend to eschew practical guidance and prioritize fundamental\nprinciples, even if their authors have the necessary experience in applied work\nto provide useful advice. 
When preparing to create this document, we couldn't\nfind any comprehensive attempt to actually explain *how to get good results with\ndeep learning*. Instead, we found snippets of advice in blog posts and on social\nmedia, tricks peeking out of the appendix of research papers, occasional case\nstudies about one particular project or pipeline, and a lot of confusion. There\nis a vast gulf between the results achieved by deep learning experts and less\nskilled practitioners using superficially similar methods. At the same time,\nthese very experts readily admit some of what they do might not be\nwell-justified. As deep learning matures and has a larger impact on the world,\nthe community needs more resources covering useful recipes, including all the\npractical details that can be so critical for obtaining good results.\n\nWe are a team of five researchers and engineers who have worked in deep learning\nfor many years, some of us since as early as 2006. We have applied deep learning\nto problems in everything from speech recognition to astronomy, and learned a\nlot along the way. This document grew out of our own experience training neural\nnetworks, teaching new machine learning engineers, and advising our colleagues\non the practice of deep learning. Although it has been gratifying to see deep\nlearning go from a machine learning approach practiced by a handful of academic\nlabs to a technology powering products used by billions of people, deep learning\nis still in its infancy as an engineering discipline and we hope this document\nencourages others to help systematize the field's experimental protocols.\n\nThis document came about as we tried to crystalize our own approach to deep\nlearning and thus it represents the opinions of the authors at the time of\nwriting, not any sort of objective truth. 
Our own struggles with hyperparameter\ntuning made it a particular focus of our guidance, but we also cover other\nimportant issues we have encountered in our work (or seen go wrong). Our\nintention is for this work to be a living document that grows and evolves as our\nbeliefs change. For example, the material on debugging and mitigating training\nfailures would not have been possible for us to write two years ago since it is\nbased on recent results and ongoing investigations. Inevitably, some of our\nadvice will need to be updated to account for new results and improved\nworkflows. We do not know the *optimal* deep learning recipe, but until the\ncommunity starts writing down and debating different procedures, we cannot hope\nto find it. To that end, we would encourage readers who find issues with our\nadvice to produce alternative recommendations, along with convincing evidence,\nso we can update the playbook. We would also love to see alternative guides and\nplaybooks that might have different recommendations so we can work towards best\npractices as a community. Finally, any sections marked with a 🤖 emoji are places\nwe would like to do more research. Only after trying to write this playbook did\nit become completely clear how many interesting and neglected research questions\ncan be found in the deep learning practitioner's workflow.\n\n## Guide for starting a new project\n\nMany of the decisions we make over the course of tuning can be made once at the\nbeginning of a project and only occasionally revisited when circumstances\nchange.\n\nOur guidance below makes the following assumptions:\n\n-   Enough of the essential work of problem formulation, data cleaning, etc. 
has\n    already been done that spending time on the model architecture and training\n    configuration makes sense.\n-   There is already a pipeline set up that does training and evaluation, and it\n    is easy to execute training and prediction jobs for various models of\n    interest.\n-   The appropriate metrics have been selected and implemented. These should be\n    as representative as possible of what would be measured in the deployed\n    environment.\n\n### Choosing the model architecture\n\n***Summary:*** *When starting a new project, try to reuse a model that already\nworks.*\n\n-   Choose a well established, commonly used model architecture to get working\n    first. It is always possible to build a custom model later.\n-   Model architectures typically have various hyperparameters that determine\n    the model's size and other details (e.g. number of layers, layer width, type\n    of activation function).\n    -   Thus, choosing the architecture really means choosing a family of\n        different models (one for each setting of the model hyperparameters).\n    -   We will consider the problem of choosing the model hyperparameters in\n        [Choosing the initial configuration](#choosing-the-initial-configuration)\n        and\n        [A scientific approach to improving model performance](#a-scientific-approach-to-improving-model-performance).\n-   When possible, try to find a paper that tackles something as close as\n    possible to the problem at hand and reproduce that model as a starting\n    point.\n\n### Choosing the optimizer\n\n***Summary:*** *Start with the most popular optimizer for the type of problem at\nhand.*\n\n-   No optimizer is the \"best\" across all types of machine learning problems and\n    model architectures. 
Even just\n    [comparing the performance of optimizers is a difficult task](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.05446).\n    🤖\n-   We recommend sticking with well-established, popular optimizers, especially\n    when starting a new project.\n    -   Ideally, choose the most popular optimizer used for the same type of\n        problem.\n-   Be prepared to give attention to **\\*****all****\\*** hyperparameters of the\n    chosen optimizer.\n    -   Optimizers with more hyperparameters may require more tuning effort to\n        find the best configuration.\n    -   This is particularly relevant in the beginning stages of a project when\n        we are trying to find the best values of various other hyperparameters\n        (e.g. architecture hyperparameters) while treating optimizer\n        hyperparameters as\n        [nuisance parameters](#identifying-scientific-nuisance-and-fixed-hyperparameters).\n    -   It may be preferable to start with a simpler optimizer (e.g. SGD with\n        fixed momentum or Adam with fixed $\\epsilon$, $\\beta_{1}$, and\n        $\\beta_{2}$) in the initial stages of the project and switch to a more\n        general optimizer later.\n-   Well-established optimizers that we like include (but are not limited to):\n    -   [SGD with momentum](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms)\n        (we like the Nesterov variant)\n    -   [Adam and NAdam](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms),\n        which are more general than SGD with momentum. Note that Adam has 4\n        tunable hyperparameters\n        [and they can all matter](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.05446)!\n        -   See\n            [How should Adam's hyperparameters be tuned?](#how-should-adams-hyperparameters-be-tuned)\n\n### Choosing the batch size\n\n***Summary:*** *The batch size governs the training speed and shouldn't be used\nto directly tune the validation set performance. 
Often, the ideal batch size\nwill be the largest batch size supported by the available hardware.*\n\n-   The batch size is a key factor in determining the *training time* and\n    *computing resource consumption*.\n-   Increasing the batch size will often reduce the training time. This can be\n    highly beneficial because it, e.g.:\n    -   Allows hyperparameters to be tuned more thoroughly within a fixed time\n        interval, potentially resulting in a better final model.\n    -   Reduces the latency of the development cycle, allowing new ideas to be\n        tested more frequently.\n-   Increasing the batch size may either decrease, increase, or not change the\n    resource consumption.\n-   The batch size should *not be* treated as a tunable hyperparameter for\n    validation set performance.\n    -   As long as all hyperparameters are well-tuned (especially the learning\n        rate and regularization hyperparameters) and the number of training\n        steps is sufficient, the same final performance should be attainable\n        using any batch size (see\n        [Shallue et al. 2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)).\n    -   Please see [Why shouldn't the batch size be tuned to directly improve\n        validation set\n        performance?](#why-shouldnt-the-batch-size-be-tuned-to-directly-improve-validation-set-performance)\n\n#### Determining the feasible batch sizes and estimating training throughput\n\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   For a given model and optimizer, there will typically be a range of batch\n    sizes supported by the available hardware. 
The limiting factor is usually\n    accelerator memory.\n-   Unfortunately, it can be difficult to calculate which batch sizes will fit\n    in memory without running, or at least compiling, the full training program.\n-   The easiest solution is usually to run training jobs at different batch\n    sizes (e.g. increasing powers of 2) for a small number of steps until one of\n    the jobs exceeds the available memory.\n-   For each batch size, we should train for long enough to get a reliable\n    estimate of the *training throughput*\n\n\u003Cp align=\"center\">training throughput = (# examples processed per second)\u003C\u002Fp>\n\n\u003Cp align=\"center\">or, equivalently, the \u003Cem>time per step\u003C\u002Fem>.\u003C\u002Fp>\n\n\u003Cp align=\"center\">time per step = (batch size) \u002F (training throughput)\u003C\u002Fp>\n\n-   When the accelerators aren't yet saturated, if the batch size doubles, the\n    training throughput should also double (or at least nearly double).\n    Equivalently, the time per step should be constant (or at least nearly\n    constant) as the batch size increases.\n-   If this is not the case then the training pipeline has a bottleneck such as\n    I\u002FO or synchronization between compute nodes. This may be worth diagnosing\n    and correcting before proceeding.\n-   If the training throughput increases only up to some maximum batch size,\n    then we should only consider batch sizes up to that maximum batch size, even\n    if a larger batch size is supported by the hardware.\n    -   All benefits of using a larger batch size assume the training throughput\n        increases. If it doesn't, fix the bottleneck or use the smaller batch\n        size.\n    -   **Gradient accumulation** simulates a larger batch size than the\n        hardware can support and therefore does not provide any throughput\n        benefits. 
It should generally be avoided in applied work.\n-   These steps may need to be repeated every time the model or optimizer is\n    changed (e.g. a different model architecture may allow a larger batch size\n    to fit in memory).\n\n\u003C\u002Fdetails>\n\n#### Choosing the batch size to minimize training time\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n\u003Cp align=\"center\">Training time = (time per step) x (total number of steps)\u003C\u002Fp>\n\n-   We can often consider the time per step to be approximately constant for all\n    feasible batch sizes. This is true when there is no overhead from parallel\n    computations and all training bottlenecks have been diagnosed and corrected\n    (see the\n    [previous section](#determining-the-feasible-batch-sizes-and-estimating-training-throughput)\n    for how to identify training bottlenecks). In practice, there is usually at\n    least some overhead from increasing the batch size.\n-   As the batch size increases, the total number of steps needed to reach a\n    fixed performance goal typically decreases (provided all relevant\n    hyperparameters are re-tuned when the batch size is changed;\n    [Shallue et al. 2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)).\n    -   E.g. doubling the batch size might halve the total number of steps\n        required. 
This is called **perfect scaling**.\n    -   Perfect scaling holds for all batch sizes up to a critical batch size,\n        beyond which one achieves diminishing returns.\n    -   Eventually, increasing the batch size no longer reduces the number of\n        training steps (but never increases it).\n-   Therefore, the batch size that minimizes training time is usually the\n    largest batch size that still provides a reduction in the number of training\n    steps required.\n    -   This batch size depends on the dataset, model, and optimizer, and it is\n        an open problem how to calculate it other than finding it experimentally\n        for every new problem. 🤖\n    -   When comparing batch sizes, beware the distinction between an example\n        budget\u002F[epoch](https:\u002F\u002Fdevelopers.google.com\u002Fmachine-learning\u002Fglossary#epoch)\n        budget (running all experiments while fixing the number of training\n        example presentations) and a step budget (running all experiments with\n        the number of training steps fixed).\n        -   Comparing batch sizes with an epoch budget only probes the perfect\n            scaling regime, even when larger batch sizes might still provide a\n            meaningful speedup by reducing the number of training steps\n            required.\n    -   Often, the largest batch size supported by the available hardware will\n        be smaller than the critical batch size. Therefore, a good rule of thumb\n        (without running any experiments) is to use the largest batch size\n        possible.\n-   There is no point in using a larger batch size if it ends up increasing the\n    training time.\n\n\u003C\u002Fdetails>\n\n#### Choosing the batch size to minimize resource consumption\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   There are two types of resource costs associated with increasing the batch\n    size:\n    1.  
*Upfront costs*, e.g. purchasing new hardware or rewriting the training\n        pipeline to implement multi-GPU \u002F multi-TPU training.\n    2.  *Usage costs*, e.g. billing against the team's resource budgets, billing\n        from a cloud provider, electricity \u002F maintenance costs.\n-   If there are significant upfront costs to increasing the batch size, it\n    might be better to defer increasing the batch size until the project has\n    matured and it is easier to assess the cost-benefit tradeoff. Implementing\n    multi-host parallel training programs can introduce\n    [bugs](#considerations-for-multi-host-pipelines) and\n    [subtle issues](#batch-normalization-implementation-details) so it is\n    probably better to start off with a simpler pipeline anyway. (On the other\n    hand, a large speedup in training time might be very beneficial early in the\n    process when a lot of tuning experiments are needed).\n-   We refer to the total usage cost (which may include multiple different kinds\n    of costs) as the \"resource consumption\". We can break down the resource\n    consumption into the following components:\n\n\u003Cp align=\"center\">Resource consumption = (resource consumption per step) x (total number of steps)\u003C\u002Fp>\n\n-   Increasing the batch size usually allows us to\n    [reduce the total number of steps](#choosing-the-batch-size-to-minimize-training-time).\n    Whether the resource consumption increases or decreases will depend on how\n    the consumption per step changes.\n    -   Increasing the batch size might *decrease* the resource consumption. 
For\n        example, if each step with the larger batch size can be run on the same\n        hardware as the smaller batch size (with only a small increase in time\n        per step), then any increase in the resource consumption per step might\n        be outweighed by the decrease in the number of steps.\n    -   Increasing the batch size might *not change* the resource consumption.\n        For example, if doubling the batch size halves the number of steps\n        required and doubles the number of GPUs used, the total consumption (in\n        terms of GPU-hours) will not change.\n    -   Increasing the batch size might *increase* the resource consumption. For\n        example, if increasing the batch size requires upgraded hardware, the\n        increase in consumption per step might outweigh the reduction in the\n        number of steps.\n\n\u003C\u002Fdetails>\n\n#### Changing the batch size requires re-tuning most hyperparameters\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   The optimal values of most hyperparameters are sensitive to the batch size.\n    Therefore, changing the batch size typically requires starting the tuning\n    process all over again.\n-   The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.\n-   Keep this in mind when choosing the batch size at the start of a project. 
If\n    you need to switch to a different batch size later on, it might be\n    difficult, time-consuming, and expensive to re-tune everything for the new\n    batch size.\n\n\u003C\u002Fdetails>\n\n#### How batch norm interacts with the batch size\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   Batch norm is complicated and, in general, should compute its statistics\n    using a different batch size than the one used for the gradient\n    computation. See the\n    [batch norm section](#batch-normalization-implementation-details) for a\n    detailed discussion.\n\n\u003C\u002Fdetails>\n\n### Choosing the initial configuration\n\n-   Before beginning hyperparameter tuning, we must determine the starting point.\n    This includes specifying (1) the model configuration (e.g. number of\n    layers), (2) the optimizer hyperparameters (e.g. learning rate), and (3) the\n    number of training steps.\n-   Determining this initial configuration will require some manually configured\n    training runs and trial-and-error.\n-   Our guiding principle is to find a simple, relatively fast, relatively\n    low-resource-consumption configuration that obtains a \"reasonable\" result.\n    -   \"Simple\" means avoiding bells and whistles wherever possible; these can\n        always be added later. 
Even if bells and whistles prove helpful down the\n        road, adding them in the initial configuration risks wasting time tuning\n        unhelpful features and\u002For baking in unnecessary complications.\n        -   For example, start with a constant learning rate before adding fancy\n            decay schedules.\n    -   Choosing an initial configuration that is fast and consumes minimal\n        resources will make hyperparameter tuning much more efficient.\n        -   For example, start with a smaller model.\n    -   \"Reasonable\" performance depends on the problem, but at minimum means\n        that the trained model performs much better than random chance on the\n        validation set (although it might be bad enough to not be worth\n        deploying).\n-   Choosing the number of training steps involves balancing the following\n    tension:\n    -   On the one hand, training for more steps can improve performance and\n        makes hyperparameter tuning easier (see\n        [Shallue et al. 2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)).\n    -   On the other hand, training for fewer steps means that each training run\n        is faster and uses fewer resources, boosting tuning efficiency by\n        reducing the time between cycles and allowing more experiments to be run\n        in parallel. Moreover, if an unnecessarily large step budget is chosen\n        initially, it might be hard to change it down the road, e.g. once the\n        learning rate schedule is tuned for that number of steps.\n\n## A scientific approach to improving model performance\n\nFor the purposes of this document, the ultimate goal of machine learning\ndevelopment is to maximize the utility of the deployed model. Even though many\naspects of the development process differ between applications (e.g. 
length of\ntime, available computing resources, type of model), we can typically use the\nsame basic steps and principles on any problem.\n\nOur guidance below makes the following assumptions:\n\n-   There is already a fully-running training pipeline along with a\n    configuration that obtains a reasonable result.\n-   There are enough computational resources available to conduct meaningful\n    tuning experiments and run at least several training jobs in parallel.\n\n### The incremental tuning strategy\n\n***Summary:*** *Start with a simple configuration and incrementally make\nimprovements while building up insight into the problem. Make sure that any\nimprovement is based on strong evidence to avoid adding unnecessary complexity.*\n\n-   Our ultimate goal is to find a configuration that maximizes the performance\n    of our model.\n    -   In some cases, our goal will be to maximize how much we can improve the\n        model by a fixed deadline (e.g. submitting to a competition).\n    -   In other cases, we want to keep improving the model indefinitely (e.g.\n        continually improving a model used in production).\n-   In principle, we could maximize performance by using an algorithm to\n    automatically search the entire space of possible configurations, but this\n    is not a practical option.\n    -   The space of possible configurations is extremely large and there are\n        not yet any algorithms sophisticated enough to efficiently search this\n        space without human guidance.\n-   Most automated search algorithms rely on a hand-designed *search space* that\n    defines the set of configurations to search in, and these search spaces can\n    matter quite a bit.\n-   The most effective way to maximize performance is to start with a simple\n    configuration and incrementally add features and make improvements while\n    building up insight into the problem.\n    -   We use automated search algorithms in each round of tuning and\n        
continually update our search spaces as our understanding grows.\n-   As we explore, we will naturally find better and better configurations and\n    therefore our \"best\" model will continually improve.\n    -   We call it a *launch* when we update our best configuration (which may\n        or may not correspond to an actual launch of a production model).\n    -   For each launch, we must make sure that the change is based on strong\n        evidence – not just random chance based on a lucky configuration – so\n        that we don't add unnecessary complexity to the training pipeline.\n\nAt a high level, our incremental tuning strategy involves repeating the\nfollowing four steps:\n\n1.  Identify an appropriately-scoped goal for the next round of experiments.\n2.  Design and run a set of experiments that makes progress towards this goal.\n3.  Learn what we can from the results.\n4.  Consider whether to launch the new best configuration.\n\nThe remainder of this section will consider this strategy in much greater\ndetail.\n\n### Exploration vs exploitation\n\n***Summary:*** *Most of the time, our primary goal is to gain insight into the\nproblem.*\n\n-   Although one might think we would spend most of our time trying to maximize\n    performance on the validation set, in practice we spend the majority of our\n    time trying to gain insight into the problem, and comparatively little time\n    greedily focused on the validation error.\n    -   In other words, we spend most of our time on \"exploration\" and only a\n        small amount on \"exploitation\".\n-   In the long run, understanding the problem is critical if we want to\n    maximize our final performance. 
Prioritizing insight over short term gains\n    can help us:\n    -   Avoid launching unnecessary changes that happened to be present in\n        well-performing runs merely through historical accident.\n    -   Identify which hyperparameters the validation error is most sensitive\n        to, which hyperparameters interact the most and therefore need to be\n        re-tuned together, and which hyperparameters are relatively insensitive\n        to other changes and can therefore be fixed in future experiments.\n    -   Suggest potential new features to try, such as new regularizers if\n        overfitting is an issue.\n    -   Identify features that don't help and therefore can be removed, reducing\n        the complexity of future experiments.\n    -   Recognize when improvements from hyperparameter tuning have likely\n        saturated.\n    -   Narrow our search spaces around the optimal value to improve tuning\n        efficiency.\n-   When we are eventually ready to be greedy, we can focus purely on the\n    validation error even if the experiments aren't maximally informative about\n    the structure of the tuning problem.\n\n### Choosing the goal for the next round of experiments\n\n***Summary:*** *Each round of experiments should have a clear goal and be\nsufficiently narrow in scope that the experiments can actually make progress\ntowards the goal.*\n\n-   Each round of experiments should have a clear goal and be sufficiently\n    narrow in scope that the experiments can actually make progress towards the\n    goal: if we try to add multiple features or answer multiple questions at\n    once, we may not be able to disentangle the separate effects on the results.\n-   Example goals include:\n    -   Try a potential improvement to the pipeline (e.g. a new regularizer,\n        preprocessing choice, etc.).\n    -   Understand the impact of a particular model hyperparameter (e.g. 
the\n        activation function).\n    -   Greedily minimize validation error.\n\n### Designing the next round of experiments\n\n***Summary:*** *Identify which hyperparameters are scientific, nuisance, and\nfixed hyperparameters for the experimental goal. Create a sequence of studies to\ncompare different values of the scientific hyperparameters while optimizing over\nthe nuisance hyperparameters. Choose the search space of nuisance\nhyperparameters to balance resource costs with scientific value.*\n\n#### Identifying scientific, nuisance, and fixed hyperparameters\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   For a given goal, all hyperparameters will be either **scientific\n    hyperparameters**, **nuisance hyperparameters**, or **fixed\n    hyperparameters**.\n    -   Scientific hyperparameters are those whose effect on the model's\n        performance we're trying to measure.\n    -   Nuisance hyperparameters are those that need to be optimized over in\n        order to fairly compare different values of the scientific\n        hyperparameters. This is similar to the statistical concept of\n        [nuisance parameters](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNuisance_parameter).\n    -   Fixed hyperparameters will have their values fixed in the current round\n        of experiments. These are hyperparameters whose values do not need to\n        (or we do not want them to) change when comparing different values of\n        the scientific hyperparameters.\n        -   By fixing certain hyperparameters for a set of experiments, we must\n            accept that conclusions derived from the experiments might not be\n            valid for other settings of the fixed hyperparameters. 
In other\n            words, fixed hyperparameters create caveats for any conclusions we\n            draw from the experiments.\n-   For example, if our goal is to \"determine whether a model with more hidden\n    layers will reduce validation error\", then the number of hidden layers is a\n    scientific hyperparameter.\n    -   The learning rate is a nuisance hyperparameter because we can only\n        fairly compare models with different numbers of hidden layers if the\n        learning rate is tuned separately for each number of layers (the optimal\n        learning rate generally depends on the model architecture).\n    -   The activation function could be a fixed hyperparameter if we have\n        determined in prior experiments that the best choice of activation\n        function is not sensitive to model depth, or if we are willing to limit\n        our conclusions about the number of hidden layers to only cover this\n        specific choice of activation function. Alternatively, it could be a\n        nuisance parameter if we are prepared to tune it separately for each\n        number of hidden layers.\n-   Whether a particular hyperparameter is a scientific hyperparameter, nuisance\n    hyperparameter, or fixed hyperparameter is not inherent to that\n    hyperparameter, but changes depending on the experimental goal.\n    -   For example, the choice of activation function could be a scientific\n        hyperparameter (is ReLU or tanh a better choice for our problem?), a\n        nuisance hyperparameter (is the best 5-layer model better than the best\n        6-layer model when we allow several different possible activation\n        functions?), or a fixed hyperparameter (for ReLU nets, does adding batch\n        normalization in a particular position help?).\n-   When designing a new round of experiments, we first identify the scientific\n    hyperparameters for our experimental goal.\n    -   At this stage, we consider all other hyperparameters to be 
nuisance\n        hyperparameters.\n-   Next, we convert some of the nuisance hyperparameters into fixed\n    hyperparameters.\n    -   With limitless resources, we would leave all non-scientific\n        hyperparameters as nuisance hyperparameters so that the conclusions we\n        draw from our experiments are free from caveats about fixed\n        hyperparameter values.\n    -   However, the more nuisance hyperparameters we attempt to tune, the\n        greater the risk we fail to tune them sufficiently well for each setting\n        of the scientific hyperparameters and end up reaching the wrong\n        conclusions from our experiments.\n        -   As described\n            [below](#striking-a-balance-between-informative-and-affordable-experiments),\n            we could counter this risk by increasing the computational budget,\n            but often our maximum resource budget is less than would be needed\n            to tune over all non-scientific hyperparameters.\n    -   We choose to convert a nuisance hyperparameter into a fixed\n        hyperparameter when, in our judgment, the caveats introduced by fixing\n        it are less burdensome than the cost of including it as a nuisance\n        hyperparameter.\n        -   The more a given nuisance hyperparameter interacts with the\n            scientific hyperparameters, the more damaging it is to fix its\n            value. For example, the best value of the weight decay strength\n            typically depends on the model size, so comparing different model\n            sizes assuming a single specific value of the weight decay would not\n            be very insightful.\n-   Although the type we assign to each hyperparameter depends on the\n    experimental goal, we have the following rules of thumb for certain\n    categories of hyperparameters:\n    -   Of the various optimizer hyperparameters (e.g. 
the learning rate,\n        momentum, learning rate schedule parameters, Adam betas etc.), at least\n        some of them will be nuisance hyperparameters because they tend to\n        interact the most with other changes.\n        -   They are rarely scientific hyperparameters because a goal like \"what\n            is the best learning rate for the current pipeline?\" doesn't give\n            much insight – the best setting could easily change with the next\n            pipeline change anyway.\n        -   Although we might fix some of them occasionally due to resource\n            constraints or when we have particularly strong evidence that they\n            don't interact with the scientific parameters, we should generally\n            assume that optimizer hyperparameters must be tuned separately to\n            make fair comparisons between different settings of the scientific\n            hyperparameters, and thus shouldn't be fixed.\n            -   Furthermore, we have no *a priori* reason to prefer one\n                optimizer hyperparameter value over another (e.g. they don't\n                usually affect the computational cost of forward passes or\n                gradients in any way).\n    -   In contrast, the *choice* of optimizer is typically a scientific\n        hyperparameter or fixed hyperparameter.\n        -   It is a scientific hyperparameter if our experimental goal involves\n            making fair comparisons between two or more different optimizers\n            (e.g. 
\"determine which optimizer produces the lowest validation\n            error in a given number of steps\").\n        -   Alternatively, we might make it a fixed hyperparameter for a variety\n            of reasons, including (1) prior experiments make us believe that the\n            best optimizer for our problem is not sensitive to current\n            scientific hyperparameters; and\u002For (2) we prefer to compare values\n            of the scientific hyperparameters using this optimizer because its\n            training curves are easier to reason about; and\u002For (3) we prefer to\n            use this optimizer because it uses less memory than the\n            alternatives.\n    -   Hyperparameters introduced by a regularization technique are typically\n        nuisance hyperparameters, but whether or not we include the\n        regularization technique at all is a scientific or fixed hyperparameter.\n        -   For example, dropout adds code complexity, so when deciding whether\n            to include it we would make \"no dropout\" vs \"dropout\" a scientific\n            hyperparameter and the dropout rate a nuisance hyperparameter.\n            -   If we decide to add dropout to our pipeline based on this\n                experiment, then the dropout rate would be a nuisance\n                hyperparameter in future experiments.\n    -   Architectural hyperparameters are often scientific or fixed\n        hyperparameters because architecture changes can affect serving and\n        training costs, latency, and memory requirements.\n        -   For example, the number of layers is typically a scientific or fixed\n            hyperparameter since it tends to have dramatic consequences for\n            training speed and memory usage.\n-   In some cases, the sets of nuisance and fixed hyperparameters will depend on\n    the values of the scientific hyperparameters.\n    -   For example, suppose we are trying to determine which optimizer out of\n        
Nesterov momentum and Adam results in the lowest validation error. The\n        scientific hyperparameter is the `optimizer`, which takes values\n        `{\"Nesterov_momentum\", \"Adam\"}`. The value\n        `optimizer=\"Nesterov_momentum\"` introduces the nuisance\u002Ffixed\n        hyperparameters `{learning_rate, momentum}`, but the value\n        `optimizer=\"Adam\"` introduces the nuisance\u002Ffixed hyperparameters\n        `{learning_rate, beta1, beta2, epsilon}`.\n    -   Hyperparameters that are only present for certain values of the\n        scientific hyperparameters are called **conditional hyperparameters**.\n    -   We should not assume two conditional hyperparameters are the same just\n        because they have the same name! In the above example, the conditional\n        hyperparameter called `learning_rate` is a *different* hyperparameter\n        for `optimizer=\"Nesterov_momentum\"` versus `optimizer=\"Adam\"`. Its role\n        is similar (although not identical) in the two algorithms, but the range\n        of values that work well in each of the optimizers is typically\n        different by several orders of magnitude.\n\n\u003C\u002Fdetails>\n\n#### Creating a set of studies\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   Once we have identified the scientific and nuisance hyperparameters, we\n    design a \"study\" or sequence of studies to make progress towards the\n    experimental goal.\n    -   A study specifies a set of hyperparameter configurations to be run for\n        subsequent analysis. 
Each configuration is called a \"trial\".\n    -   Creating a study typically involves choosing the hyperparameters that\n        will vary across trials, choosing what values those hyperparameters can\n        take on (the \"search space\"), choosing the number of trials, and\n        choosing an automated search algorithm to sample that many trials from\n        the search space. Alternatively, we could create a study by specifying\n        the set of hyperparameter configurations manually.\n-   The purpose of the studies is to run the pipeline with different values of\n    the scientific hyperparameters, while at the same time **\"optimizing away\"**\n    (or \"optimizing over\") the nuisance hyperparameters so that comparisons\n    between different values of the scientific hyperparameters are as fair as\n    possible.\n-   In the simplest case, we would make a separate study for each configuration\n    of the scientific parameters, where each study tunes over the nuisance\n    hyperparameters.\n    -   For example, if our goal is to select the best optimizer out of Nesterov\n        momentum and Adam, we could create one study in which\n        `optimizer=\"Nesterov_momentum\"` and the nuisance hyperparameters are\n        `{learning_rate, momentum}`, and another study in which\n        `optimizer=\"Adam\"` and the nuisance hyperparameters are `{learning_rate,\n        beta1, beta2, epsilon}`. 
We would compare the two optimizers by\n        selecting the best performing trial from each study.\n    -   We can use any gradient-free optimization algorithm, including methods\n        such as Bayesian optimization or evolutionary algorithms, to optimize\n        over the nuisance hyperparameters, although\n        [we prefer](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)\n        to use quasi-random search in the\n        [exploration phase](#exploration-vs-exploitation) of tuning because of a\n        variety of advantages it has in this setting.\n        [After exploration concludes](#after-exploration-concludes), if\n        state-of-the-art Bayesian optimization software is available, that is\n        our preferred choice.\n-   In the more complicated case where we want to compare a large number of\n    values of the scientific hyperparameters and it is impractical to make that\n    many independent studies, we can include the scientific parameters in the\n    same search space as the nuisance hyperparameters and use a search algorithm\n    to sample values of *both* the scientific and nuisance hyperparameters in a\n    single study.\n    -   When taking this approach, conditional hyperparameters can cause\n        problems since it is hard to specify a search space unless the set of\n        nuisance hyperparameters is the same for all values of the scientific\n        hyperparameters.\n    -   In this case,\n        [our preference](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)\n        for using quasi-random search over fancier black-box optimization tools\n        is even stronger, since it ensures that we obtain a relatively uniform\n        sampling of values of the scientific hyperparameters. 
Regardless of the\n        search algorithm, we need to make sure somehow that it searches the\n        scientific parameters uniformly.\n\n\u003C\u002Fdetails>\n\n#### Striking a balance between informative and affordable experiments\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   When designing a study or sequence of studies, we need to allocate a limited\n    budget in order to adequately achieve the following three desiderata:\n    1.  Comparing enough different values of the scientific hyperparameters.\n    2.  Tuning the nuisance hyperparameters over a large enough search space.\n    3.  Sampling the search space of nuisance hyperparameters densely enough.\n-   The better we can achieve these three desiderata, the more insight we can\n    extract from our experiment.\n    -   Comparing as many values of the scientific hyperparameters as possible\n        broadens the scope of the insights we gain from the experiment.\n    -   Including as many nuisance hyperparameters as possible and allowing each\n        nuisance hyperparameter to vary over as wide a range as possible\n        increases our confidence that a \"good\" value of the nuisance\n        hyperparameters **exists** in the search space for each configuration of\n        the scientific hyperparameters.\n        -   Otherwise, we might make unfair comparisons between values of the\n            scientific hyperparameters by not searching possible regions of the\n            nuisance parameter space where better values might lie for some\n            values of the scientific parameters.\n    -   Sampling the search space of nuisance hyperparameters as densely as\n        possible increases our confidence that any good settings for the\n        nuisance hyperparameters that happen to exist in our search space will\n        be found by the search procedure.\n        -   Otherwise, we might make unfair comparisons between values of the\n    
        scientific parameters due to some values getting luckier with the\n            sampling of the nuisance hyperparameters.\n-   Unfortunately, improvements in *any* of these three dimensions require\n    either increasing the number of trials, and therefore increasing the\n    resource cost, or finding a way to save resources in one of the other\n    dimensions.\n    -   Every problem has its own idiosyncrasies and computational constraints,\n        so how to allocate resources across these three desiderata requires some\n        level of domain knowledge.\n    -   After running a study, we always try to get a sense of whether the study\n        tuned the nuisance hyperparameters well enough (i.e. searched a large\n        enough space extensively enough) to fairly compare the scientific\n        hyperparameters (as described in greater detail\n        [below](#extracting-insight-from-experimental-results)).\n\n\u003C\u002Fdetails>\n\n### Extracting insight from experimental results\n\n***Summary:*** *In addition to trying to achieve the original scientific goal of\neach group of experiments, go through a checklist of additional questions and,\nif issues are discovered, revise the experiments and rerun them.*\n\n-   Ultimately, each group of experiments has a specific goal and we want to\n    evaluate the evidence the experiments provide toward that goal.\n    -   However, if we ask the right questions, we will often find issues that\n        need to be corrected before a given set of experiments can make much\n        progress towards their original goal.\n        -   If we don’t ask these questions, we may draw incorrect conclusions.\n    -   Since running experiments can be expensive, we also want to take the\n        opportunity to extract other useful insights from each group of\n        experiments, even if these insights are not immediately relevant to the\n        current goal.\n-   Before analyzing a given set of experiments to make progress toward 
their\n    original goal, we should ask ourselves the following additional questions:\n    -   [Is the search space large enough?](#identifying-bad-search-space-boundaries)\n        -   If the optimal point from a study is near the boundary of the search\n            space in one or more dimensions, the search is probably not wide\n            enough. In this case, we should run another study with an expanded\n            search space.\n    -   [Have we sampled enough points from the search space?](#not-sampling-enough-points-in-the-search-space)\n        -   If not, run more points or be less ambitious in the tuning goals.\n    -   What fraction of the trials in each study are **infeasible** (i.e.\n        trials that diverge, get really bad loss values, or fail to run at all\n        because they violate some implicit constraint)?\n        -   When a very large fraction of points in a study are **infeasible**\n            we should try to adjust the search space to avoid sampling such\n            points, which sometimes requires reparameterizing the search space.\n        -   In some cases, a large number of infeasible points can indicate a\n            bug in the training code.\n    -   [Does the model exhibit optimization issues?](#how-can-optimization-failures-be-debugged-and-mitigated)\n    -   [What can we learn from the training curves of the best trials?](#examining-the-training-curves)\n        -   For example, do the best trials have training curves consistent with\n            problematic overfitting?\n-   If necessary, based on the answers to the questions above, refine the most\n    recent study (or group of studies) to improve the search space and\u002For sample\n    more trials, or take some other corrective action.\n-   Once we have answered the above questions, we can move on to evaluating the\n    evidence the experiments provide towards our original goal (for example,\n    [evaluating whether a change is 
useful](#detecting-whether-a-change-is-useful-with-isolation-plots)).\n\n#### Identifying bad search space boundaries\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   A search space is suspicious if the best point sampled from it is close to\n    its boundary. We might find an even better point if we expanded the search\n    range in that direction.\n-   To check search space boundaries, we like to plot completed trials on what\n    we call **basic hyperparameter axis plots** where we plot the validation\n    objective value versus one of the hyperparameters (e.g. learning rate). Each\n    point on the plot corresponds to a single trial.\n    -   The validation objective value for each trial should usually be the best\n        value it achieved over the course of training.\n\n\u003Cp align=\"center\" id=\"figure-1\">\n    \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fmain\u002Fassets\u002Fbad_search_space.png\" width=\"49%\" alt=\"Example of bad search space boundaries\">\n\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fmain\u002Fassets\u002Fgood_search_space.png\" width=\"49%\" alt=\"Example of good search space boundaries\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 1:\u003C\u002Fb> Examples of bad search space boundaries and acceptable search space boundaries.\u003C\u002Fp>\n\n-   The plots in [Figure 1](#figure-1) show the error rate (lower is better)\n    against the initial learning rate.\n-   If the best points cluster towards the edge of a search space (in some\n    dimension), then the search space boundaries might need to be expanded until\n    the best observed point is no longer close to the boundary.\n-   Often, a study will include \"infeasible\" trials that diverge or get very bad\n    results (marked with red Xs in the above plots).\n    -   If all 
trials are infeasible for learning rates greater than some\n        threshold value, and if the best performing trials have learning rates\n        at the edge of that region, the model [may suffer from stability issues\n        preventing it from accessing higher learning\n        rates](#how-can-optimization-failures-be-debugged-and-mitigated).\n\n\u003C\u002Fdetails>\n\n#### Not sampling enough points in the search space\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   In general,\n    [it can be very difficult to know](#how-many-trials-are-needed-to-get-good-results-with-quasi-random-search)\n    if the search space has been sampled densely enough. 🤖\n-   Running more trials is of course better, but comes at an obvious cost.\n-   Since it is so hard to know when we have sampled enough, we usually sample\n    what we can afford and try to calibrate our intuitive confidence from\n    repeatedly looking at various hyperparameter axis plots and trying to get a\n    sense of how many points are in the \"good\" region of the search space.\n\n\u003C\u002Fdetails>\n\n#### Examining the training curves\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n***Summary:*** *Examining the training curves is an easy way to identify common\nfailure modes and can help us prioritize what actions to take next.*\n\n-   Although in many cases the primary objective of our experiments only\n    requires considering the validation error of each trial, we must be careful\n    when reducing each trial to a single number because it can hide important\n    details about what’s going on below the surface.\n-   For every study, we always look at the **training curves** (training error\n    and validation error plotted versus training step over the duration of\n    training) of at least the best few trials.\n-   Even if this is not necessary for addressing the primary 
experimental\n    objective, examining the training curves is an easy way to identify common\n    failure modes and can help us prioritize what actions to take next.\n-   When examining the training curves, we are interested in the following\n    questions.\n-   Are any of the trials exhibiting **problematic overfitting?**\n    -   Problematic overfitting occurs when the validation error starts\n        *increasing* at some point during training.\n    -   In experimental settings where we optimize away nuisance hyperparameters\n        by selecting the \"best\" trial for each setting of the scientific\n        hyperparameters, we should check for problematic overfitting in *at\n        least* each of the best trials corresponding to the settings of the\n        scientific hyperparameters that we’re comparing.\n        -   If any of the best trials exhibits problematic overfitting, we\n            usually want to re-run the experiment with additional regularization\n            techniques and\u002For better tune the existing regularization parameters\n            before comparing the values of the scientific hyperparameters.\n            -   This may not apply if the scientific hyperparameters include\n                regularization parameters, since then it would not be surprising\n                if low-strength settings of those regularization parameters\n                resulted in problematic overfitting.\n        -   Reducing overfitting is often straightforward using common\n            regularization techniques that add minimal code complexity or extra\n            computation (e.g. 
dropout, label smoothing, weight decay), so it’s\n            usually no big deal to add one or more of these to the next round of\n            experiments.\n        -   For example, if the scientific hyperparameter is \"number of hidden\n            layers\" and the best trial that uses the largest number of hidden\n            layers exhibited problematic overfitting, then we would usually\n            prefer to try it again with additional regularization instead of\n            immediately selecting the smaller number of hidden layers.\n        -   Even if none of the \"best\" trials are exhibiting problematic\n            overfitting, there might still be a problem if it occurs in *any* of\n            the trials.\n            -   Selecting the best trial suppresses configurations exhibiting\n                problematic overfitting and favors those that do not. In other\n                words, it will favor configurations with more regularization.\n            -   However, anything that makes training worse can act as a\n                regularizer, even if it wasn't intended that way. 
For example,\n                choosing a smaller learning rate can regularize training by\n                hobbling the optimization process, but we typically don't want\n                to choose the learning rate this way.\n            -   So we must be aware that the \"best\" trial for each setting of\n                the scientific hyperparameters might be selected in such a way\n                that favors \"bad\" values of some of the scientific or nuisance\n                hyperparameters.\n-   Is there high step-to-step variance in the training or validation error late\n    in training?\n    -   If so, this could interfere with our ability to compare different values\n        of the scientific hyperparameters (since each trial randomly ends on a\n        \"lucky\" or \"unlucky\" step) and our ability to reproduce the result of\n        the best trial in production (since the production model might not end\n        on the same \"lucky\" step as in the study).\n    -   The most likely causes of step-to-step variance are batch variance (from\n        randomly sampling examples from the training set for each batch), small\n        validation sets, and using a learning rate that’s too high late in\n        training.\n    -   Possible remedies include increasing the batch size, obtaining more\n        validation data, using learning rate decay, or using Polyak averaging.\n-   Are the trials still improving at the end of training?\n    -   If so, this indicates that we are in the\n        [\"compute bound\" regime](#determining-the-number-of-steps-for-each-training-run)\n        and we may benefit from\n        [increasing the number of training steps](#Deciding-how-long-to-train-when-training-is-compute-bound)\n        or changing the learning rate schedule.\n-   Has performance on the training and validation sets saturated long before\n    the final training step?\n    -   If so, this indicates that we are in the\n        [\"not 
compute-bound\"](#determining-the-number-of-steps-for-each-training-run)\n        regime and that we may be able to\n        [decrease the number of training steps](#deciding-how-long-to-train-when-training-is-not-compute-bound).\n-   Although we cannot enumerate them all, there are many other additional\n    behaviors that can become evident from examining the training curves (e.g.\n    training loss *increasing* during training usually indicates a bug in the\n    training pipeline).\n\n\u003C\u002Fdetails>\n\n#### Detecting whether a change is useful with isolation plots\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n\u003Cp align=\"center\" id=\"figure-2\">\n\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fmain\u002Fassets\u002Fisolation_plot.png\" width=\"49%\" alt=\"Isolation plot that investigates the best value of weight decay for ResNet-50\ntrained on ImageNet.\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 2:\u003C\u002Fb> Isolation plot that investigates the best value of weight decay for ResNet-50 trained on ImageNet.\u003C\u002Fp>\n\n-   Often, the goal of a set of experiments is to compare different values of a\n    scientific hyperparameter.\n    -   For example, we may want to determine the value of weight decay that\n        results in the best validation error.\n-   An **isolation plot** is a special case of the basic hyperparameter axis\n    plot. 
Each point on an isolation plot corresponds to the performance of the\n    *best* trial across some (or all) of the nuisance hyperparameters.\n    -   In other words, we plot the model performance after \"optimizing away\"\n        the nuisance hyperparameters.\n-   An isolation plot makes it easier to perform an apples-to-apples comparison\n    between different values of the scientific hyperparameter.\n-   For example, [Figure 2](#figure-2) reveals the value of weight decay that\n    produces the best validation performance for a particular configuration of\n    ResNet-50 trained on ImageNet.\n    -   If our goal is to determine whether to include weight decay at all, then\n        we would compare the best point from this plot against the baseline of\n        no weight decay. For a fair comparison, the baseline should also have\n        its learning rate equally well tuned.\n-   When we have data generated by (quasi)random search and are considering a\n    continuous hyperparameter for an isolation plot, we can approximate the\n    isolation plot by bucketing the x-axis values of the basic hyperparameter\n    axis plot and taking the best trial in each vertical slice defined by the\n    buckets.\n\n\u003C\u002Fdetails>\n\n#### Automate generically useful plots\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   The more effort it is to generate plots, the less likely we are to look at\n    them as much as we should, so it behooves us to set up our infrastructure to\n    automatically produce as many of them as possible.\n-   At a minimum, we automatically generate basic hyperparameter axis plots for\n    all hyperparameters that we vary in an experiment.\n-   Additionally, we automatically produce training curves for all trials and\n    make it as easy as possible to find the best few trials of each study and\n    examine their training curves.\n-   There are many other potential plots and 
visualizations we can add that can\n    be useful. Although the ones described above are a good starting point, to\n    paraphrase Geoffrey Hinton, \"Every time you plot something new, you learn\n    something new.\"\n\n\u003C\u002Fdetails>\n\n### Determining whether to adopt a training pipeline change or hyperparameter configuration\n\n***Summary:*** *When deciding whether to make a change to our model or training\nprocedure or adopt a new hyperparameter configuration going forward, we need to\nbe aware of the different sources of variation in our results.*\n\n-   When we are trying to improve our model, we might observe that a particular\n    candidate change initially achieves a better validation error compared to\n    our incumbent configuration, but find that after repeating the experiment\n    there is no consistent advantage. Informally, we can group the most\n    important sources of variation that might cause such an inconsistent result\n    into the following broad categories:\n    -   **Training procedure variance**, **retrain variance**, or **trial\n        variance**: the variation we see between training runs that use the same\n        hyperparameters, but different random seeds.\n        -   For example, different random initializations, training data\n            shuffles, dropout masks, patterns of data augmentation operations,\n            and orderings of parallel arithmetic operations are all potential\n            sources of trial variance.\n    -   **Hyperparameter search variance**, or **study variance**: the variation\n        in results caused by our procedure to select the hyperparameters.\n        -   For example, we might run the same experiment with a particular\n            search space, but with two different seeds for quasi-random search\n            and end up selecting different hyperparameter values.\n    -   **Data collection and sampling variance**: the variance from any sort of\n        random split into training, validation, 
and test data or variance due to\n        the training data generation process more generally.\n-   It is all well and good to make comparisons of validation error rates\n    estimated on a finite validation set using fastidious statistical tests, but\n    often the trial variance alone can produce statistically significant\n    differences between two different trained models that use the same\n    hyperparameter settings.\n-   We are most concerned about study variance when trying to draw conclusions\n    that go beyond the level of an individual point in hyperparameter space.\n    -   The study variance depends on the number of trials and the search space\n        and we have seen cases where it is larger than the trial variance as\n        well as cases where it is much smaller.\n-   Therefore, before adopting a candidate change, consider running the best\n    trial N times to characterize the run-to-run trial variance.\n    -   Usually, we can get away with only recharacterizing the trial variance\n        after major changes to the pipeline, but in some applications we might\n        need fresher estimates.\n    -   In other applications, characterizing the trial variance is too costly\n        to be worth it.\n-   At the end of the day, although we only want to adopt changes (including new\n    hyperparameter configurations) that produce real improvements, demanding\n    complete certainty that something helps isn't the right answer either.\n-   Therefore, if a new hyperparameter point (or other change) gets a better\n    result than the baseline (taking into account the retrain variance of both\n    the new point and the baseline as best we can), then we probably should\n    adopt it as the new baseline for future comparisons.\n    -   However, we should only adopt changes that produce improvements that\n        outweigh any complexity they add.\n\n### After exploration concludes\n\n***Summary:*** *Bayesian optimization tools are a compelling option once 
we’re\ndone exploring for good search spaces and have decided what hyperparameters even\nshould be tuned at all.*\n\n-   At some point, our priorities will shift from learning more about the tuning\n    problem to producing a single best configuration to launch or otherwise use.\n-   At this point, there should be a refined search space that comfortably\n    contains the local region around the best observed trial and has been\n    adequately sampled.\n-   Our exploration work should have revealed the most essential hyperparameters\n    to tune (as well as sensible ranges for them) that we can use to construct a\n    search space for a final automated tuning study using as large a tuning\n    budget as possible.\n-   Since we no longer care about maximizing our insight into the tuning\n    problem, many of\n    [the advantages of quasi-random search](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)\n    no longer apply and Bayesian optimization tools should be used to\n    automatically find the best hyperparameter configuration.\n    -   [Open-Source Vizier](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fvizier) impleme","该项目旨在为工程师和研究人员提供一套系统化的方法来最大化深度学习模型的性能。其核心功能包括选择模型架构、优化器及批量大小等关键配置，并采用科学的方法逐步改进模型表现，如增量调优策略与实验设计原则。此外，还提供了关于训练管道优化、性能评估、检查点保存等方面的实用指南。适合于希望提高自身深度学习项目效果的专业人士或团队使用，在面对复杂问题时能够帮助他们更高效地探索并确定最佳实践方案。",2,"2026-06-11 03:42:58","high_star"]