Hyperparameter Optimization (HPO) Support

An overview of the HPO techniques supported by the software.

Deep learning algorithms require a significant amount of intuition and guesswork to design solutions to new problems. Hyperparameter optimization is a way to remove this guesswork from many of the design decisions that go into a model.

Traditional HPO techniques can sweep or optimize over any model structure parameter, such as the number of neural network layers, the size of those layers, activation functions, and many others. Additionally, training parameters, such as learning rate, weight decay, and dropout, can be optimized by both traditional HPO and population based training (PBT).

HPO typically executes iteratively over generations. These generations are populated by unique sets of hyperparameters to be evaluated. The evaluation metric is generally the trained network's final loss value, its accuracy, or some other figure of merit to be optimized. When each member of the generation has been evaluated, a new generation is populated by the underlying algorithm. This can happen in a brute force way, which is best suited to smaller models, or in an optimized way that targets larger models, where a complete search of the possible hyperparameters would simply be too resource intensive.
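The generational loop described above can be sketched in plain Python. This is an illustrative toy, not the crayai API: the evaluate function stands in for an actual training run, and all names and the quadratic objective are invented for the example.

```python
import random

def evaluate(hparams):
    # Toy stand-in for training a model and returning a loss to minimize.
    # The "ideal" hyperparameters here are lr=0.1 and decay=0.01.
    return (hparams["lr"] - 0.1) ** 2 + (hparams["decay"] - 0.01) ** 2

def random_member():
    # One candidate set of hyperparameters.
    return {"lr": random.uniform(1e-4, 1.0), "decay": random.uniform(0.0, 0.1)}

def hpo_loop(generations=10, population=8):
    best, best_loss = None, float("inf")
    for _ in range(generations):
        # Populate a generation with unique candidate hyperparameter sets.
        members = [random_member() for _ in range(population)]
        for m in members:
            loss = evaluate(m)  # the "figure of merit" for this member
            if loss < best_loss:
                best, best_loss = m, loss
    return best, best_loss
```

Here each new generation is drawn at random (the brute force case); an optimized algorithm would instead use the figures of merit from prior generations to decide which candidates to populate next.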

About the crayai Module

As part of this release of Urika-XCS, Cray is providing early access to its HPO package through the crayai Python module. This module is activated as part of the typical analytics packages and can be run independently of the Urika-XCS run_training and start_analytics commands.

crayai exposes two options for distributing hyperparameter evaluation. For those running Slurm as a workload manager, crayai can interface directly with Slurm running natively on the login nodes. If the training process depends on the packages supported by the analytics image, such as TensorFlow or PyTorch, Urika-XCS can also be used as the launcher. In this case, all three workload managers supported by Urika-XCS are supported. Additionally, all evaluations will run within Urika-XCS Shifter containers, which retain a consistent development environment while allowing customizations through Anaconda Python.

Supported HPO Techniques

The following HPO techniques are supported on Urika-XCS:
  • Genetic - The Cray genetic HPO algorithm is an optimization technique. It relies on a genetic algorithm to learn ideal sets of hyperparameters based on prior evaluations. The final loss or accuracy value returned to the crayai HPO submodule is treated as a "Figure of Merit", which is used to judge the quality of those hyperparameters. Based on this judgment, poor-performing hyperparameters are pruned, while successful hyperparameters are "mutated" (augmented by a small factor), and "crossover" is applied, combining the hyperparameters of two individuals to create a new individual.
  • Random - The random HPO algorithm is a simple hyperparameter search technique that relies on brute force and random chance to find better combinations of hyperparameters.
  • Grid - The grid HPO algorithm is another simple hyperparameter search technique that relies on a more methodical sweep of a defined search space. An N-dimensional matrix is defined based on the hyperparameters provided and each element is evaluated.
  • Population Based Training (PBT) - Population based training is a specific application of an HPO algorithm with the intent of intelligently learning a schedule for training parameters, such as learning rate and weight decay. Typically, these schedules are set like other hyperparameters, using intuition and trial and error.

    PBT allows these values to vary from update to update among some number of candidates. At the end of a training window these candidates are evaluated and, using the genetic approach above, the lowest-performing candidates are dropped while the best-performing ones are mutated and augmented. Unlike traditional HPO, after evaluation the training continues from a model checkpoint with a fresh batch of candidate parameters. By setting this window properly, parameters such as learning rate can vary based on the current training environment. Due to the dependence on checkpointing and restoring models, the model structure must remain static during a training process.

    With Cray's distributed HPO framework and sufficient hardware resources, population based training can distribute the evaluation of candidate training parameters, leading to a total PBT training time similar to that of a single training process, but in most cases with significantly better results than would be achieved by setting heuristic training parameter schedules.
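The PBT exploit/explore cycle described above can be illustrated with a minimal, self-contained sketch. This is not the crayai API: the one-parameter "model", the objective, and every name below are invented for the example, and real PBT would checkpoint and restore an actual network instead of a single weight.

```python
import random

def train_step(w, lr):
    # One step of gradient descent on the toy objective f(w) = w**2
    # (gradient 2*w); this stands in for one real training update.
    return w - lr * 2 * w

def pbt(candidates=4, windows=5, steps_per_window=10):
    # Each candidate carries a model checkpoint (w) plus a training
    # parameter (lr) that is allowed to change between windows.
    pop = [{"w": 1.0, "lr": random.uniform(0.01, 0.4)} for _ in range(candidates)]
    for _ in range(windows):
        # Train every candidate for one window from its checkpoint.
        for c in pop:
            for _ in range(steps_per_window):
                c["w"] = train_step(c["w"], c["lr"])
        # Evaluate: a lower |w| is a better figure of merit here.
        pop.sort(key=lambda c: abs(c["w"]))
        # Exploit/explore: the worst half restores the best half's
        # checkpoints and continues with mutated training parameters.
        half = candidates // 2
        for i in range(half, candidates):
            src = pop[i - half]
            pop[i] = {"w": src["w"], "lr": src["lr"] * random.choice([0.8, 1.2])}
    return min(abs(c["w"]) for c in pop)
```

Note that only the training parameter (lr) is mutated between windows; the model structure itself never changes, matching the requirement that the model remain static so checkpoints can be restored.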