Port Scripts to Use the Cray Programming Environment Deep Learning Plugin
Provides instructions for porting a TensorFlow training script to use the Cray PE DL plugin.
This procedure requires the Urika-CS software to be installed on a Cray CS system.
To port a TensorFlow training script to use the Cray PE DL plugin, it is recommended to start with a training script that executes serially, i.e., one that does not use distributed TensorFlow. With that script as a base, the following modifications are required to use the plugin:
- Initializing the Cray PE DL plugin, specifying the number of teams, the number of threads, and the model size
- Broadcasting initial model parameter values
- Using the Cray PE DL plugin to communicate gradients after gradient computation and before the model update
- Finalizing the Cray PE DL plugin
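The four required modifications can be sketched as follows. This is an illustrative outline only, not the release's MNIST excerpt: the module name `ml_comm`, the call names (`init`, `broadcast`, `gradients`, `finalize`), and the argument values shown are assumptions matching the steps above, and exact signatures may differ by release. The import is guarded so the sketch is readable outside a Cray environment.

```python
# Sketch of the four required plugin modifications (hypothetical API names;
# consult the plugin documentation and the bundled MNIST example for the
# exact signatures in your release).
try:
    import ml_comm as mc          # Cray PE DL plugin (Urika-CS systems only)
except ImportError:
    mc = None                     # plugin unavailable outside a Cray environment

# The required call order, matching the list above.
REQUIRED_STEPS = [
    "init",       # 1. initialize: teams, threads, model size
    "broadcast",  # 2. broadcast initial model parameter values from one rank
    "gradients",  # 3. communicate gradients after computation, before the update
    "finalize",   # 4. finalize the plugin when training ends
]

if mc is not None:
    # Hypothetical arguments: 1 team, 1 thread per team, model size in bytes.
    mc.init(1, 1, 10 * 1024 * 1024)
    # ... build the model, then before training:
    #     broadcast initial parameter values from rank 0, e.g.
    #     mc.broadcast(trainable_variables, 0)
    # ... each step, after computing gradients and before applying them:
    #     grads = mc.gradients(grads, 0)
    # ... after the training loop:
    mc.finalize()
```

The essential point is the ordering: initialization before any model setup that depends on rank information, the broadcast once after variable initialization, the gradient communication inside every training step, and finalization exactly once at shutdown.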
Modifications that are not required by the Cray PE DL plugin, but are common when parallelizing training, include:
- Correcting the definition of an epoch for the global mini-batch size (all processes)
- Correcting the learning rate:
  - Applying a linear or square-root scaling rule
  - Adding a learning rate decay schedule
- Averaging performance metrics using Cray PE DL plugin helper functions
- Printing output from a single rank only; in addition, either a single rank writes checkpoints or each rank writes to a unique location
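The common (non-required) adjustments above can be sketched in plain Python. The function names, the dataset size, and the batch sizes below are illustrative assumptions, not part of the plugin's API; the rank-0 check stands in for whatever rank query the plugin provides.

```python
import math

def steps_per_epoch(dataset_size, local_batch, n_ranks):
    """Epoch length once the effective mini-batch is global:
    the local batch size times the number of ranks."""
    return dataset_size // (local_batch * n_ranks)

def scaled_lr(base_lr, n_ranks, rule="linear"):
    """Scale the serial learning rate for the global batch,
    using either the linear or the square-root scaling rule."""
    if rule == "linear":
        return base_lr * n_ranks
    return base_lr * math.sqrt(n_ranks)

def decayed_lr(lr, step, decay_steps, decay_rate=0.5):
    """A simple step-wise decay schedule (illustrative only)."""
    return lr * decay_rate ** (step // decay_steps)

# Example: 60,000 training images, local batch of 32, 8 ranks.
print(steps_per_epoch(60000, 32, 8))   # -> 234 steps per (global) epoch
print(scaled_lr(0.01, 8))              # -> 0.08 under the linear rule

# Restrict prints and checkpoints to one rank (rank obtained from the
# plugin's rank query in a real script; hard-coded here for illustration).
rank = 0
if rank == 0:
    print("rank 0 logs metrics and writes checkpoints")
```

Without the epoch correction, each rank would traverse what it believes is a full epoch, so the model would effectively see `n_ranks` epochs of data per nominal epoch; the learning-rate scaling compensates for the larger, less noisy global gradient.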
The following code excerpts (from the MNIST example provided with the release) illustrate the required modifications for using the Cray PE DL plugin.