# trainingOptions

Options for training a neural network

## Syntax

options = trainingOptions(solverName)
options = trainingOptions(solverName,Name,Value)

## Description

options = trainingOptions(solverName) returns a set of training options for the solver specified by solverName.

options = trainingOptions(solverName,Name,Value) returns a set of training options with additional options specified by one or more Name,Value pair arguments.

## Examples

Create a set of options for training a network using stochastic gradient descent with momentum. Reduce the learning rate by a factor of 0.2 every 5 epochs. Set the maximum number of epochs for training at 20, and use a mini-batch with 300 observations at each iteration. Specify a path for saving checkpoint networks after every epoch.

options = trainingOptions('sgdm',...
'LearnRateSchedule','piecewise',...
'LearnRateDropFactor',0.2,...
'LearnRateDropPeriod',5,...
'MaxEpochs',20,...
'MiniBatchSize',300,...
'CheckpointPath','C:\TEMP\checkpoint');

Plot the training accuracy at each iteration of the training process.

Load the sample digit training data.

[XTrain,YTrain] = digitTrain4DArrayData;

Construct a simple network to classify the digit image data.

layers = [ ...
imageInputLayer([28 28 1],'Normalization','none')
convolution2dLayer(6,20)
reluLayer
maxPooling2dLayer(2,'Stride',2)
fullyConnectedLayer(10)
softmaxLayer
classificationLayer];

Save the function plotTrainingAccuracy, which plots training accuracy against the current iteration, on the MATLAB® path. plotTrainingAccuracy is defined at the end of this example.

Specify the training options. Set 'OutputFcn' to be the plotTrainingAccuracy function. For quick training, set 'MaxEpochs' to 5 and 'InitialLearnRate' to 0.1. Train the network using trainNetwork.

options = trainingOptions('sgdm','Verbose',false, ...
'MaxEpochs',5, ...
'InitialLearnRate',0.1, ...
'OutputFcn',@plotTrainingAccuracy);

net = trainNetwork(XTrain,YTrain,layers,options);

Use the custom function plotTrainingAccuracy to plot info.TrainingAccuracy against info.Iteration at each function call.

function plotTrainingAccuracy(info)

persistent plotObj

if info.State == "start"
plotObj = animatedline;
xlabel("Iteration")
ylabel("Training Accuracy")
elseif info.State == "iteration"
addpoints(plotObj,info.Iteration,info.TrainingAccuracy)
drawnow limitrate nocallbacks
end

end

Plot the training accuracy at each iteration, and if the mean accuracy of the previous 50 iterations reaches 95%, then stop training early.

Load the sample digit training data.

[XTrain,YTrain] = digitTrain4DArrayData;

Construct a simple network to classify the digit image data.

layers = [ ...
imageInputLayer([28 28 1],'Normalization','none')
convolution2dLayer(6,20)
reluLayer
maxPooling2dLayer(2,'Stride',2)
fullyConnectedLayer(10)
softmaxLayer
classificationLayer];

Save the custom output functions plotTrainingAccuracy and stopTrainingAtThreshold on the MATLAB® path. plotTrainingAccuracy plots training progress, and if the mean accuracy of the previous 50 iterations reaches 95%, then stopTrainingAtThreshold stops training early. These functions are defined at the end of this example.

Specify the custom output functions as a cell array of function handles. Set the output functions to plotTrainingAccuracy and to stopTrainingAtThreshold with a threshold of 95%.

functions = { ...
@plotTrainingAccuracy, ...
@(info) stopTrainingAtThreshold(info,95)};

Specify the training options. Set 'OutputFcn' to be the cell array of function handles functions. Train the network using trainNetwork.

options = trainingOptions('sgdm','Verbose',false, ...
'InitialLearnRate',0.1, ...
'OutputFcn',functions);

net = trainNetwork(XTrain,YTrain,layers,options);

Update the plot at each iteration using plotTrainingAccuracy and stopTrainingAtThreshold. Use the custom function plotTrainingAccuracy to plot info.TrainingAccuracy against info.Iteration. Use stopTrainingAtThreshold(info,thr) to stop training if the mean accuracy of the previous 50 iterations is greater than thr.

function plotTrainingAccuracy(info)

persistent plotObj

if info.State == "start"
plotObj = animatedline;
xlabel("Iteration")
ylabel("Training Accuracy")
elseif info.State == "iteration"
addpoints(plotObj,info.Iteration,info.TrainingAccuracy)
drawnow limitrate nocallbacks
end

end

function stop = stopTrainingAtThreshold(info,thr)

stop = false;
if info.State ~= "iteration"
return
end

persistent iterationAccuracy

% Append accuracy for this iteration
iterationAccuracy = [iterationAccuracy info.TrainingAccuracy];

% Evaluate mean of iteration accuracy and remove oldest entry
if numel(iterationAccuracy) == 50
stop = mean(iterationAccuracy) > thr;

iterationAccuracy(1) = [];
end

end

## Input Arguments

Solver to use for training the network. You must specify 'sgdm' (stochastic gradient descent with momentum).

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'InitialLearnRate',0.03,'L2Regularization',0.0005,'LearnRateSchedule','piecewise' specifies the initial learning rate as 0.03 and the L2 regularization factor as 0.0005, and instructs the software to drop the learning rate every given number of epochs by multiplying by a set factor.

Path for saving the checkpoint networks, specified as the comma-separated pair consisting of 'CheckpointPath' and a character vector.

• If you do not specify a path (i.e., ''), then the software does not save any checkpoint networks.

• If you specify a path, then trainNetwork saves checkpoint networks to this path after every epoch. It automatically and uniquely names each network. You can then load any of these networks and resume training from that network.

The directory you specify must already exist; create it before specifying the path for saving the checkpoint networks. If the path you specify does not exist, then trainingOptions returns an error.

Example: 'CheckpointPath','C:\Temp\checkpoint'

Data Types: char

Hardware resource for trainNetwork to train the network, specified as the comma-separated pair consisting of 'ExecutionEnvironment' and one of the following:

• 'auto' — Use a GPU if one is available; otherwise, use the CPU.

• 'cpu' — Use the CPU.

• 'gpu' — Use the GPU.

• 'multi-gpu' — Use multiple GPUs on one machine, using a local parallel pool. If no pool is already open, trainNetwork opens one with one worker per supported GPU device.

• 'parallel' — Use a local parallel pool or compute cluster. If no pool is already open, trainNetwork opens one using the default cluster profile. If the pool has access to GPUs, then trainNetwork uses them and excess workers are left idle. If the pool does not have GPUs, then the training takes place on all cluster CPUs.

The 'gpu', 'multi-gpu', and 'parallel' options require Parallel Computing Toolbox™. Additionally, to use a GPU, you must have a CUDA®-enabled NVIDIA® GPU with compute capability 3.0 or higher. If you choose one of these options and Parallel Computing Toolbox or a suitable GPU is not available, then trainNetwork returns an error.

To see an improvement in performance when training in parallel, you might need to increase MiniBatchSize to offset the communication overhead.

Example: 'ExecutionEnvironment','cpu'

Data Types: char

Initial learning rate used for training, specified as the comma-separated pair consisting of 'InitialLearnRate' and a positive scalar value. If the learning rate is too low, the training takes a long time, but if it is too high the training might reach a suboptimal result.

Example: 'InitialLearnRate',0.03

Data Types: single | double

Option for dropping the learning rate during training, specified as the comma-separated pair consisting of 'LearnRateSchedule' and one of the following:

• 'none' — The learning rate remains constant throughout training.

• 'piecewise' — The software updates the learning rate every certain number of epochs by multiplying with a factor. Use the LearnRateDropFactor name-value pair argument to specify the value of this factor. Use the LearnRateDropPeriod name-value pair argument to specify the number of epochs between multiplications.

Example: 'LearnRateSchedule','piecewise'

Factor for dropping the learning rate, specified as the comma-separated pair consisting of 'LearnRateDropFactor' and a scalar value. This option is valid only when the value of LearnRateSchedule is 'piecewise'.

LearnRateDropFactor is a multiplicative factor to apply to the learning rate every time a certain number of epochs has passed. You can specify the number of epochs using the LearnRateDropPeriod name-value pair argument.

Example: 'LearnRateDropFactor',0.02

Data Types: single | double

Number of epochs for dropping the learning rate, specified as the comma-separated pair consisting of 'LearnRateDropPeriod' and an integer value. This option is valid only when the value of LearnRateSchedule is 'piecewise'.

The software multiplies the global learning rate with the drop factor every time this number of epochs passes. The drop factor is specified by the LearnRateDropFactor name-value pair argument.

Example: 'LearnRateDropPeriod',3

Data Types: single | double
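As a worked sketch of the piecewise schedule, the effective learning rate for each epoch can be computed from the initial rate, the drop factor, and the drop period. The specific values below are assumptions for illustration, not defaults.

```matlab
% Effective learning rate per epoch under a piecewise schedule.
initialLearnRate = 0.1;   % assumed initial learning rate
dropFactor = 0.2;         % assumed LearnRateDropFactor
dropPeriod = 5;           % assumed LearnRateDropPeriod
maxEpochs = 20;

epochs = 1:maxEpochs;
% The rate is multiplied by the drop factor once every dropPeriod epochs.
rates = initialLearnRate * dropFactor .^ floor((epochs - 1) / dropPeriod);
% Epochs 1-5 use 0.1, epochs 6-10 use 0.02, epochs 11-15 use 0.004, ...
```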

Factor for L2 regularizer (weight decay), specified as the comma-separated pair consisting of 'L2Regularization' and a positive scalar value.

You can specify a multiplier for this L2 regularizer when creating the convolutional layer and fully connected layer.

Example: 'L2Regularization',0.0005

Data Types: single | double

Maximum number of epochs to use for training, specified as the comma-separated pair consisting of 'MaxEpochs' and an integer value.

An iteration is one step taken in the gradient descent algorithm towards minimizing the loss function using a mini-batch. An epoch is a full pass of the training algorithm over the entire training set.

Example: 'MaxEpochs',20

Data Types: single | double
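The relationship between epochs and iterations can be sketched as follows. The training-set size and mini-batch size below are assumptions for illustration, and the sketch assumes that observations which do not fill a final mini-batch are not used in that epoch.

```matlab
numObservations = 5000;   % assumed training-set size
miniBatchSize = 300;      % assumed MiniBatchSize
maxEpochs = 20;           % assumed MaxEpochs

% Each epoch consists of one iteration per full mini-batch.
iterationsPerEpoch = floor(numObservations / miniBatchSize);  % 16
totalIterations = iterationsPerEpoch * maxEpochs;             % 320
```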

Size of the mini-batch to use for each training iteration, specified as the comma-separated pair consisting of 'MiniBatchSize' and an integer value. A mini-batch is a subset of the training set that is used to evaluate the gradient of the loss function and update the weights. See Stochastic Gradient Descent with Momentum.

Example: 'MiniBatchSize',256

Data Types: single | double

Contribution of the gradient step from the previous iteration to the current iteration of the training, specified as the comma-separated pair consisting of 'Momentum' and a scalar value from 0 to 1. A value of 0 means no contribution from the previous step, whereas a value of 1 means maximal contribution from the previous step.

Example: 'Momentum',0.8

Data Types: single | double

Option for data shuffling, specified as the comma-separated pair consisting of 'Shuffle' and one of the following:

• 'once' — Shuffle the data once, before training.

• 'never' — Do not shuffle the data.

Example: 'Shuffle','never'

Indicator to display the information about the training progress in the command window, specified as the comma-separated pair consisting of 'Verbose' and either 1 (true) or 0 (false).

The displayed information includes the number of epochs, number of iterations, time elapsed, mini-batch accuracy, and base learning rate. When training a regression network, RMSE is shown instead of accuracy.

Example: 'Verbose',0

Data Types: logical

Frequency of verbose printing, which is the number of iterations between printing to the command window, specified as the comma-separated pair consisting of 'VerboseFrequency' and a positive integer. This option has an effect only when 'Verbose' is set to true.

Data Types: single | double

Relative division of the load between workers (GPUs or CPUs) for the 'ExecutionEnvironment','multi-gpu' or 'ExecutionEnvironment','parallel' options, specified as the comma-separated pair consisting of 'WorkerLoad' and a numeric vector. This vector must contain one value per worker in the parallel pool. For a vector $w$, each worker gets ${w}_{i}/\sum _{i}{w}_{i}$ of the work. Use this option to balance the workload between unevenly performing hardware.

Data Types: double
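The per-worker fraction formula above can be sketched directly. The pool size and relative speeds below are illustrative assumptions.

```matlab
% Assumed pool of four workers with uneven hardware (relative speeds 2:1:1:4).
w = [2 1 1 4];
fractions = w / sum(w);  % each worker i receives w(i)/sum(w) of the work
% fractions = [0.25 0.125 0.125 0.5]
```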

Custom output functions to call during training, specified as a function handle or cell array of function handles. After each iteration, trainNetwork calls the specified functions and passes a struct containing information from the current iteration via the following fields.

| Field | Description |
| --- | --- |
| Epoch | Current epoch number |
| Iteration | Current iteration number |
| TimeSinceStart | Time in seconds since the start of training |
| TrainingLoss | Current mini-batch loss |
| BaseLearnRate | Current base learning rate |
| TrainingAccuracy | Accuracy of the current mini-batch (classification networks) |
| TrainingRMSE | RMSE of the current mini-batch (regression networks) |
| State | Current training state, one of "start", "iteration", or "done" |

You can use custom output functions to display or plot progress information, or to stop training early. For an example showing how to plot training accuracy during training, see Plot Training Accuracy During Network Training. To stop training early, the function must return true. For an example showing how to stop training early, see Plot Progress and Stop Training at Specified Accuracy.

Data Types: function_handle | cell

## Output Arguments

Training options, returned as an object.

For the sgdm training solver, options is a TrainingOptionsSGDM object.

## Algorithms

### Initial Weights and Biases

The default for the initial weights is a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. The default for the initial bias value is 0. You can manually change the initialization for the weights and biases. See Specify Initial Weight and Biases in Convolutional Layer and Specify Initial Weight and Biases in Fully Connected Layer.
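The default initialization described above can be sketched as follows. The filter and channel sizes are illustrative assumptions.

```matlab
% Default initialization sketch: weights drawn from a Gaussian with
% mean 0 and standard deviation 0.01; biases set to 0.
filterSize = [5 5];   % assumed convolutional filter size
numChannels = 1;      % assumed number of input channels
numFilters = 20;      % assumed number of filters

weights = 0.01 * randn([filterSize numChannels numFilters]);
bias = zeros(1,1,numFilters);
```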

### Stochastic Gradient Descent with Momentum

The gradient descent algorithm updates the parameters (weights and biases) so as to minimize the error function by taking small steps in the direction of the negative gradient of the loss function [1]:

${\theta }_{\ell +1}={\theta }_{\ell }-\alpha \nabla E\left({\theta }_{\ell }\right),$

where $\ell$ stands for the iteration number, $\alpha >0$ is the learning rate, $\theta$ is the parameter vector, and $E\left(\theta \right)$ is the loss function. The standard gradient descent algorithm evaluates the gradient of the loss function, $\nabla E\left(\theta \right)$, using the entire training set at once. The stochastic gradient descent algorithm instead evaluates the gradient, and hence updates the parameters, using a subset of the training set. This subset is called a mini-batch.

Each evaluation of the gradient using the mini-batch is an iteration. At each iteration, the algorithm takes one step towards minimizing the loss function. The full pass of the training algorithm over the entire training set using mini-batches is an epoch. You can specify the mini-batch size and the maximum number of epochs using the MiniBatchSize and MaxEpochs name-value pair arguments, respectively.

The gradient descent algorithm might oscillate along the steepest descent path to the optimum. Adding a momentum term to the parameter update is one way to prevent this oscillation [2]. The SGD update with momentum is

${\theta }_{\ell +1}={\theta }_{\ell }-\alpha \nabla E\left({\theta }_{\ell }\right)+\gamma \left({\theta }_{\ell }-{\theta }_{\ell -1}\right),$

where $\gamma$ determines the contribution of the previous gradient step to the current iteration. You can specify this value using the Momentum name-value pair argument.
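The update equation above can be sketched as one MATLAB step. The parameter and gradient values are illustrative placeholders, not output of a real network.

```matlab
% One SGDM parameter update, matching the momentum equation above.
alpha = 0.01;            % learning rate
gamma = 0.9;             % momentum

theta = randn(10,1);     % current parameters (placeholder)
thetaPrev = theta;       % parameters from the previous iteration (placeholder)
gradE = randn(10,1);     % mini-batch gradient of the loss (placeholder)

% theta_{l+1} = theta_l - alpha * grad E(theta_l) + gamma * (theta_l - theta_{l-1})
thetaNext = theta - alpha * gradE + gamma * (theta - thetaPrev);
```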

By default, the software shuffles the data once before training. You can change this setting using the Shuffle name-value pair argument.

### L2 Regularization

Adding a regularization term for the weights to the loss function $E\left(\theta \right)$ is one way to reduce overfitting by penalizing the complexity of the neural network [1], [2]. The regularization term is also called weight decay. The loss function with the regularization term takes the form

${E}_{R}\left(\theta \right)=E\left(\theta \right)+\lambda \Omega \left(w\right),$

where $w$ is the weight vector, $\lambda$ is the regularization factor (coefficient), and the regularization function $\Omega \left(w\right)$ is

$\Omega \left(w\right)=\frac{1}{2}{w}^{T}w.$

Note that the biases are not regularized [2]. You can specify the regularization factor, $\lambda$, using the L2Regularization name-value pair argument.
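The regularized loss above can be computed directly. The loss value and weight vector below are illustrative placeholders.

```matlab
% Regularized loss E_R(theta) = E(theta) + lambda * (1/2) * w' * w.
lambda = 0.0005;         % regularization factor
w = randn(100,1);        % weight vector (placeholder)
E = 1.25;                % unregularized loss (placeholder)

Omega = 0.5 * (w' * w);  % L2 regularization term
ER = E + lambda * Omega;
```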

### Save Checkpoint Networks and Resume Training

trainNetwork enables you to save checkpoint networks as .mat files during training. You can then resume training from any of these checkpoint networks. If you want trainNetwork to save checkpoint networks, then you must specify the name of the path using the CheckpointPath name-value pair argument in the call to trainingOptions. If the path you specify is wrong, then trainingOptions returns an error.

trainNetwork automatically assigns unique names to these checkpoint network files. For example, convnet_checkpoint__351__2016_11_09__12_04_23.mat, where 351 is the iteration number, 2016_11_09 is the date, and 12_04_23 is the time at which trainNetwork saves the network. You can load any of these files by double-clicking it or by typing, for example,

load convnet_checkpoint__351__2016_11_09__12_04_23.mat

in the command line. You can then resume training by using the layers of this network in the call to trainNetwork, for example,

trainNetwork(XTrain,YTrain,net.Layers,options)

You must manually specify the training options and the input data, because the checkpoint network does not contain this information.

## References

[1] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.

[2] Murphy, K. P. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts, 2012.