Deep Learning with Big Data on GPUs and in Parallel

Training deep networks is computationally intensive; however, neural networks are inherently parallel algorithms. You can usually accelerate training of convolutional neural networks by distributing training in parallel across multicore CPUs, high-performance GPUs, and clusters with multiple CPUs and GPUs. Using GPU or parallel options requires Parallel Computing Toolbox™.


GPU support is automatic if you have Parallel Computing Toolbox. By default, the trainNetwork function uses a GPU if available.
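As a quick check before training, the following minimal sketch reports whether a GPU is visible to MATLAB. It assumes Parallel Computing Toolbox is installed; the variable names and printed messages are illustrative only.

```matlab
% Check whether MATLAB can see a GPU (requires Parallel Computing Toolbox).
hasGPU = gpuDeviceCount > 0;

if hasGPU
    gpu = gpuDevice;  % selects and returns the default GPU device
    fprintf('GPU found: %s (%.1f GB total memory)\n', ...
        gpu.Name, gpu.TotalMemory/1e9);
else
    disp('No GPU found; trainNetwork falls back to the CPU by default.');
end
```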

If you have access to a machine with multiple GPUs, set the 'ExecutionEnvironment' training option to 'multi-gpu'.
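A minimal sketch of the multi-GPU setting might look like this. The 'sgdm' solver is an assumption chosen for illustration, and imds and layers in the commented call are hypothetical variables that you would define yourself.

```matlab
% Request training on all available local GPUs.
% The solver choice ('sgdm') is an illustrative assumption.
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','multi-gpu');

% Hypothetical usage, assuming imds (image datastore) and layers exist:
% net = trainNetwork(imds, layers, options);
```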

You do not need multiple computers to solve problems using data sets too large to fit in memory. You can use the augmentedImageDatastore function to work with batches of data without needing a cluster of machines. For an example, see Train Network with Augmented Images. However, if you have a cluster available, it can be helpful to take your code to the data repository rather than moving large amounts of data around.
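For instance, the sketch below wraps an image collection in an augmentedImageDatastore so that images are read in batches rather than all at once. It assumes the digit sample images that ship with Deep Learning Toolbox are at the path shown; substitute your own image folder and output size.

```matlab
% Location of sample digit images (assumed to be installed with
% Deep Learning Toolbox); replace with your own image folder.
imageFolder = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
    'nndatasets','DigitDataset');

% The image datastore only indexes files; it does not load them all.
imds = imageDatastore(imageFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

% Read images in batches, resized to 28-by-28 pixels.
augimds = augmentedImageDatastore([28 28], imds);
```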

Deep Learning Hardware and Memory Considerations

Data too large to fit in memory
  • Recommendations: To import data from image collections that are too large to fit in memory, use the augmentedImageDatastore function. This function is designed to read batches of images for faster processing in machine learning and computer vision applications.
  • Required products: Deep Learning Toolbox™

CPU
  • Recommendations: If you do not have a suitable GPU, then you can train on a CPU instead. By default, the trainNetwork function uses the CPU if no GPU is available.
  • Required products: Deep Learning Toolbox

GPU
  • Recommendations: By default, the trainNetwork function uses a GPU if available. Requires a supported GPU device. For information on supported devices, see GPU Support by Release (Parallel Computing Toolbox). Check your GPU using gpuDevice. Specify the execution environment using the trainingOptions function.
  • Required products: Deep Learning Toolbox, Parallel Computing Toolbox

Parallel on your local machine using multiple GPUs or CPU cores
  • Recommendations: Take advantage of multiple workers by specifying the execution environment with the trainingOptions function. If you have more than one GPU on your machine, specify 'multi-gpu'. Otherwise, specify 'parallel'.
  • Required products: Deep Learning Toolbox, Parallel Computing Toolbox

Parallel on a cluster or in the cloud
  • Recommendations: Scale up to use workers on clusters or in the cloud to accelerate your deep learning computations. Use trainingOptions and specify 'parallel' to use a compute cluster. For more information, see Deep Learning in the Cloud.
  • Required products: Deep Learning Toolbox, Parallel Computing Toolbox, MATLAB Parallel Server™

When you train a network using the trainNetwork function, or when you use prediction or validation functions with DAGNetwork and SeriesNetwork objects, the software performs these computations using single-precision, floating-point arithmetic. Functions for training, prediction, and validation include trainNetwork, predict, classify, and activations. The software uses single-precision arithmetic when you train networks using both CPUs and GPUs.

Because single-precision and double-precision performance of GPUs can differ substantially, it is important to know in which precision computations are performed. If you only use a GPU for deep learning, then single-precision performance is one of the most important characteristics of a GPU. If you also use a GPU for other computations using Parallel Computing Toolbox, then high double-precision performance is important. This is because many functions in MATLAB use double-precision arithmetic by default. For more information, see Improve Performance Using Single Precision Calculations (Parallel Computing Toolbox).
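The default precision is easy to confirm at the command line. This short sketch uses only core MATLAB functions and illustrative variable names.

```matlab
% MATLAB creates double-precision arrays by default.
A = rand(3);
classA = class(A);   % 'double'

% Convert explicitly when you want single-precision arithmetic.
B = single(A);
classB = class(B);   % 'single'
```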

Training with Multiple GPUs

MATLAB supports training a single network using multiple GPUs in parallel. This can be achieved using multiple GPUs on your local machine, or on a cluster or cloud with workers with GPUs. To speed up training using multiple GPUs, try increasing the mini-batch size and learning rate.

Convolutional neural networks are typically trained iteratively using batches of images. This is done because the whole dataset is too large to fit into GPU memory. For optimum performance, you can experiment with the MiniBatchSize option that you specify with the trainingOptions function.

The optimal batch size depends on your exact network, dataset, and GPU hardware. When training with multiple GPUs, each image batch is distributed between the GPUs. This effectively increases the total GPU memory available, allowing larger batch sizes. Because a larger batch size improves the significance of each batch, you can also increase the learning rate. A good general guideline is to increase the learning rate proportionally to the increase in batch size. Depending on your application, a larger batch size and learning rate can speed up training without a decrease in accuracy, up to some limit.
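The proportional guideline above can be sketched as follows. The baseline mini-batch size, baseline learning rate, GPU count, and 'sgdm' solver are all hypothetical values chosen for illustration; treat the result as a starting point for experiments rather than a tuned setting.

```matlab
% Hypothetical single-GPU baseline settings.
baseMiniBatchSize = 128;
baseLearnRate     = 0.01;
numGPUs           = 4;     % assumed number of GPUs

% Scale the mini-batch size with the number of GPUs, then scale the
% learning rate by the same factor.
miniBatchSize = baseMiniBatchSize * numGPUs;                        % 512
learnRate     = baseLearnRate * (miniBatchSize/baseMiniBatchSize);  % 0.04

options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','multi-gpu', ...
    'MiniBatchSize',miniBatchSize, ...
    'InitialLearnRate',learnRate);
```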

Using multiple GPUs can speed up training significantly. To decide whether multi-GPU training is likely to deliver a performance gain, consider the following factors:

  • How long is the iteration on each GPU? If each GPU iteration is short, then the added overhead of communication between GPUs can dominate. Try increasing the computation per iteration by using a larger batch size.

  • Are all the GPUs on a single machine? Communication between GPUs on different machines introduces a significant communication delay. You can mitigate this if you have suitable hardware. For more information, see Advanced Support for Fast Multi-Node GPU Communication.

To learn more, see Scale Up Deep Learning in Parallel and in the Cloud and Select Particular GPUs to Use for Training.

Deep Learning in the Cloud

If you do not have a suitable GPU available for faster training of a convolutional neural network, you can try your deep learning applications with multiple high-performance GPUs in the cloud, such as on Amazon® Elastic Compute Cloud (Amazon EC2®). MATLAB Deep Learning Toolbox provides examples that show you how to perform deep learning in the cloud using Amazon EC2 with P2 or P3 machine instances and data stored in the cloud.

You can accelerate training by using multiple GPUs on a single machine or in a cluster of machines with multiple GPUs. Train a single network using multiple GPUs, or train multiple models at once on the same data.

For more information on the complete cloud workflow, see Deep Learning in Parallel and in the Cloud.

Fetch and Preprocess Data in Background

When training a network in parallel, you can fetch and preprocess data in the background. To perform data dispatch in the background, enable background dispatch in the mini-batch datastore used by trainNetwork. You can use a built-in mini-batch datastore, such as augmentedImageDatastore, denoisingImageDatastore (Image Processing Toolbox), or pixelLabelImageDatastore (Computer Vision Toolbox). You can also use a custom mini-batch datastore with background dispatch enabled. For more information on creating custom mini-batch datastores, see Develop Custom Mini-Batch Datastore.

To enable background dispatch, set the DispatchInBackground property of the datastore to true.
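As a minimal sketch, background dispatch can be enabled when the datastore is created. This assumes Parallel Computing Toolbox is available and that the digit sample images shipped with Deep Learning Toolbox are at the path shown; the folder and image size are placeholders for your own data.

```matlab
% Sample image folder (an assumption; replace with your own data).
imageFolder = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
    'nndatasets','DigitDataset');
imds = imageDatastore(imageFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

% Enable background dispatch through the DispatchInBackground
% name-value argument of the built-in mini-batch datastore.
augimds = augmentedImageDatastore([28 28], imds, ...
    'DispatchInBackground',true);
```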

You can fine-tune the training computation and data dispatch loads between workers by specifying the 'WorkerLoad' name-value pair argument of trainingOptions. For advanced options, you can try modifying the number of workers of the parallel pool. For more information, see Specify Your Parallel Preferences (Parallel Computing Toolbox).
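One way to split the load, sketched here under the assumption of a four-worker parallel pool, is to pass 'WorkerLoad' as a vector with one entry per worker. A load of 0 leaves that worker free to fetch data in the background when the datastore has background dispatch enabled. The pool size, load values, and 'sgdm' solver are illustrative.

```matlab
% Assumes a parallel pool with four workers, for example one opened
% beforehand with parpool(4). The vector needs one entry per worker.
workerLoad = [1 1 1 0];   % three workers compute, one fetches data

options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','parallel', ...
    'WorkerLoad',workerLoad);
```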
