Send Deep Learning Batch Job to Cluster

This example shows how to send deep learning training batch jobs to a cluster so that you can continue working or close MATLAB® during training.

Training deep neural networks often takes hours or days. To use time efficiently, you can train neural networks as batch jobs and fetch the results from the cluster when they are ready. You can continue working in MATLAB while computations take place, or close MATLAB and obtain the results later using the Job Monitor. You can optionally monitor the jobs during training and, after the jobs are complete, fetch the trained networks and compare their accuracies.

Requirements

Before you can run this example, you need to configure a cluster and upload your data to the cloud. In MATLAB, you can create clusters in the cloud directly from the MATLAB desktop. On the Home tab, in the Parallel menu, select Create and Manage Clusters. In the Cluster Profile Manager, click Create Cloud Cluster. Alternatively, you can use MathWorks Cloud Center to create and access compute clusters. For more information, see Getting Started with Cloud Center. For this example, ensure that your desired cloud cluster is set as the default parallel environment on the MATLAB Home tab, in Parallel > Select Parallel Environment.

After you configure the cluster, upload your data to an Amazon S3 bucket so that you can use it directly from MATLAB. This example uses a copy of the CIFAR-10 data set that is already stored in Amazon S3. For instructions, see Work with Deep Learning Data in AWS.
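Once the data is in Amazon S3, you can reference it directly from MATLAB, for example with a datastore. The following is a minimal sketch only: it assumes your AWS credentials are set as environment variables, and the bucket path is a placeholder for the location of your own copy of the data.

% Sketch only: point an image datastore at a hypothetical S3 location of the
% CIFAR-10 training images. Replace the path with your own bucket.
imdsTrain = imageDatastore("s3://mybucket/cifar10/train", ...
    IncludeSubfolders=true, ...
    LabelSource="foldernames");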

Submit Batch Job

You can send a function or a script as a batch job to the cluster by using the batch (Parallel Computing Toolbox) function. By default, the cluster allocates one worker to execute the contents of the job. If the code in the job benefits from extra workers, for example, because it includes automatic parallel support or a parfor-loop, you can specify more workers by using the Pool name-value argument of the batch function.
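For example, a script job that contains a parfor-loop can request additional workers for a parallel pool. This is a sketch only; mySweepScript is a placeholder for your own script.

% Sketch only: submit a hypothetical script, mySweepScript, that contains a
% parfor-loop, and request three extra workers to form a parallel pool.
j = batch("mySweepScript",Pool=3);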

When you submit a batch job as a script, by default, workspace variables are copied from the client to the workers. To avoid copying workspace variables to the workers, submit batch jobs as functions.

The trainConvNet function is provided as a supporting file with this example. To access the function, open the example as a live script. The function trains a single network using a given mini-batch size and returns the trained network and its accuracy. To perform a parameter sweep across mini-batch sizes, send the function as a batch job to the cluster four times, specifying a different mini-batch size for each job. When sending a function as a batch job, specify the number of outputs of the function and the input arguments.

c = parcluster("MyClusterInTheCloud");
miniBatchSize = [64 128 256 512];
numBatchJobs = numel(miniBatchSize);

% Submit one job per mini-batch size, specifying the two function outputs and
% the input arguments of trainConvNet.
for idx=1:numBatchJobs
    job(idx) = batch(c,"trainConvNet",2,{idx,miniBatchSize(idx)});
end

Training each network in an individual batch job, instead of using a single batch job that trains all of the networks in parallel, avoids the overhead required to start a parallel pool on the cluster. Using individual jobs also allows you to use the Job Monitor to observe the progress of each network computation individually.

You can submit additional jobs to the cluster. If the cluster is not available because it is running other jobs, any new job you submit remains queued until the cluster becomes available.
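To check whether a job is still queued or already running, you can query its State property.

% Query the state of the first job. The State property returns values such as
% "queued", "running", or "finished".
job(1).State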

Monitor Training Progress

You can see the current status of your jobs in the cluster by checking the Job Monitor. In the Environment section on the Home tab, select Parallel > Monitor Jobs to open the Job Monitor.

You can optionally monitor the progress of training in detail by sending data from the workers running the batch jobs to the MATLAB client. In the trainConvNet function, the output function sendTrainingProgress is called after each iteration to add the current iteration and training accuracy to a ValueStore (Parallel Computing Toolbox). A ValueStore stores data owned by a specific job, and each data entry consists of a key and a corresponding value.

function stop = sendTrainingProgress(info,idx)

    if info.State == "iteration" && ~isempty(info.TrainingAccuracy)
        % Get the ValueStore object of the current job.
        store = getCurrentValueStore;

        % Store the training results in the job ValueStore object, using the job
        % index idx (passed in from trainConvNet) as a unique key.
        key = idx;
        store(key) = struct(iteration=info.Iteration,accuracy=info.TrainingAccuracy);
    end
    stop = false;

end
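Inside trainConvNet, this output function is attached through the training options. The following sketch shows one way the relevant options might be set up; it assumes a trainnet-style workflow with an accuracy metric, and the network and data definitions are omitted.

% Sketch only, assuming trainConvNet(idx,miniBatchSize) creates its training
% options along these lines. The anonymous function captures the job index idx
% so that sendTrainingProgress can use it as the ValueStore key.
options = trainingOptions("sgdm", ...
    MiniBatchSize=miniBatchSize, ...
    Metrics="accuracy", ...
    Plots="none", ...
    OutputFcn=@(info) sendTrainingProgress(info,idx));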

Create a figure for displaying the training accuracy of the networks and, for each job submitted:

  • Create a subplot to display the accuracy of the network being trained.

  • Get the ValueStore object of the job.

  • Specify a callback function to execute each time the job adds or updates an entry in the ValueStore. The callback function updatePlot is provided at the end of this example and plots the current training accuracy of a network.

figure
for i=1:numBatchJobs
    % Create a subplot and an animated line for the network trained by job i.
    subplot(2,2,i)
    xlabel("Iteration");
    ylabel("Accuracy (%)");
    ylim([0 100])
    lines(i) = animatedline;

    % Get the ValueStore object of the job and plot each entry as it arrives.
    store{i} = job(i).ValueStore;
    store{i}.KeyUpdatedFcn = @(store,key) updatePlot(lines(i),store(key).iteration,store(key).accuracy);
end

Fetch Results Programmatically

After submitting jobs to the cluster, you can continue working in MATLAB while computations take place. If the rest of your code depends on completion of a job, block MATLAB by using the wait command. In this case, wait for the first job to finish.

wait(job(1))
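If the other jobs might still be running, you can wait for each of them in the same way before fetching all of the outputs.

% Optionally wait for every job in the parameter sweep to finish.
for idx=1:numBatchJobs
    wait(job(idx))
end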

After the jobs finish, fetch the results by using the fetchOutputs function. In this case, fetch the trained networks and their accuracies.

for idx=1:numBatchJobs
    results{idx} = fetchOutputs(job(idx));
end
results{:}
ans=1×2 cell array
    {1×1 dlnetwork}    {[0.6866]}

ans=1×2 cell array
    {1×1 dlnetwork}    {[0.5964]}

ans=1×2 cell array
    {1×1 dlnetwork}    {[0.6542]}

ans=1×2 cell array
    {1×1 dlnetwork}    {[0.6230]}

If you close MATLAB, you can still recover the jobs in the cluster to fetch the results either while the computation is taking place or after the computation is complete. Before closing MATLAB, make a note of the job ID and then retrieve the job later by using the findJob function.
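For example, you can display the IDs of the submitted jobs before you close MATLAB.

% Display the ID of each submitted job so that you can note the IDs down.
[job.ID]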

To retrieve a job, first create a cluster object for your cluster by using the parcluster function. Then, provide the job ID to findJob. In this case, the job ID is 3.

c = parcluster("MyClusterInTheCloud");
job = findJob(c,ID=3);

Delete a job when you no longer need it. The job is removed from the Job Monitor.

delete(job(1));

To delete all jobs submitted to a particular cluster, pass all jobs associated with the cluster to the delete function.

delete(c.Jobs);

Use Job Monitor to Fetch Results

When you submit batch jobs, all the computations happen in the cluster and you can safely close MATLAB. You can check the status of your jobs by using the Job Monitor in another MATLAB session.

When a job is done, you can retrieve the results from the Job Monitor. In the Environment section on the Home tab, select Parallel > Monitor Jobs to open the Job Monitor. Then right-click a job to display the context menu. From this menu, you can:

  • Load the job into the workspace by clicking Show Details

  • Fetch the trained networks and their accuracies by clicking Fetch Outputs

  • Delete the job when you are done by clicking Delete

Supporting Functions

The updatePlot function adds a point to one of the subplots indicating the current training accuracy of a network. The function receives an animated line object and the current iteration and accuracy of a network.

function updatePlot(line,iteration,accuracy)

addpoints(line,iteration,accuracy);
drawnow limitrate nocallbacks

end
