
Offload Experiments as Batch Jobs to Cluster

Since R2022a

By default, Experiment Manager runs your experiments interactively. While an experiment is running, you can monitor the progress of each trial in a table of results and in a training plot. However, running an experiment interactively limits your access to MATLAB® functionality. For example, during training, you cannot close the project that contains the experiment or run other experiments.

If you have Parallel Computing Toolbox™ and MATLAB Parallel Server™, you can send your experiment as a batch job to a remote cluster. While the experiment is running in the cluster, you can:

  • Run another experiment interactively, or start another batch job that uses the same experiment, a different experiment in the same project, or an experiment in a different project.

  • Close the Experiment Manager app and continue using MATLAB.

  • Close your MATLAB session.

If you have only Parallel Computing Toolbox, you can use a local cluster profile to develop and test your experiments on your client machine instead of running them on a network cluster. If you close your MATLAB session, any batch jobs that are using the local cluster profile also stop immediately.
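Before you rely on a local cluster profile, it can help to check from the command line which profile MATLAB uses by default and how many workers it offers. The following is a minimal sketch, assuming that Parallel Computing Toolbox is installed and that your default profile is still the local one (the name of the local profile depends on your release).

```matlab
% Sketch only: requires Parallel Computing Toolbox.
% parcluster with no arguments returns a cluster object for the default
% profile. This is assumed to be the local profile, which is the case
% unless you have changed the default.
c = parcluster;

% Report the profile name and the number of workers the profile allows.
fprintf("Profile: %s, workers available: %d\n", c.Profile, c.NumWorkers);
```

The number of workers reported here is an upper bound on what a batch job using this profile can request.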

Create Batch Job on Cluster

To start a batch job for your experiment:

  1. Configure your experiment, as described in Configure Built-In Training Experiment or Configure Custom Training Experiment.

    Tip

    Load training and validation data from a location that is accessible to all your workers. For example, store your data outside the project and access the data by using an absolute path. Alternatively, create a datastore object that can access the data on another machine by setting up the AlternateFileSystemRoots property of the datastore. For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

  2. In the Experiment Manager toolstrip, under Execution, specify an execution Mode:

    • To run one trial of the experiment at a time, select Batch Sequential. Experiment Manager does not support this execution mode when you set the training option ExecutionEnvironment to "multi-gpu".

    • To run multiple trials at the same time, select Batch Simultaneous. Experiment Manager does not support this execution mode when you set the training option ExecutionEnvironment to "multi-gpu" or "parallel" or when you enable the training option DispatchInBackground.

  3. Use the Cluster list to select a cluster profile to use for your batch job. To create and manage cluster profiles, open the Cluster Profile Manager. For more information, see Discover Clusters and Use Cluster Profiles (Parallel Computing Toolbox).

  4. In the Pool Size field, enter the number of workers for your batch job.

    • In Batch Sequential mode, use this field to configure the number of parallel workers that collaborate on each trial of the experiment. If you set the pool size to 0, the experiment runs on a single worker.

    • In Batch Simultaneous mode, use this field to specify the number of trials that the cluster runs at the same time.

    Because Experiment Manager uses an additional worker to run the batch job, the cluster must have at least one more worker available than the number you specify in the Pool Size field. For example, if you specify a pool size of 2, the cluster must have at least three workers available (two workers for the experiment and an additional worker to run the batch job). For more information, see Run a Batch Job with a Parallel Pool (Parallel Computing Toolbox).

  5. Click Run. Experiment Manager uses the batch (Parallel Computing Toolbox) function to run the experiment on the specified cluster.
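The tip in step 1 mentions the AlternateFileSystemRoots property of a datastore. As an illustration, the sketch below creates an image datastore whose files are reachable under one root on the client and under a different root on the cluster workers. Both paths are hypothetical placeholders, and the sketch assumes a Windows client and image data organized in one subfolder per class.

```matlab
% Hypothetical paths: replace them with locations that exist in your setup.
clientRoot  = "Z:\DataSet";            % data root as seen from the client
clusterRoot = "/nfs-bldg001/DataSet";  % same data as seen from the workers

% Create the datastore on the client. The two roots are declared
% equivalent, so a worker that cannot resolve the client path looks for
% the same files under the cluster path instead.
imds = imageDatastore(clientRoot + "\images", ...
    "IncludeSubfolders",true, ...
    "LabelSource","foldernames", ...
    "AlternateFileSystemRoots",[clientRoot,clusterRoot]);
```

In a setup function for an experiment, you would typically return or use a datastore created this way so that each worker resolves the file locations for its own file system.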

While the batch job runs your experiment, you can close Experiment Manager and recover the results later. To monitor batch jobs, use the Job Monitor, as described in Send Deep Learning Batch Job to Cluster.
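To make the worker count from step 4 concrete, the following sketch submits an ordinary batch job with a parallel pool from the command line. It is not the job that Experiment Manager creates. The task is a placeholder anonymous function, the default cluster profile is an assumption, and the job is one that you own and manage yourself.

```matlab
% Sketch only: requires Parallel Computing Toolbox and a configured cluster.
c = parcluster;      % default cluster profile (an assumption)
poolSize = 2;        % workers in the pool attached to the job

% The batch job occupies one worker in addition to the pool workers.
requiredWorkers = poolSize + 1;
assert(c.NumWorkers >= requiredWorkers, ...
    "The cluster must offer at least %d workers.", requiredWorkers);

% Placeholder task with one output and one input argument.
job = batch(c, @(n) sum(rand(n,1)), 1, {1e6}, "Pool", poolSize);

wait(job);                 % block until the job leaves the queue and ends
disp(job.State)            % current state of the job
out = fetchOutputs(job);   % cell array that holds the task output

% Remove the data for this hand-made job from the cluster. Do not do this
% for jobs created by Experiment Manager; use the app for those instead.
delete(job);
```

With a pool size of 2, the check requires three workers, which matches the example in step 4.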

Job Monitor showing a batch job created with Experiment Manager.

Note

Using the Job Monitor to cancel or delete jobs that you create with Experiment Manager can lead to unexpected behavior. Instead, cancel and delete these batch jobs by using Experiment Manager.

Track Progress of Batch Job

When you start a batch job for an experiment, a table of results displays training and validation metrics (such as RMSE and loss) for each trial. Experiment Manager does not continually communicate with the cluster to update the values in this table. Instead, to retrieve the latest metric values and the training plot for an experiment running on a cluster, click the Refresh button above the results table.

Results table showing Refresh button.

Interrupt Training in Batch Job

To cancel a batch job running an experiment, in the Experiment Manager toolstrip, click Cancel. Experiment Manager marks any running and queued trials as Canceled and discards their results.

Batch execution does not support stopping, canceling, or restarting individual trials of an experiment.

Retrieve Results and Clean Up Data

To download the training results for a completed trial, in the Actions column of the results table, click the Download button for the trial. Experiment Manager saves the training results that you download from the cluster, so you can access them after you close your MATLAB session.

  • For built-in training experiments, Experiment Manager downloads the trained network and training information from the cluster.

  • For custom training experiments, Experiment Manager downloads the training output from the cluster.

Results table showing download button for a completed trial.

After you download the training results from the cluster, you can export these results to the workspace and perform additional computations to evaluate the quality of the training.

  • For built-in training experiments, select Export > Trained Network or Export > Training Information.

  • For custom training experiments, select Export > Training Output.

After you retrieve all the results that you need and no longer need the job, delete it from the cluster so that it does not consume resources unnecessarily. To delete the batch job and permanently discard the training results, training plots, and confusion matrices for any trials that you have not downloaded from the cluster, click the Clean up button above the results table.
