Plugin Scripts for Generic Schedulers
The generic scheduler interface provides complete flexibility to configure the interaction of the MATLAB® client, MATLAB workers, and a third-party scheduler. The plugin scripts define how MATLAB interacts with your setup.
The following table lists the supported plugin script functions and the stage at which they are evaluated:
File Name | Stage |
independentSubmitFcn.m | Submitting an independent job |
communicatingSubmitFcn.m | Submitting a communicating job |
getJobStateFcn.m | Querying the state of a job |
cancelJobFcn.m | Canceling a job |
cancelTaskFcn.m | Canceling a task |
deleteJobFcn.m | Deleting a job |
deleteTaskFcn.m | Deleting a task |
postConstructFcn.m | After creating a parallel.cluster.Generic instance |
These plugin scripts are evaluated only if they have the expected file name and are located in the folder specified by the PluginScriptsLocation property of the cluster. For more information about how to configure a generic cluster profile, see Configure Using the Generic Scheduler Interface (MATLAB Parallel Server).
Note
The independentSubmitFcn.m file must exist to submit an independent job, and the communicatingSubmitFcn.m file must exist to submit a communicating job.
Sample Plugin Scripts
To support usage of the generic scheduler interface, plugin scripts for the following third-party schedulers are available to download from GitHub® repositories:
If the MATLAB client is unable to directly submit jobs to the scheduler, MATLAB supports the use of the ssh protocol to submit commands to a remote cluster.
If the client and the cluster nodes do not have a shared file system, MATLAB supports the use of sftp (SSH File Transfer Protocol) to copy job and task files between your computer and the cluster.
If you want to customize the behavior of the plugin scripts, you can set additional properties, such as AdditionalSubmitArgs. For more information, see Customize Behavior of Sample Plugin Scripts (MATLAB Parallel Server).
If your scheduler or cluster configuration is not supported by one of the repositories, modify the scripts of the package that most closely matches your setup. For more information on how to write a set of plugin scripts for generic schedulers, see Plugin Scripts for Generic Schedulers.
Wrapper Scripts
The sample plugin scripts use wrapper scripts to simplify the implementation of independentSubmitFcn.m and communicatingSubmitFcn.m. These scripts are not required; however, using them is good practice because it makes your code more readable. This table describes these scripts:
File Name | Description |
independentJobWrapper.sh | Used in independentSubmitFcn.m to embed a call to the MATLAB executable with the appropriate arguments. It uses environment variables for the location of the executable and its arguments. For an example of its use, see Sample script for a SLURM scheduler. |
communicatingJobWrapper.sh | Used in communicatingSubmitFcn.m to distribute a communicating job in your cluster. This script implements the steps in Submit scheduler job to launch MPI process. For an example of its use, see Sample script for a SLURM scheduler. |
Writing Custom Plugin Scripts
Note
When writing your own plugin scripts, it is a good practice to start by modifying one of the sample plugin scripts that most closely matches your setup (see Sample Plugin Scripts).
independentSubmitFcn
When you submit an independent job to a generic cluster, the independentSubmitFcn.m function executes in the MATLAB client session.
The declaration line of this function must be:
function independentSubmitFcn(cluster,job,environmentProperties)
Each task in a MATLAB independent job corresponds to a single job on your scheduler. The purpose of this function is to submit N jobs to your third-party scheduler, where N is the number of tasks in the independent job. Each of these jobs must:
Set the five environment variables required by the worker MATLAB to identify the individual task to run. For more information, see Configure the worker environment.
Call the appropriate MATLAB executable to start the MATLAB worker and run the task. For more information, see Submit scheduler jobs to run MATLAB workers.
Configure the worker environment. This table identifies the five environment variables and values that must be set on the worker MATLAB to run an individual task:
Environment Variable Name | Environment Variable Value |
PARALLEL_SERVER_DECODE_FUNCTION | 'parallel.cluster.generic.independentDecodeFcn' |
PARALLEL_SERVER_STORAGE_CONSTRUCTOR | environmentProperties.StorageConstructor |
PARALLEL_SERVER_STORAGE_LOCATION | environmentProperties.StorageLocation |
PARALLEL_SERVER_JOB_LOCATION | environmentProperties.JobLocation |
PARALLEL_SERVER_TASK_LOCATION | environmentProperties.TaskLocations{n} for the nth task |
Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.
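When the scheduler does not copy the client environment, the variables have to travel on the submission command line. The following is a minimal sketch of that idea and is not part of the sample plugin scripts: formatVariableList is a hypothetical helper that reads variables already set with setenv and joins them into a comma-separated NAME=value list. Some schedulers accept such a list through an option such as qsub -v on PBS-style schedulers; treat that as an assumption and check your scheduler's documentation for the exact option and quoting rules, because values containing commas or spaces need extra escaping.

```matlab
function variableList = formatVariableList(variableNames)
% formatVariableList  Build a NAME=value list from client environment variables.
% Hypothetical helper for illustration only.
%
% Example use inside independentSubmitFcn, after the setenv calls
% (the qsub -v option is an assumption about your scheduler):
%   names = {'PARALLEL_SERVER_DECODE_FUNCTION', 'PARALLEL_SERVER_JOB_LOCATION'};
%   commandToRun = sprintf('qsub -v %s independentJobWrapper.sh', ...
%       formatVariableList(names));

pairs = cell(1, numel(variableNames));
for ii = 1:numel(variableNames)
    name = variableNames{ii};
    % getenv returns an empty character vector for variables that are not set.
    pairs{ii} = sprintf('%s=%s', name, getenv(name));
end
variableList = strjoin(pairs, ',');
end
```

Reading the values back with getenv keeps the helper independent of environmentProperties, so the same list can be built for independent and communicating jobs.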
Submit scheduler jobs to run MATLAB workers. Once the five required parameters for a given job and task are defined on a worker, the task is run by calling the MATLAB executable with suitable arguments. The MATLAB executable to call is defined in environmentProperties.MatlabExecutable. The arguments to pass are defined in environmentProperties.MatlabArguments.
Note
If you cannot submit directly to your scheduler from the client machine, see Submitting from a Remote Host for instructions on how to submit using ssh.
Sample script for a SLURM scheduler. This script shows a basic submit function for a SLURM scheduler with a shared file system. For a more complete example, see Sample Plugin Scripts.
function independentSubmitFcn(cluster,job,environmentProperties)

    % Specify the required environment variables.
    setenv('PARALLEL_SERVER_DECODE_FUNCTION', 'parallel.cluster.generic.independentDecodeFcn');
    setenv('PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
    setenv('PARALLEL_SERVER_STORAGE_LOCATION', environmentProperties.StorageLocation);
    setenv('PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation);

    % Specify the MATLAB executable and arguments to run on the worker.
    % These are used in the independentJobWrapper.sh script.
    setenv('PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable);
    setenv('PARALLEL_SERVER_MATLAB_ARGS', environmentProperties.MatlabArguments);

    for ii = 1:environmentProperties.NumberOfTasks
        % Specify the environment variable required to identify which task to run.
        setenv('PARALLEL_SERVER_TASK_LOCATION', environmentProperties.TaskLocations{ii});

        % Specify the command to submit the job to the SLURM scheduler.
        % SLURM will automatically copy environment variables to workers.
        commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
        [cmdFailed, cmdOut] = system(commandToRun);
    end
end
The previous example submits a simple bash script, independentJobWrapper.sh, to the scheduler. The independentJobWrapper.sh script embeds the MATLAB executable and arguments using environment variables:
#!/bin/sh
# PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use
# PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use
exec "${PARALLEL_SERVER_MATLAB_EXE}" ${PARALLEL_SERVER_MATLAB_ARGS}
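The sample submit function stores the exit status of system in cmdFailed but never inspects it, so a rejected submission passes silently. The following is a minimal sketch of one way to surface such failures. The function name runSchedulerCommand and the error identifier are hypothetical and are not part of the sample plugin scripts.

```matlab
function cmdOut = runSchedulerCommand(commandToRun)
% runSchedulerCommand  Run a shell command and raise an error if it fails.
% Hypothetical helper for illustration only.

[cmdFailed, cmdOut] = system(commandToRun);
cmdOut = strtrim(cmdOut);

% system returns a nonzero status when the command fails.
if cmdFailed ~= 0
    error('genericPlugin:commandFailed', ...
        'Command "%s" failed with status %d:\n%s', ...
        commandToRun, cmdFailed, cmdOut);
end
end
```

In a submit function, the call to system could then be replaced with cmdOut = runSchedulerCommand(commandToRun); so that the error reaches the MATLAB client instead of leaving a job that never starts.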
communicatingSubmitFcn
When you submit a communicating job to a generic cluster, the communicatingSubmitFcn.m function executes in the MATLAB client session.
The declaration line of this function must be:
function communicatingSubmitFcn(cluster,job,environmentProperties)
The purpose of this function is to submit a single job to your scheduler. This job must:
Set the four environment variables required by the MATLAB workers to identify the job to run. For more information, see Configure the worker environment.
Call MPI to distribute your job to N MATLAB workers. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job. For more information, see Submit scheduler job to launch MPI process.
Configure the worker environment. This table identifies the four environment variables and values that must be set on the worker MATLAB to run a task of a communicating job:
Environment Variable Name | Environment Variable Value |
PARALLEL_SERVER_DECODE_FUNCTION | 'parallel.cluster.generic.communicatingDecodeFcn' |
PARALLEL_SERVER_STORAGE_CONSTRUCTOR | environmentProperties.StorageConstructor |
PARALLEL_SERVER_STORAGE_LOCATION | environmentProperties.StorageLocation |
PARALLEL_SERVER_JOB_LOCATION | environmentProperties.JobLocation |
Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.
Submit scheduler job to launch MPI process. After you define the four required parameters for a given job, run your job by launching N worker MATLAB processes using mpiexec. mpiexec is software shipped with the Parallel Computing Toolbox™ that implements the Message Passing Interface (MPI) standard to allow communication between the worker MATLAB processes. For more information about mpiexec, see the MPICH home page.
To run your job, you must submit a job to your scheduler, which executes the following steps. Note that matlabroot refers to the MATLAB installation location on your worker nodes.
Request N processes from the scheduler. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job.
Call mpiexec to start worker MATLAB processes. The number of worker MATLAB processes to start on each host should match the number of processes allocated by your scheduler. The mpiexec executable is located at matlabroot/bin/mw_mpiexec.
The mpiexec command automatically forwards environment variables to the launched processes. Therefore, ensure the environment variables listed in Configure the worker environment are set before running mpiexec.
To learn more about options for mpiexec, see Using the Hydra Process Manager.
Note
For a complete example of the previous steps, see the communicatingJobWrapper.sh script provided with any of the sample plugin scripts in Sample Plugin Scripts. Use this script as a starting point if you need to write your own script.
Sample script for a SLURM scheduler. The following script shows a basic submit function for a SLURM scheduler with a shared file system. The submitted job is contained in a bash script, communicatingJobWrapper.sh. This script implements the relevant steps in Submit scheduler job to launch MPI process for a SLURM scheduler. For a more complete example, see Sample Plugin Scripts.
function communicatingSubmitFcn(cluster,job,environmentProperties)

    % Specify the four required environment variables.
    setenv('PARALLEL_SERVER_DECODE_FUNCTION', 'parallel.cluster.generic.communicatingDecodeFcn');
    setenv('PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
    setenv('PARALLEL_SERVER_STORAGE_LOCATION', environmentProperties.StorageLocation);
    setenv('PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation);

    % Specify the MATLAB executable and arguments to run on the worker.
    % Specify the location of the MATLAB install on the cluster nodes.
    % These are used in the communicatingJobWrapper.sh script.
    setenv('PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable);
    setenv('PARALLEL_SERVER_MATLAB_ARGS', environmentProperties.MatlabArguments);
    setenv('PARALLEL_SERVER_CMR', cluster.ClusterMatlabRoot);

    numberOfTasks = environmentProperties.NumberOfTasks;

    % Specify the command to submit a job to the SLURM scheduler which
    % requests as many processes as tasks in the job.
    % SLURM will automatically copy environment variables to workers.
    commandToRun = sprintf('sbatch --ntasks=%d communicatingJobWrapper.sh', numberOfTasks);
    [cmdFailed, cmdOut] = system(commandToRun);
end
getJobStateFcn
When you query the state of a job created with a generic cluster, the getJobStateFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function state = getJobStateFcn(cluster,job,state)
When you use a third-party scheduler, the scheduler can have more up-to-date information about your jobs than the toolbox can read from the local job storage location. This is especially true with a nonshared file system, where the remote file system can be slow to propagate large data files back to your local data location.
To retrieve that information from the scheduler, add a function called getJobStateFcn.m to the PluginScriptsLocation of your cluster.
The state passed into this function is the state derived from the local job storage. The body of this function can then query the scheduler to determine a more accurate state for the job and return it in place of the stored state. The function you write for this purpose must return a valid value for the state of a job object. Allowed values are 'pending', 'queued', 'running', 'finished', or 'failed'.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
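As an illustration of the query described above, the following is a minimal sketch of a getJobStateFcn for a SLURM scheduler. It is not taken from the sample plugin scripts. It assumes that the submit function stored the scheduler job IDs in a ClusterJobIDs field of the job cluster data, that squeue accepts the --noheader, --format, and --jobs options shown, and that the listed SLURM state names are the ones your installation reports. The helper mapSlurmStates and its mapping policy are hypothetical.

```matlab
function state = getJobStateFcn(cluster, job, state)
% Sketch of a getJobStateFcn for SLURM (illustration only).

% Only queued or running jobs can still change state on the scheduler.
if ~any(strcmp(state, {'queued', 'running'}))
    return
end

data = cluster.getJobClusterData(job);
if isempty(data) || ~isfield(data, 'ClusterJobIDs')
    return  % No scheduler IDs were stored for this job.
end

% Convert each stored ID to a character vector, in case it was saved
% as a one-element cell array of regexp tokens.
jobIDs = cellfun(@char, data.ClusterJobIDs, 'UniformOutput', false);

% Assumption: with these options squeue prints one state name per job.
commandToRun = sprintf('squeue --noheader --format=%%T --jobs=%s', ...
    strjoin(reshape(jobIDs, 1, []), ','));
[cmdFailed, cmdOut] = system(commandToRun);
if cmdFailed ~= 0
    return  % Keep the stored state if the query fails.
end

state = mapSlurmStates(strsplit(strtrim(cmdOut)), state);
end

function state = mapSlurmStates(slurmStates, storedState)
% Hypothetical helper: reduce SLURM state names to one MATLAB job state.
slurmStates = slurmStates(~cellfun(@isempty, slurmStates));
failedStates = {'FAILED', 'CANCELLED', 'TIMEOUT', 'NODE_FAIL'};

if isempty(slurmStates)
    state = storedState;  % The scheduler no longer lists the job.
elseif any(ismember(slurmStates, {'RUNNING', 'COMPLETING'}))
    state = 'running';
elseif all(strcmp(slurmStates, 'PENDING'))
    state = 'queued';
elseif all(ismember(slurmStates, failedStates))
    state = 'failed';
else
    state = storedState;  % Mixed or unknown states: trust local storage.
end
end
```

Falling back to the stored state whenever the scheduler gives no clear answer keeps the function from reporting a state that the job files contradict.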
cancelJobFcn
When you cancel a job created with a generic cluster, the cancelJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function OK = cancelJobFcn(cluster,job)
When you cancel a job created using the generic scheduler interface, by default this action affects only the job data in storage. To cancel the corresponding jobs on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called cancelJobFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the queue. The function must return a logical scalar indicating the success or failure of canceling the jobs on the scheduler.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
cancelTaskFcn
When you cancel a task created with a generic cluster, the cancelTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function OK = cancelTaskFcn(cluster,task)
When you cancel a task created using the generic scheduler interface, by default this action affects only the task data in storage. To cancel the corresponding job on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called cancelTaskFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue. The function must return a logical scalar indicating the success or failure of canceling the job on the scheduler.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
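The following is a minimal sketch of a cancelTaskFcn for an independent job on a SLURM scheduler. It is not taken from the sample plugin scripts. It assumes that the submit function stored one scheduler job ID per task, in task order, in a ClusterJobIDs field of the job cluster data, and that the task object exposes its job through task.Parent and its index through task.ID. The helper lookupSchedulerID is hypothetical.

```matlab
function OK = cancelTaskFcn(cluster, task)
% Sketch of a cancelTaskFcn for SLURM (illustration only).

OK = true;
data = cluster.getJobClusterData(task.Parent);
if isempty(data) || ~isfield(data, 'ClusterJobIDs')
    return  % Nothing was submitted to the scheduler for this job.
end

schedulerID = lookupSchedulerID(data.ClusterJobIDs, task.ID);
if isempty(schedulerID)
    return  % No scheduler job is recorded for this task.
end

% Tell the SLURM scheduler to cancel the job that runs this task.
[cmdFailed, ~] = system(sprintf('scancel ''%s''', schedulerID));
OK = (cmdFailed == 0);
end

function schedulerID = lookupSchedulerID(clusterJobIDs, taskID)
% Hypothetical helper: return the scheduler job ID stored for a task,
% or an empty character vector if none was stored.
schedulerID = '';
if taskID >= 1 && taskID <= numel(clusterJobIDs)
    schedulerID = char(clusterJobIDs{taskID});
end
end
```

For a communicating job, all tasks share a single scheduler job, so a lookup by task index does not apply and canceling one task would cancel the whole scheduler job.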
deleteJobFcn
When you delete a job created with a generic cluster, the deleteJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function deleteJobFcn(cluster,job)
When you delete a job created using the generic scheduler interface, by default this action affects only the job data in storage. To remove the corresponding jobs on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called deleteJobFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the scheduler queue.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
deleteTaskFcn
When you delete a task created with a generic cluster, the deleteTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function deleteTaskFcn(cluster,task)
When you delete a task created using the generic scheduler interface, by default this action affects only the task data in storage. To remove the corresponding job on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called deleteTaskFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
postConstructFcn
After you create an instance of your cluster in MATLAB, the postConstructFcn.m function executes in the MATLAB client session. For example, the following line of code creates an instance of your cluster and runs the postConstructFcn function associated with the 'myProfile' cluster profile:
c = parcluster('myProfile');
The declaration line of the postConstructFcn function must be:
function postConstructFcn(cluster)
If you need to perform custom configuration of your cluster before its use, add a function called postConstructFcn.m to the PluginScriptsLocation of your cluster. The body of this function can contain any extra setup steps you require.
Adding User Customization
If you need to modify the functionality of your plugin scripts at run time, then use the AdditionalProperties property of the generic scheduler interface.
As an example, consider the SLURM scheduler. The submit command for SLURM accepts a --nodelist argument that allows you to specify the nodes you want to run on. You can change the value of this argument without having to modify your plugin scripts. To add this functionality, include the following code pattern in either your independentSubmitFcn.m or communicatingSubmitFcn.m script:
% Basic SLURM submit command
submitCommand = 'sbatch';

% Check if property is defined
if isprop(cluster.AdditionalProperties, 'NodeList')
    % Add appropriate argument and value to submit string
    submitCommand = [submitCommand ' --nodelist=' cluster.AdditionalProperties.NodeList];
end
For an example of how to use this coding pattern, see the nonshared submit functions of the scripts in Sample Plugin Scripts.
Alternatively, to modify the submit command for both independent and communicating jobs, include the code pattern above in your getCommonSubmitArgs function. The getCommonSubmitArgs function is a helper function included in the sample plugin scripts that you can use to modify the submit command for both types of jobs.
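When several optional properties are involved, the same pattern can be collected in one place. The following is a minimal sketch under stated assumptions, not the getCommonSubmitArgs function from the sample plugin scripts: appendOptionalArgs is a hypothetical helper, and Partition is an illustrative property name paired with the sbatch --partition option. The helper also accepts a struct in place of cluster.AdditionalProperties so that it can be tried without a cluster.

```matlab
function submitCommand = appendOptionalArgs(submitCommand, props, argMap)
% appendOptionalArgs  Append scheduler arguments for properties that are set.
% Hypothetical helper for illustration only.
%   props  - cluster.AdditionalProperties, or a struct for experimenting
%   argMap - N-by-2 cell array of {propertyName, argumentPrefix} pairs
%
% Example use inside a submit function:
%   argMap = {'NodeList', '--nodelist='; 'Partition', '--partition='};
%   submitCommand = appendOptionalArgs('sbatch', ...
%       cluster.AdditionalProperties, argMap);

for ii = 1:size(argMap, 1)
    name = argMap{ii, 1};
    if isstruct(props)
        isDefined = isfield(props, name);
    else
        isDefined = isprop(props, name);
    end
    % Skip properties that are missing or empty. Values are assumed to be
    % character vectors.
    if isDefined && ~isempty(props.(name))
        submitCommand = [submitCommand ' ' argMap{ii, 2} props.(name)]; %#ok<AGROW>
    end
end
end
```

Keeping the property-to-argument mapping in a single table means that supporting a new option is a one-line change shared by both submit functions.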
Setting AdditionalProperties from the Cluster Profile Manager
With the modification to your scripts in the previous example, you can add an AdditionalProperties entry to your generic cluster profile to specify a list of nodes to use. This provides a method of documenting customization added to your plugin scripts for anyone you share the cluster profile with.
To add the NodeList property to your cluster profile:
Start the Cluster Profile Manager from the MATLAB desktop by selecting Parallel > Create and Manage Cluster Profiles.
Select the profile for your generic cluster, and click Edit.
Navigate to the AdditionalProperties table, and click Add.
Enter NodeList as the Name.
Set String as the Type.
Set the Value to the list of nodes.
Setting AdditionalProperties from the MATLAB Command Line
With the modification to your scripts in Adding User Customization, you can edit the list of nodes from the MATLAB command line by setting the appropriate property of the cluster object before submitting a job:
c = parcluster;
c.AdditionalProperties.NodeList = 'gpuNodeName';
j = c.batch('myScript');
Display the AdditionalProperties object to see all currently defined properties and their values:
>> c.AdditionalProperties
ans =
  AdditionalProperties with properties:

                 ClusterHost: 'myClusterHost'
                    NodeList: 'gpuNodeName'
    RemoteJobStorageLocation: '/tmp/jobs'
Managing Jobs with Generic Scheduler
The first requirement for job management is to identify the jobs on the scheduler corresponding to a MATLAB job object. When you submit a job to the scheduler, the command that does the submission in your submit function can return some data about the job from the scheduler. This data typically includes a job ID. By storing that scheduler job ID with the MATLAB job object, you can later refer to the scheduler job by this job ID when you send management commands to the scheduler. Similarly, you can store a map of MATLAB task IDs to scheduler job IDs to help manage individual tasks. The toolbox function that stores this cluster data is setJobClusterData.
Save Job Scheduler Data
This example shows how to modify the independentSubmitFcn.m function to parse the output of each command submitted to a SLURM scheduler. You can use regular expressions to extract the scheduler job ID for each task and then store it using setJobClusterData.
% Pattern to extract scheduler job ID from SLURM sbatch output
searchPattern = '.*Submitted batch job ([0-9]+).*';

numberOfTasks = environmentProperties.NumberOfTasks;
jobIDs = cell(numberOfTasks, 1);
for ii = 1:numberOfTasks
    setenv('PARALLEL_SERVER_TASK_LOCATION', environmentProperties.TaskLocations{ii});
    commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
    [cmdFailed, cmdOut] = system(commandToRun);
    % regexp with 'tokens' and 'once' returns a cell array of tokens.
    % Keep the first token so that each job ID is a character vector.
    tokens = regexp(cmdOut, searchPattern, 'tokens', 'once');
    jobIDs{ii} = tokens{1};
end

% Set the job IDs on the job cluster data
cluster.setJobClusterData(job, struct('ClusterJobIDs', {jobIDs}));
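If sbatch prints something other than the expected line, for example because the submission was rejected, the regular expression does not match and no job ID is available to store. The following is a minimal sketch of a stricter extraction step. The function name extractSlurmJobId and the error identifier are hypothetical and are not part of the sample plugin scripts.

```matlab
function jobID = extractSlurmJobId(cmdOut)
% extractSlurmJobId  Return the job ID from sbatch output as a character vector.
% Hypothetical helper for illustration only.

% With 'tokens' and 'once', regexp returns a cell array holding the
% captured groups of the first match, or an empty cell array if there
% is no match.
tokens = regexp(cmdOut, 'Submitted batch job ([0-9]+)', 'tokens', 'once');
if isempty(tokens)
    error('genericPlugin:noJobId', ...
        'No scheduler job ID found in output:\n%s', cmdOut);
end
jobID = tokens{1};
end
```

Returning a character vector rather than the token cell array lets later code pass the ID directly to sprintf when it builds scheduler commands.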
Retrieve Job Scheduler Data
This example modifies the cancelJobFcn.m to cancel the corresponding jobs on the SLURM scheduler. The example uses getJobClusterData to retrieve job scheduler data.
function OK = cancelJobFcn(cluster, job)

    % Get the scheduler information for this job
    data = cluster.getJobClusterData(job);
    jobIDs = data.ClusterJobIDs;

    OK = true;
    for ii = 1:length(jobIDs)
        % Tell the SLURM scheduler to cancel the job
        commandToRun = sprintf('scancel ''%s''', jobIDs{ii});
        [cmdFailed, cmdOut] = system(commandToRun);
        % Report failure if any scancel command fails
        OK = OK && (cmdFailed == 0);
    end
end
Submitting from a Remote Host
If the MATLAB client is unable to submit directly to your scheduler, use parallel.cluster.RemoteClusterAccess to establish a connection and run commands on a remote host.
This object uses the ssh protocol, and hence requires an ssh daemon service running on the remote host. To establish a connection, you must either have an ssh agent running on your machine, or provide one of the following:
A user name and password
A valid identity file
Proper responses for multifactor authentication
The following code executes a command on a remote host, remoteHostname, as the user, user.
% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccess('remoteHostname', 'user');

% Execute a command on remoteHostname
[cmdFailed, cmdOut] = access.runCommand(commandToRun);
For an example of plugin scripts using remote host submission, see the remote submission mode in Sample Plugin Scripts.
Submitting Without a Shared File System
If the MATLAB client does not have a shared file system with the cluster nodes, use parallel.cluster.RemoteClusterAccess to establish a connection and copy job and task files between the client and cluster nodes.
This object uses the ssh protocol, and hence requires an ssh daemon service running on the remote host. To establish a connection, you must either have an ssh agent running on your machine, or provide one of the following:
A user name and password
A valid identity file
Proper responses for multifactor authentication
When using nonshared submission, you must specify both a local job storage location to use on the client and a remote job storage location to use on the cluster. The remote job storage location must be available to all nodes of the cluster.
parallel.cluster.RemoteClusterAccess uses file mirroring to continuously synchronize the local job and task files with those on the cluster. When file mirroring first starts, local job and task files are uploaded to the remote job storage location. As the job executes, the file mirroring continuously checks the remote job storage location for new files and updates, and copies the files to the local storage on the client. This procedure ensures the MATLAB client always has an up-to-date view of the jobs and tasks executing on the scheduler.
This example connects to the remote host, remoteHostname, as the user, user, and establishes /remote/storage as the remote cluster storage location to synchronize with. It then starts file mirroring for a job, copying the local files of the job to /remote/storage on the cluster, and then syncing any changes back to the local machine.
% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccessWithMirror('remoteHostname', '/remote/storage', 'user');

% Start file mirroring for job
access.startMirrorForJob(job);
For an example of plugin scripts without a shared file system, see the nonshared submission mode in Sample Plugin Scripts.
Related Topics
- Configure Using the Generic Scheduler Interface (MATLAB Parallel Server)