MATLAB Answers

Submitting batch jobs across multiple nodes using slurm

I have a workstation that I am currently using to run the following code structure:
A MATLAB script manages everything and iteratively calls a second wrapper function. Within this wrapper, I submit multiple jobs (each one is a model simulation requiring one core) using the batch command, wait for them all to complete, then return some output to the main script. This works fine on my computer running 12 jobs in parallel, but each model simulation takes 2-3 hours and I am limited by the number of cores on my machine. Ideally I would run ~50+ jobs in parallel to get reasonable run times.
I would like to get this working on the university cluster which uses the SLURM workload manager. My problem is that each node on this cluster does not have sufficient cores to get much of a speedup and so I need to submit the job to run on multiple nodes to take full advantage of the resources available. Of course I run into a problem because the main script only needs 1 core and so trying to split this over several nodes makes no sense to slurm and throws an error.
I am very much a beginner with Slurm, so presumably this is a mistake in how I configure the job submission. The script I am using is as follows:
#!/bin/bash
#SBATCH -J my_script
#SBATCH --output=/scratch/%u/%x-%N-%j.out
#SBATCH --error=/scratch/%u/%x-%N-%j.err
#SBATCH -p 24hour
#SBATCH --cpus-per-task=40
#SBATCH --nodes=2
#SBATCH --tasks=1
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user sebastian.rosier@northumbria.ac.uk
#SBATCH --exclusive
module load MATLAB/R2018a
srun -N 2 -n 1 -c 40 matlab -nosplash -nodesktop -r "my_script; quit;"
The model wrapper that submits multiple batch jobs is something like this:
c = parcluster;
for ii = 1:N
workerTable{ii} = batch(c,'my_model',1,{my_model_opts});
end
with additional lines to check job status and get results etc.
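The status checks and result collection mentioned above might look like the following minimal sketch. It assumes the default cluster profile; N, my_model_opts, and the anonymous my_model are hypothetical stand-ins for the real simulation, not the asker's actual code.

```matlab
% Minimal sketch: submit N single-task batch jobs, wait, collect outputs.
% N, my_model_opts and my_model are hypothetical stand-ins.
c = parcluster;                       % default profile, as in the question
N = 4;                                % hypothetical number of simulations
my_model_opts = struct('scale', 2);   % hypothetical model options
my_model = @(opts) opts.scale^2;      % hypothetical stand-in for the model

% Submit one job per simulation, each returning one output argument
jobs = cell(1, N);
for ii = 1:N
    jobs{ii} = batch(c, my_model, 1, {my_model_opts});
end

% Wait for each job, then fetch its output if the task did not error
results = cell(1, N);
for ii = 1:N
    wait(jobs{ii});                   % blocks until the job has finished
    if isempty(jobs{ii}.Tasks(1).Error)
        out = fetchOutputs(jobs{ii}); % cell array of output arguments
        results{ii} = out{1};
    else
        warning('Job %d failed: %s', ii, jobs{ii}.Tasks(1).Error.message);
    end
    delete(jobs{ii});                 % remove job data from job storage
end
```

Jobs that fail leave an empty entry in results, so the calling script can decide whether to resubmit them.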
Perhaps what I am trying to do makes no sense and I need to come up with a completely different structure to my MATLAB script. Either way, any help would be much appreciated!
Sebastian

Accepted Answer

Raymond Norris on 3 Mar 2021
Edited: Raymond Norris on 4 Mar 2021
Hi Sebastian,
I'm going to assume that my_script is the code "workerTable{ii} = ..."
There are several ways to approach this, but none require that your Slurm job request >1 node.
OPTION #1
As you've written it, you could request 1 node with 40 cores. Use the local profile to submit single core batch jobs on that one node.
#!/bin/bash
#SBATCH -J my_script
#SBATCH --output=/scratch/%u/%x-%N-%j.out
#SBATCH --error=/scratch/%u/%x-%N-%j.err
#SBATCH -p 24hour
#SBATCH --cpus-per-task=40
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user sebastian.rosier@northumbria.ac.uk
#SBATCH --exclusive
module load MATLAB/R2018a
matlab -nodesktop -r "my_script; quit"
OPTION #2
Same Slurm script, but with my_script modified to make it a bit more streamlined (though parfeval isn't much different from your call to batch).
% Start pool
c = parcluster;
sz = str2num(getenv('SLURM_CPUS_PER_TASK'))-1;
if isempty(sz)
sz = maxNumCompThreads-1;
end
p = c.parpool(sz);
parfor ii = 1:N
results{ii} = my_model(my_model_opts);
end
or
% Start pool
c = parcluster;
sz = str2num(getenv('SLURM_CPUS_PER_TASK'))-1;
if isempty(sz)
sz = maxNumCompThreads-1;
end
p = c.parpool(sz);
for ii = 1:N
f(ii) = p.parfeval(@my_model,1,my_model_opts);
end
% Run other code
...
% Now fetch the results
% fetchNext returns results in completion order; idx is the index into f
for ii = 1:N
[idx,value] = fetchNext(f);
results{idx} = value;
end
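The pool-size lines in Option #2 can be made slightly more defensive. The sketch below is one possible variant, not part of the original answer: it parses SLURM_CPUS_PER_TASK with str2double, falls back to maxNumCompThreads outside a Slurm job, and never returns a size below one. The reason for subtracting one is an assumption here (presumably to leave a core for the MATLAB client session).

```matlab
% Sketch: derive a parallel pool size from the Slurm allocation.
raw = getenv('SLURM_CPUS_PER_TASK');  % empty char outside a Slurm job
n = str2double(raw);                  % NaN if unset or not numeric
if isnan(n)
    n = maxNumCompThreads;            % fall back to the local core count
end
% Assumption: keep one core free for the client, but never go below 1
sz = max(n - 1, 1);
fprintf('Requesting a pool of %d workers\n', sz);
```

The resulting sz can then be passed to c.parpool(sz) exactly as in the code above.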
OPTION #3
Rather than sticking with a local profile, use a Slurm profile and then expand Option #2 to use a much larger parallel pool (notice in this Slurm script, we're only requesting a single core since parpool will request the larger pool of cores). This will make use of the MATLAB Parallel Server.
#!/bin/bash
#SBATCH -J my_script
#SBATCH --output=/scratch/%u/%x-%N-%j.out
#SBATCH --error=/scratch/%u/%x-%N-%j.err
#SBATCH -p 24hour
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user sebastian.rosier@northumbria.ac.uk
module load MATLAB/R2018a
matlab -nodesktop -r "my_script; quit"
We'll use parfor here, but we could have used parfeval as well. This assumes a 'slurm' profile has been created. Contact Technical Support (support@mathworks.com) if you need help.
c = parcluster('slurm');
p = c.parpool(100);
parfor ii = 1:N
results{ii} = my_model(my_model_opts);
end
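A fixed pool of 100 workers may be larger than the number of simulations or than the profile allows. The following sketch, which assumes the 'slurm' profile from the answer exists, caps the pool size and releases the workers when the loop ends or fails. N, the requested size, my_model_opts, and the anonymous my_model are hypothetical stand-ins.

```matlab
% Sketch: size the pool to the work available and release it afterwards.
profileName = 'slurm';                % profile assumed to exist (see above)
N = 50;                               % hypothetical number of simulations
requested = 100;                      % pool size used in the answer
my_model_opts = struct('scale', 2);   % hypothetical model options
my_model = @(opts) opts.scale^2;      % hypothetical stand-in for the model

c = parcluster(profileName);
% No point in more workers than iterations or than the profile permits
sz = min([requested, N, c.NumWorkers]);
p = c.parpool(sz);

results = cell(1, N);
try
    parfor ii = 1:N
        % feval avoids ambiguity between calling a function handle
        % and indexing a variable inside parfor
        results{ii} = feval(my_model, my_model_opts);
    end
catch err
    delete(p);                        % release the workers on failure
    rethrow(err);
end
delete(p);                            % release the workers when done
```

Deleting the pool promptly matters more here than with the local profile, because the workers are held as cluster resources for as long as the pool is open.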
Sebastian Rosier on 4 Mar 2021
Hi Raymond,
Thanks for the detailed answer! I'll have a go implementing this on the cluster and contact support if I run into further problems.
Sebastian

Release

R2018a