Efficiently submit jobs to a SLURM cluster

8 views (last 30 days)
Joh Hag on 25 Oct 2018
Edited: Joh Hag on 26 Oct 2018
Hi everybody,

I have a computation that consists of 900 datasets; each dataset needs roughly 1 hour to process on a GPU. I have access to a SLURM cluster with multiple partitions. If not all nodes of a partition are in use, non-members can also submit jobs to those nodes. But when a member of that partition needs the node you are on, that member has priority and your job gets kicked.

In my current solution I use parfeval to carry out batched calculations of the datasets. But if someone with higher priority wants one of my allocated nodes, the whole computation (the job, in SLURM terms) dies. So I looked into putting each dataset into its own job and submitting each one separately to the queue. This is the code I have come up with so far:
% look up the free GPU nodes with
%   sinfo -N -p all -t idle -o "%20n %50f" | grep P100
% then we can directly adapt the number of workers via the script
max_num_workers = 5;
par_cluster = parcluster('my_cluster');
par_cluster.AdditionalProperties.AdditionalSubmitArgs = ...
    ['--nodes=1 --constraint=P100 --partition=all --time=1-00:00:00'];
%% local pool for parallel job submission
ppool = parpool('local');

%% set up bookkeeping
completed_idx = false(size(data, 4), 1);
running_idx = completed_idx;
jobs = cell(size(data, 4), 1);
%% set up jobs: one independent job per dataset
for ii = 1 : numel(jobs)
    jobs{ii} = createJob(par_cluster);
    % provide each worker with the files needed to run the reconstruction
    jobs{ii}.AttachedFiles = {'files ...'};
    guess = pad_to_size(reconstruction(:,:,ii), p.rec_height, p.rec_width);
    createTask(jobs{ii}, @eval_reconstruction, 7, ...
        {data(:,:,:,ii), pca_empty, guess, ...
         support, p, transit, ii});
end
%% submit jobs
for ii = 1 : numel(jobs)
    submit(jobs{ii})
end
%% get results of the jobs that have finished so far
finished_jobs = findJob(par_cluster, 'Type', 'independent', 'State', 'finished');
for ii = 1:numel(finished_jobs)
    result = fetchOutputs(finished_jobs(ii));
    % further process results...
    % clear the matching bookkeeping entry (match by job ID, since ii
    % indexes finished_jobs rather than the original jobs array)
    jobs(cellfun(@(j) ~isempty(j) && j.ID == finished_jobs(ii).ID, jobs)) = {[]};
    delete(finished_jobs(ii));
end
So the first question is: is this ansatz the right direction to go? My problem with this code is that job creation is fast at the beginning but gets really slow after the first 5 or so jobs have been created (not yet submitted). The submission itself is then also very slow. This is also the case when I just use the simple rand function from the documentation's example. Exchanging the for-loop for a parfor to speed things up yielded some interesting interference between the local workers: some workers wanted to write matlab_metadata.mat simultaneously, which did not go well... I ran this code with R2017b.
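For reference, a variant that might reduce the per-job overhead would be a single independent job holding one task per dataset, so there is only one createJob/submit round trip. This is a minimal, untested sketch that reuses the variables and the eval_reconstruction call from above; whether the tasks end up as separate SLURM jobs or one allocation depends on the cluster's integration scripts.

single_job = createJob(par_cluster);
single_job.AttachedFiles = {'files ...'};
for ii = 1 : size(data, 4)
    guess = pad_to_size(reconstruction(:,:,ii), p.rec_height, p.rec_width);
    createTask(single_job, @eval_reconstruction, 7, ...
        {data(:,:,:,ii), pca_empty, guess, support, p, transit, ii});
end
submit(single_job);
wait(single_job);                         % blocks; alternatively poll single_job.Tasks by State
all_outputs = fetchOutputs(single_job);   % one row of outputs per task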
Thanks for your input.

Answers (1)

Joh Hag on 26 Oct 2018
Edited: Joh Hag on 26 Oct 2018
Hi, I was able to create the jobs via spmd now. This circumvented the interference issues during parallel job creation and submission. What happens now is that the jobs disappear when the code exits the spmd region, and it is not possible to recover them from the cluster; this should in principle be possible if I understand the documentation correctly.
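In case it helps to narrow this down, this is roughly the pattern I expected to work: tag the jobs when creating them inside spmd, then look them up again later from a fresh cluster object. A minimal, untested sketch; 'recon_batch' is just a placeholder tag and the rand task stands in for the real reconstruction task.

spmd
    wc = parcluster('my_cluster');   % cluster object on each pool worker
    j = createJob(wc);
    j.Tag = 'recon_batch';           % tag the job so it can be found again later
    createTask(j, @rand, 1, {3});    % placeholder task
    submit(j);
end

% later, outside the spmd region (or in a new MATLAB session)
c = parcluster('my_cluster');
recovered = findJob(c, 'Type', 'independent', 'Tag', 'recon_batch');
for ii = 1:numel(recovered)
    if strcmp(recovered(ii).State, 'finished')
        out = fetchOutputs(recovered(ii));
    end
end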
best
-johannes
