Efficiently submit jobs to a SLURM cluster

8 views (last 30 days)
Joh Hag on 25 Oct 2018
Edited: Joh Hag on 26 Oct 2018
Hi everybody,

I have a computation that consists of 900 datasets; each dataset needs roughly 1 hour to process on a GPU. I have access to a SLURM cluster with multiple partitions. If not all nodes of a partition are in use, non-members can also submit jobs to those nodes. But when a member of that partition needs the node you are on, that member has priority and your job gets kicked.

In my current solution I use parfeval to carry out batched calculations of the datasets. But if someone with higher priority wants one of my allocated nodes, the whole computation (the job, in SLURM terms) dies. So I looked into putting each dataset into its own job and submitting each one separately to the queue. This is the code I have come up with so far:
% look up the free GPU nodes with
%   sinfo -N -p all -t idle -o "%20n %50f" | grep P100
% then we can directly adapt the number of workers via the script
max_num_workers = 5;
par_cluster = parcluster('my_cluster');
par_cluster.AdditionalProperties.AdditionalSubmitArgs = ...
    ['--nodes=1 --constraint=P100 --partition=all --time=1-00:00:00'];
%% local pool for parallel job submission
ppool = parpool('local');

%% set up bookkeeping
completed_idx = false(size(data, 4), 1);
running_idx = completed_idx;
jobs = cell(size(data, 4), 1);
%% set up jobs: one independent job per dataset
for ii = 1 : numel(jobs)
    jobs{ii} = createJob(par_cluster);
    % provide each worker with the files needed to run the reconstruction
    jobs{ii}.AttachedFiles = {'files ...'};
    guess = pad_to_size(reconstruction(:,:,ii), p.rec_height, p.rec_width);
    createTask(jobs{ii}, @eval_reconstruction, 7, ...
        {data(:,:,:,ii), pca_empty, guess, ...
         support, p, transit, ii});
end
%% submit jobs
for ii = 1 : numel(jobs)
    submit(jobs{ii})
end
%% get results of the jobs that have finished so far
finished_jobs = findJob(par_cluster, 'Type', 'independent', 'State', 'finished');
for ii = 1:numel(finished_jobs)
    result = fetchOutputs(finished_jobs(ii));
    % further process results...
    % clear the matching bookkeeping entry (match by job ID, since ii
    % indexes finished_jobs rather than the original jobs array)
    jobs(cellfun(@(j) ~isempty(j) && j.ID == finished_jobs(ii).ID, jobs)) = {[]};
    delete(finished_jobs(ii));
end
So the first question is: is this ansatz the right direction to go? My problem with this code is that job creation is fast at the beginning but gets really slow after the first 5 or so jobs have been created (not yet submitted). The submission itself is then also very slow. This is also the case when I just use the simple rand function from the documentation's example. Exchanging the for-loop for a parfor to speed things up yielded some interesting interference between the local workers: some workers wanted to write matlab_metadata.mat simultaneously, which did not go well... I ran this code with R2017b.
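For reference, a variant that might reduce the per-job overhead would be a single independent job holding one task per dataset, so there is only one createJob/submit round trip. This is a minimal, untested sketch that reuses the variables and the eval_reconstruction call from above; whether the tasks end up as separate SLURM jobs or one allocation depends on the cluster's integration scripts.

single_job = createJob(par_cluster);
single_job.AttachedFiles = {'files ...'};
for ii = 1 : size(data, 4)
    guess = pad_to_size(reconstruction(:,:,ii), p.rec_height, p.rec_width);
    createTask(single_job, @eval_reconstruction, 7, ...
        {data(:,:,:,ii), pca_empty, guess, support, p, transit, ii});
end
submit(single_job);
wait(single_job);                         % blocks; alternatively poll single_job.Tasks by State
all_outputs = fetchOutputs(single_job);   % one row of outputs per task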
Thanks for your input.

Answers (1)

Joh Hag on 26 Oct 2018
Edited: Joh Hag on 26 Oct 2018
Hi, I was able to create the jobs via spmd now. This circumvented the interference issues during parallel job creation and submission. What happens now is that the jobs disappear when the code exits the spmd region, and it is not possible to recover them from the cluster; this should in principle be possible if I understand the documentation correctly.
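In case it helps to narrow this down, this is roughly the pattern I expected to work: tag the jobs when creating them inside spmd, then look them up again later from a fresh cluster object. A minimal, untested sketch; 'recon_batch' is just a placeholder tag and the rand task stands in for the real reconstruction task.

spmd
    wc = parcluster('my_cluster');   % cluster object on each pool worker
    j = createJob(wc);
    j.Tag = 'recon_batch';           % tag the job so it can be found again later
    createTask(j, @rand, 1, {3});    % placeholder task
    submit(j);
end

% later, outside the spmd region (or in a new MATLAB session)
c = parcluster('my_cluster');
recovered = findJob(c, 'Type', 'independent', 'Tag', 'recon_batch');
for ii = 1:numel(recovered)
    if strcmp(recovered(ii).State, 'finished')
        out = fetchOutputs(recovered(ii));
    end
end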
best
-johannes
