Why do workers keep aborting during parallel computation on a cluster?
I keep getting the warning
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)
when I run a simulation that has a parfor loop on the cluster. I noticed that workers abort execution one after another, and this seems to happen more often on the cluster than on my PC.
I would like to know the cause of this issue. Is there a way to avoid it?
Thanks.
19 comments
Mario Malic
7 Dec 2020
What kind of simulation?
Muh Alam
7 Dec 2020
Kojiro Saito
8 Dec 2020
matlab_crash_dump files might be stored in the JobStorageLocation of the parallel workers.
c=parcluster();
c.JobStorageLocation
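To look for those crash dumps from the MATLAB client, a minimal sketch along these lines may help. It assumes the default cluster profile is the one used for the failing job, that JobStorageLocation is a plain folder path, and that dump file names start with matlab_crash_dump. The exact location and naming can vary by release and platform, so treat the search pattern as an assumption.

```matlab
% Sketch: search the job storage folder (recursively) for crash dump files.
% Assumption: dumps are named matlab_crash_dump*; adjust the pattern if needed.
c = parcluster();                      % default cluster profile
storageDir = c.JobStorageLocation;     % folder where job data is stored

% '**' makes dir search subfolders as well (R2016b or later).
dumps = dir(fullfile(storageDir, '**', 'matlab_crash_dump*'));

if isempty(dumps)
    fprintf('No crash dump files found under %s\n', storageDir);
else
    for k = 1:numel(dumps)
        fprintf('%s  (%s)\n', fullfile(dumps(k).folder, dumps(k).name), dumps(k).date);
    end
end
```

If nothing turns up there, the dumps may have been written elsewhere, for example to the home or temporary folder on the compute node.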
Muh Alam
9 Dec 2020
Kojiro Saito
9 Dec 2020
Does your code have file I/O? For example, save.
Parallel workers might crash if multiple workers try to write to the same file.
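If saving inside the loop is needed, one common way to avoid that conflict is to have every iteration write its own file. The following is only a sketch: outDir, the result_%d.mat naming, and the helper saveResult are hypothetical names, and on a cluster outDir would have to be a folder all workers can write to rather than tempdir.

```matlab
% Sketch: each iteration writes its own file, so no two workers share a file.
% outDir, result_%d.mat, and saveResult are hypothetical names.
outDir = fullfile(tempdir, 'parfor_results');
if ~exist(outDir, 'dir')
    mkdir(outDir);
end

parfor n = 1:8
    result = n^2;                                   % stand-in for the real computation
    fname = fullfile(outDir, sprintf('result_%d.mat', n));
    saveResult(fname, result);                      % save through a helper function
end

function saveResult(fname, result)
% Calling save directly in a parfor body violates transparency,
% so the call is wrapped in a function.
save(fname, 'result');
end
```

Local functions in a script need R2016b or later. In older releases, saveResult would go in its own file.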
Muh Alam
9 Dec 2020
Kojiro Saito
10 Dec 2020
No, I meant save inside the parfor loop. Since you're calling save after the parfor loop, it's safe.
Did you try setting the SpmdEnabled option to false?
parpool('SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
10 Dec 2020
Kojiro Saito
10 Dec 2020
OK. Does this also occur if you request fewer workers?
For example:
parpool(2, 'SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
10 Dec 2020
Kojiro Saito
11 Dec 2020
Does your cluster have enough resources?
On Linux, running the following in a terminal
ulimit -a
shows the resource limits (max processes, etc.).
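The limits seen in a login terminal are not necessarily the ones the workers run under, so it can be useful to run the same command on the workers themselves. This is a rough sketch, assuming a Linux cluster, an open (or auto-created) parallel pool, and that the shell MATLAB invokes is sh-compatible, so that ulimit is available.

```matlab
% Sketch: run 'ulimit -a' on every worker of the current pool (Linux only).
pool = gcp();                                        % reuse or start a pool
f = parfevalOnAll(pool, @system, 2, 'ulimit -a');    % 2 outputs: status, cmdout

% One cell per worker; cell output avoids concatenating texts of unequal length.
[status, out] = fetchOutputs(f, 'UniformOutput', false);

for k = 1:numel(out)
    fprintf('--- output %d (exit status %d) ---\n%s\n', k, status{k}, out{k});
end
```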
Muh Alam
14 Dec 2020
Muh Alam
3 Feb 2021
Kojiro Saito
3 Feb 2021
I don't think so. I think it is a usual script.
Are you able to check the SLURM log file?
Kojiro Saito
4 Feb 2021
I see. It was related to a memory error. As you mentioned, increasing the allocated memory, for example with "--mem-per-cpu=2G" in the sbatch options, should solve it.
Muh Alam
6 Feb 2021
Kojiro Saito
7 Feb 2021
A heterogeneous cluster could be a cause. This link is to the system requirements of Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
Muh Alam
8 Feb 2021