Parfor loop just hangs, CPU usage goes to zero

Hi all. Here is a sample code of what I am attempting to run.
parfor i = 1:num
    answer(:,i) = someFunction(someData(:,i));
end
Key information: "someFunction" is a C++ MEX file. "someData" comes from a memmapfile (memmapfilename.data), because the data is too large to be loaded onto each worker.
Oddly, the parfor loop just hangs, the CPU usage goes to zero, and when I CTRL+C, here is what I get:
Operation terminated by user during distcomp.remoteparfor/getCompleteIntervals (line 127)
In parallel_function>distributed_execution (line 820)
[tags, out] = P.getCompleteIntervals(chunkSize);
In parallel_function (line 587)
R = distributed_execution(...
This isn't an issue if I replace the "parfor" with a simple "for" - everything works fine. What seems to happen is that some of the workers become unresponsive. After the above issue is encountered, even running a simple command such as
pctRunOnAll 1+1
will return "2" on only some, but not all, workers.
Any help would be great. A fresh re-installation did not help. Validation for "parpool" passed.

6 comments

Benoit Pouyatos on 2 May 2017
Hi there, I have the same issue here using Matlab 2017a. One peculiar fact is that the exact same code never hangs with Matlab 2014a. Did somebody find a fix for this problem?
arvid Martens on 1 Aug 2017
Recently, I have been experiencing the same problem with parfor in Matlab 2017a. Has a solution to this problem been found?
Cole on 4 Jan 2018
I tried MATLAB 2014a, but the same error occurred.
John Tencer on 3 Apr 2018
I'm observing this behavior with 2017b.
David Saidman on 14 Jan 2020
Did anybody have any success with this? I'm having the exact same problem in 2017b, and there is definitely no keyboard statement in my code.
If I wait a bit, it ends up running, but on a single CPU (even though I have 18 workers in my pool on a CPU with 20 physical and 40 logical cores, and about 10 GB of spare memory according to Performance Monitor).
海粟 吴 on 30 May 2024
parallel.internal.parfor.ParforEngine/getCompleteIntervals
In parallel_function>distributed_execution (line 746)
[tags, out] = P.getCompleteIntervals(chunkSize);
In parallel_function (line 578)
R = distributed_execution(...
Same problem observed in 2024a. This problem has persisted for 7 years, and no solution has come out yet.


Answers (8)

Dave Behera on 24 Mar 2016

2 votes

It seems that there is a deadlock when the workers try to access the file through the same object (the one you got from memmapfile). Because of that, progress stalls with zero CPU usage and no abort message.
Can you try creating a separate memmapfile object within each parfor iteration and passing it to the someFunction function? This may make the file access thread-safe.
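A minimal sketch of what opening a separate map inside each iteration could look like. The file name, the matrix size, the field name x, and the doubling that stands in for someFunction are all assumptions made for illustration, and the workers are assumed to see the file at the same path. Whether this avoids the hang is not established here.

```matlab
% Hypothetical demo file and sizes; replace with the real mapped file.
fname = fullfile(tempdir, 'memmap_demo.bin');
nRows = 4;
nCols = 6;

% Write a small column-major matrix of doubles to disk.
fid = fopen(fname, 'w');
fwrite(fid, reshape(1:nRows*nCols, nRows, nCols), 'double');
fclose(fid);

answer = zeros(nRows, nCols);
parfor i = 1:nCols
    % Each iteration opens its own read-only map, so no memmapfile
    % object is shared between workers or sent from the client.
    m = memmapfile(fname, ...
        'Format', {'double', [nRows nCols], 'x'}, ...
        'Repeat', 1, ...
        'Writable', false);
    col = m.Data.x(:, i);       % read only the column this iteration needs
    answer(:, i) = 2 * col;     % stand-in for someFunction(col)
end
```

Only the file name and the sizes are broadcast to the workers in this sketch; the mapped data itself is read on the worker that needs it.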
Also, could you try the same workflow with spmd?

9 comments

JohnDapper on 1 Apr 2016
Edited: JohnDapper on 1 Apr 2016
Thanks for the answer. Making a memmapfile object inside each parfor iteration (say, mm{i}), and then passing mm{i}.Data to someFunction, did not help. It continues to hang.
Same problem with spmd.
Once again, no problem at all when I change parfor to for. You'd think MATLAB would have sorted this out before releasing memory mapping...
M L on 8 Nov 2016
I have the same problem-- did you ever find a fix?
Ryan on 7 Nov 2017
Edited: Ryan on 18 Nov 2017
Same problem for me too... any updates? I found my issue only occurs on my workstation that has 4 CPUs (6 cores per CPU for a total of 24 workers), but not on my computer that has only 1 CPU (6 cores, 6 workers total). Anyone having this issue using a dual or more CPU socket system?
I'm having exactly the same problem. Running the loop as a for loop works fine. But running it as a parfor loop hangs with zero CPU usage. Ctrl-C results in the same error message as in the original post.
I'm not loading any data from files.
Sanjay Manohar on 19 Dec 2017
Happened suddenly after I set a breakpoint within a subfunction.
Cole on 4 Jan 2018
I have just tried spmd, but it doesn't work. A similar problem still occurs.
Hakon Haugnes on 16 Sep 2019
I have the same problem on R2018a.
Arabarra on 11 Jan 2021
Same problem here. In my case it is not reproducible: sometimes it works, sometimes not. No deadlocks or anything suspicious in the code.
海粟 吴 on 30 May 2024
I agree with what has been discussed above; it happens in 2024a too. When will MATLAB solve this problem? Soon, or never?


arvid Martens on 9 Jan 2018

1 vote

I noticed that the problem started to occur after I updated the drivers of the GPUs that are being used during the calculations. Rolling back the drivers resolved the problem. However, new GPU hardware is on its way, as the current ones are pretty old. So I hope the problem is resolved by then.
Is there a way to throw an error when this stalling occurs? I could write an error handling to reduce the time lost by this stalling.
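One possible way to turn a stall into an error is to submit the work with parfeval instead of parfor and wait on the futures with a timeout. The sketch below is only an illustration under assumptions: work, num, timeoutSec, and the error identifier are hypothetical placeholders, and the timeout has to be chosen to fit the real workload.

```matlab
% Hypothetical stand-ins: replace with the real function and inputs.
work = @(x) x.^2;       % stand-in for the per-iteration computation
num = 4;                % number of independent tasks
timeoutSec = 60;        % assumed time budget for the whole batch

pool = gcp();           % current pool, or start the default one

% Preallocate the future array, then queue one task per input.
futures(1:num) = parallel.FevalFuture;
for i = 1:num
    futures(i) = parfeval(pool, work, 1, i);
end

% wait returns false if not all futures finished within the timeout.
finishedInTime = wait(futures, 'finished', timeoutSec);
if ~finishedInTime
    cancel(futures);    % stop anything still queued or running
    error('parforStall:timeout', ...
        'Workers did not finish within %d seconds.', timeoutSec);
end

% One output per future, concatenated vertically.
results = fetchOutputs(futures);
```

The error can then be caught with try/catch so the calling code can retry or restart the pool. If workers are truly unresponsive, cancelling the futures may not be enough and the pool may need to be deleted and reopened.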
Andrea Stevanato on 13 Jul 2018

0 votes

I have the same error with MATLAB 2018a.
Sanjay Manohar on 3 Jun 2019
Edited: Sanjay Manohar on 3 Jun 2019

0 votes

I was having the same parfor problem, until I noticed I had a "keyboard" instruction in my code.
DeepSea on 15 Aug 2021
I was stuck on this problem for a couple of weeks, and fixed it by removing a "continue" inside an if statement within a for loop.
for CondA
    ...
    if CondB
        continue; % Avoid using "continue"
    end
    ...
end
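A small sketch of one way to restructure such a loop so it needs no continue: invert the condition and guard the remaining work instead. The data and the accumulation here are hypothetical and only show the control flow; why removing continue would affect the hang is not explained in this thread.

```matlab
values = [3 -1 4 -1 5];     % hypothetical data
total = 0;
for k = 1:numel(values)
    % Instead of: if values(k) < 0, continue; end
    % run the rest of the body only when the skip condition is false.
    if values(k) >= 0
        total = total + values(k);
    end
end
```

The behaviour is the same as the continue version: negative entries are skipped and the rest are processed.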

1 comment

海粟 吴 on 30 May 2024
My code has a similar structure: a while loop inside a for loop.


Aditya Shukla on 23 Oct 2021

0 votes

This problem suddenly started yesterday; before that, all the code ran nicely. I really do not know why this happened. It is so annoying. Has anyone found a solution?
Tianzong Wang on 27 Oct 2022

0 votes

Same here, most cores are not working. Any suggestions? And what is the JCEF?
海粟 吴 on 30 May 2024

0 votes

Exactly the same problem encountered in 2024a!!


Asked: 14 Mar 2016

Answered: 30 May 2024
