MATLAB Answers

Parallel optimization hanging on getCompleteIntervals

Samuel Nathan on 30 Mar 2020
Commented: Samuel Nathan on 1 Apr 2020
I'm using a Cloud Center cluster with parpool, and the optimization runs until it suddenly hangs. The code does not always hang, but it does about 9 times out of 10. I suspected a deadlock, but I have made sure each worker has the files it requires. After it hangs I can exit with Ctrl-C, but I have to restart the server to get the optimization running again; otherwise it hangs waiting for the pool to be ready.
Init code:
c = parpool('AttachedFiles',{'OptimiseModel.m','decreasing_amplitude_01.mat','ArmModelV2.slx','MapData.m','sim_model_test.m','slprj'});
mpiSettings('DeadlockDetection','on')
mpiSettings('MessageLogging','on')
mpiSettings('MessageLoggingDestination','CommandWindow')
My objective function, OptimiseModel, runs a Simulink model with values passed from the particle swarm algorithm:
if init == true
simIn = MapData;
init = false;
end
simOut = sim(simIn);
RMSE = simOut.get('rmse');
Each worker has its own copy of simIn, and the init flag is a hack to allow the function to be evaluated by the client instance, which happens once at the beginning of the particleswarm algorithm (I don't know why).
spmd
model = load_system('ArmModelV2');
set_param(model, 'SimulationCommand', 'stop')
set_param(model,'FastRestart','on');
set_param(model,'SimulationMode','Accelerator');
set_param(model,'AccelVerboseBuild','on')
simIn = MapData();
end
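As an aside on the init flag described above, here is a minimal sketch (not the original code) of asking MATLAB directly whether the current code is running on a pool worker. It relies on getCurrentTask returning an empty array on the client; the name isOnWorker is hypothetical.

```matlab
% Hypothetical helper: true on a parallel worker, false on the client.
% getCurrentTask returns an empty array when called on the client.
isOnWorker = @() ~isempty(getCurrentTask());

if isOnWorker()
    disp('Running on a pool worker');
else
    disp('Running on the client');
end
```

A check like this could, in principle, stand in for a hand-maintained flag when the client needs different setup from the workers, though whether it suits this objective function is an assumption.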
~~~~~~~~~~~~
fun = @(x)OptimiseModel(init,MCV_B,x(1),x(2),x(3),x(4),x(5),VMO_B,x(6),x(7),x(8),MCV_T,x(9),x(10),x(11), ...
x(12),x(13),VMO_T,x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23),x(24),x(25),x(26),x(27));
options = optimoptions('particleswarm','UseParallel',true,'UseVectorized',false,'PlotFcn','pswplotbestf');
[x,rmse_best] = particleswarm(fun,27,lb,ub,options);
All looks good until, out of nowhere, the workers stop running the objective function and the code hangs here, in part of the source for remoteparfor:
while isempty(r)
assert(obj.NumIntervalsInController > 0, ...
'Internal error in PARFOR - no intervals to retrieve.');
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
Why? Can anybody help me? I can provide more of the code if required (I didn't include it all, as I thought most of it was irrelevant). Any suggestions for further debugging strategies would also be great.
Thanks a lot!
EDIT: The code works in serial.

  1 Comment

Samuel Nathan on 1 Apr 2020
Further investigation shows that a number of workers are crashing, even after modifying the particleswarm parfor loop following the instructions at https://uk.mathworks.com/help/simulink/ug/not-recommended-using-sim-function-within-parfor.html#brsk7nj. I am now looking for a way to restart workers, or to cancel and restart jobs on workers.
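A rough sketch of one way to restart after worker loss, under stated assumptions: nExpected is a hypothetical pool size, and how a pool reports crashed workers (shutting down entirely versus continuing with fewer workers) depends on the MATLAB release and cluster configuration. The check below covers both cases by comparing against the expected size.

```matlab
% Sketch: detect a missing or shrunken pool and start a fresh one.
nExpected = 2;                      % hypothetical pool size; adjust to the cluster
pool = gcp('nocreate');             % current pool, or empty if none is running
if isempty(pool) || pool.NumWorkers < nExpected
    delete(pool);                   % shut down whatever is left of the pool
    pool = parpool(nExpected);      % start a fresh pool on the default profile
end
```

Any per-worker setup done earlier in spmd (loading the model, building simIn) would need to be repeated on the new pool, since the fresh workers start with an empty workspace.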


Answers (1)

Edric Ellis on 31 Mar 2020
A few notes:
  1. The deadlock detection is for labSend and labReceive. Your parallel code is using parfor. There is no way that parfor can encounter a cyclic deadlock because the workers operating on the body of the loop do not communicate with each other (except possibly via the file system). (When writing labSend and labReceive code inside spmd, you can write a cyclic deadlock, and that's what the deadlock detection setting can help you discover).
  2. Your mpiSettings calls should be run on the workers - i.e. inside an spmd block. (But see point (1) - I don't think they're relevant here)
  3. The method getCompleteIntervals is a completely normal part of parfor operation - this is where the client waits for the workers to return their results. The only thing that you can deduce from the client waiting at that point is that the workers haven't finished their parfor loop iterations yet.
  4. I am suspicious of your use of accelerated simulation mode. I'm not an expert, but I think that this might cause the workers to interfere with one another via the file system.
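For point (2), a minimal configuration sketch that moves the mpiSettings calls from the question into an spmd block so that they run on the workers. It assumes a parallel pool is already open; as point (1) says, these settings are unlikely to matter for a parfor hang.

```matlab
spmd
    % These settings take effect on the worker that runs them,
    % so they are issued inside spmd rather than on the client.
    mpiSettings('DeadlockDetection', 'on');
    mpiSettings('MessageLogging', 'on');
    mpiSettings('MessageLoggingDestination', 'CommandWindow');
end
```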
Here's what I would try: try running with a parallel pool of size 1. If that fixes things, then perhaps the workers are interfering with one another via the file system.
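A minimal sketch of the single-worker experiment, assuming the default cluster profile. The AttachedFiles list from the question is left out for brevity and can be passed as extra name-value arguments to parpool.

```matlab
% Close any existing pool, then open a pool with exactly one worker.
delete(gcp('nocreate'));
pool = parpool(1);
disp(pool.NumWorkers);   % confirm how many workers the pool has
```

With one worker there is nothing for that worker to collide with on the file system, so a run that stops hanging here would point toward interference between workers.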
You could force the workers to temporarily change to a unique directory prior to running the simulations by doing something like this:
% force the workers into a unique directory
spmd
myTempDir = tempname(); % tempname returns a globally unique name
oldWd = pwd();
mkdir(myTempDir);
cd(myTempDir);
end
% ... run stuff in parfor
particleswarm();
% Put the workers back into the original working directory
spmd
cd(oldWd);
end
But that's a complete stab in the dark without having reproduction steps that I can try out.

  2 Comments

Samuel Nathan on 31 Mar 2020
I am trying a pool of one, but it will take a very long time to determine conclusively whether this solves the issue. After running a few more times, it does occasionally work. Does this mean that it is most likely a file system deadlock?
EDIT: Making the workers change directory did not solve the problem, unfortunately.
