Parfor HPC Cluster - How to Assign Objects to Same Core Consistently?

15 views (last 30 days)
Hello,
TL;DR: Is there a way to force MATLAB to consistently assign a classdef object to the same core, when a parfor loop runs inside another loop?
Details:
I'm working on a fairly complex/large scale project which involves a large number of classdef objects & a 3D simulation. I'm running on an HPC cluster using the Slurm scheduler.
The 3D simulation has to run in a serial triple loop (at least for now; that's not the bottleneck).
The bottleneck is the array of objects, each of which stores its own state & calls ode15s once per iteration. These are all independent so I want to run this part in a parfor loop, and this step takes much longer than the triple loop right now.
I'm running on a small test chunk within the 3D space, with about 1200 independent objects. Ultimately this will need to scale about 100x to 150,000 objects, so I need to make this as efficient as possible.
It looks like MATLAB is smartly assigning the same object to the same core for the first ~704 objects, but after that it toggles randomly between 2 cores & a few others.
The plot shows ~20 loop iterations (going downward), with the ~1200 class objects on the x-axis; the colors represent the core/task assignment on each iteration. The assignment matrix was built inside the parfor loop with:
task = getCurrentTask();
coreID(ti, ci) = task.ID;
A second plot was created after constructing the objects in a parfor loop, but that didn't help.
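For reference, a self-contained sketch of that worker-ID logging pattern might look like this (the loop sizes and the imagesc call are purely illustrative):
n_timesteps_sample = 20;   % number of outer iterations shown in the plot (assumed)
n_objects = 1200;
coreID = zeros(n_timesteps_sample, n_objects);
for ti = 1:n_timesteps_sample
    parfor ci = 1:n_objects
        task = getCurrentTask();     % worker/task running this iteration (empty if no pool)
        coreID(ti, ci) = task.ID;    % sliced assignment: row = iteration, column = object
    end
end
imagesc(coreID); xlabel('object index'); ylabel('iteration'); % visualize worker assignment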
The basic structure of the code is this:
% pseudocode:
n_objects = 1200; % this needs to scale up to ~150,000 (so ~100x)
for i = 1:n_objects
    object_array(i) = constructor();
    % also tried doing this as parfor, but it didn't help
end
% ... other setup code ...
% Big Loop:
dt = 1; % seconds
n_timesteps = 10000;
for i = 1:n_timesteps
    % unavoidable 3D triple-loop update
    update3D(dt);
    parfor j = 1:n_objects
        % each object depends on 1 scalar from the 3D matrix
        object_array(j).update_ODEs(dt); % each object calls ode15s independently
    end
    % update the 3D matrix with 1 scalar from each ODE object
end
I've tried adding more RAM per core, but for some reason the assignment still seems to break after the ~704th object, which is interesting.
And doing the object initialization/constructors inside a parfor loop made the initial core assignments less consistent (top row of plot).
Anyway, thank you for your help & please let me know if you have any ideas!
I'm also curious if there's a way to make the "Big Loop" the parfor loop, and make a "serial critical section" or something for the 3D part? Or some other hack like that?
Thank you!
ETA 7/28/25: Updated the pseudocode to show dt & the scalar values passed between the 3D simulation & the ODE objects.
3 Comments
Douglas Brantner on 28 Jul 2025
Edited: Douglas Brantner on 28 Jul 2025
Thanks for your help - I'm not sure if this answers your question, but I need the external "big loop" because I'm taking 1 time step in the 3D simulation, then 1 time step in the ODE simulation, and I need to iterate over the two because they influence each other.
There are likely several thousand timesteps, so wouldn't creating & destroying the objects on each iteration add a lot of overhead?
I suppose the object might be overkill (there's a lot of support code, plotting, analysis, etc. that isn't specifically needed for the simulation itself)... but I need the state vector for each ODE instance to be preserved across the "Big Loop" and updated on each step, for each object (of which there will be ~150,000, and possibly even more at full scale).
I could make a matrix of state vectors & just call the ODE solver on each row/column of the matrix in parallel (roughly as sketched at the end of this comment)... but is there a way to force assignment to a specific core's local RAM/memory so the ODE solvers will run nicely in parallel without passing lots of data back & forth?
There's only 1 scalar number (per voxel/instance) that actually needs to be passed back & forth from the 3D simulation to the ODE instances.
Thanks!
PS - I updated the pseudocode a bit w/ the scalar interaction to clarify.
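For concreteness, a minimal sketch of that state-matrix idea might look like the following (dt and n_objects are as in the pseudocode above; ode_rhs, n_states, and which scalar is passed back to the 3D simulation are illustrative assumptions, not from the original code):
n_states = 10;                          % length of each ODE state vector (assumed)
states = zeros(n_objects, n_states);    % row k = current state of object k
coupling = zeros(n_objects, 1);         % the one scalar passed in from the 3D sim
out_scalar = zeros(n_objects, 1);       % the one scalar passed back to the 3D sim
parfor k = 1:n_objects
    c = coupling(k);                                % per-object scalar from the 3D sim
    odefun = @(t, y) ode_rhs(t, y, c);              % parameterize the (hypothetical) RHS
    [~, Y] = ode15s(odefun, [0 dt], states(k, :));  % advance this object by one timestep
    states(k, :) = Y(end, :);                       % keep only the final state
    out_scalar(k) = Y(end, 1);                      % e.g. first state variable goes back to the 3D sim
end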
Douglas Brantner on 28 Jul 2025
I also tried breaking the parfor loop into blocks, where (for example) each block has size 32 if there are 32 workers.
This made the striping very nice in the object/core assignment graph, but it was *significantly* slower than one big parfor.
n_workers = 32;
n_blocks = n_objects / n_workers;
for t = 1:n_timesteps % big loop over timesteps
    % 3D simulation (triple loop)
    for i = 1:n_blocks
        parfor j = 1:n_workers
            % operate on object (i-1)*n_workers + j, i.e. one block of objects
            % at a time, to try to force each object onto the same core
            % on every timestep
        end
    end
end


Answers (1)

Edric Ellis on 29 Jul 2025
I think this might be a case for spmd. With spmd, you can ensure you construct the objects on particular workers, and only ever operate on them there. The following code assumes you can divide the number of objects evenly (if you can't, you'll need to do a bit more bookkeeping).
spmd
    % Construct objects directly on the workers
    n_per_worker = n_objects / spmdSize;
    for i = 1:n_per_worker
        object_array(i) = constructor();
    end
    % Big loop
    dt = 1; % seconds
    n_timesteps = 10000;
    for i = 1:n_timesteps
        update3D(dt); % Not sure what this needs to modify...
        for j = 1:n_per_worker
            object_array(j).update_ODEs(dt);
            % Extract the scalar from each object
            scalar_per_obj(j) = object_array(j).get_scalar();
        end
        % Get all the scalars across all workers
        all_scalars = spmdCat(scalar_per_obj);
        % Do something with all_scalars...
    end
end
In this sketch, each worker constructs a vector of objects, and then operates on them independently. The spmdCat is an example showing how all workers can get all the scalar values, which I'm assuming they need to proceed to the next timestep. If you wish, you could have that piece run on only one worker by doing something more like this:
% call spmdCat, with result only on worker 1
dim = 1; % concatenation dimension
destination = 1;
all_scalars = spmdCat(scalar_per_obj, dim, destination);
if spmdIndex == destination
    result = sum(all_scalars.^2);
    % send result to all workers
    spmdBroadcast(destination, result);
else
    % Get result from "destination"
    result = spmdBroadcast(destination);
end
1 Comment
Douglas Brantner on 30 Jul 2025
Edited: Douglas Brantner on 30 Jul 2025
Thank you! I just started reading about spmd and I think you might be right.
I'm also looking at "codistributed" arrays, which it seems let you split the data among many workers & "pin" it there. I might wind up abandoning the class & just making a large codistributed matrix where each row or column is the state vector for 1 ODE solver.
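For illustration, a minimal sketch of that codistributed layout might look like this (ode_rhs and the sizes are illustrative assumptions, not from the original code; each worker keeps its own block of rows for the whole run):
n_objects = 1200;
n_states = 10;       % length of each ODE state vector (assumed)
n_timesteps = 10000;
dt = 1;              % seconds
spmd
    % Distribute the state matrix by rows so each worker "owns" a fixed block of objects
    codist = codistributor1d(1); % distribute along dimension 1 (rows)
    states = codistributed.zeros(n_objects, n_states, codist);
    for t = 1:n_timesteps
        local_states = getLocalPart(states); % this worker's rows only
        for k = 1:size(local_states, 1)
            [~, Y] = ode15s(@ode_rhs, [0 dt], local_states(k, :)); % hypothetical RHS
            local_states(k, :) = Y(end, :);
        end
        % Reassemble the distributed matrix from the updated local parts
        states = codistributed.build(local_states, getCodistributor(states));
    end
end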
Are there any memory optimizations that can be made for repeated calling of ode15s? Like 'persistent' variables for internal data inside the ODE function? (There are a lot of intermediate values & sub-equations within the ODE system, so any way to avoid re-allocating that on each call on each object would help...)
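One pattern that may help here (an assumption on my part, not something specific to ode15s): precompute any constant coefficients once per call and let a nested right-hand-side function capture them, so they are not rebuilt on every solver evaluation. A rough sketch, where advance_one_step, params, and the RHS itself are hypothetical placeholders:
function states_out = advance_one_step(states_in, dt, params)
    % Anything constant is set up once here, outside the RHS.
    A = params.A; % e.g. a constant coefficient matrix (hypothetical)
    [~, Y] = ode15s(@rhs, [0 dt], states_in);
    states_out = Y(end, :);
    function dydt = rhs(~, y)
        % The nested function sees A (and any other precomputed data)
        % without rebuilding it on each solver call.
        dydt = A * y; % placeholder right-hand side
    end
end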


Categories

More about Parallel for-Loops (parfor) in Help Center and File Exchange.

Version

R2024a
