Apple silicon and parfor efficiency
Macs with Apple silicon have fast unified memory that is closely tied to the CPU. The Ultra chip joins two 12-CPU-core dies together. One would think that parallel processing using parfor should show excellent scaling as the number of cores is increased. The drop in efficiency associated with "communication overhead" has been discussed frequently in this community. My question is how the M2 Ultra stacks up and whether there are differences in efficiency related to hardware. I found a test provided by MathWorks, doubled the size of the problem, and ran it with 1 to 24 cores:
max_num_cores = 24;
tj = zeros(max_num_cores,1);        % elapsed time for each worker count
for j = 1:max_num_cores
    tic
    n = 200;                        % number of parfor iterations
    A = 500;                        % size of each random matrix
    a = zeros(n,1);                 % preallocate result vectors
    b = zeros(n,1);
    parfor (i = 1:n, j)             % cap the loop at j workers
        a(i) = max(abs(eig(rand(A))));
        b(i) = max(abs(eig(rand(A))));
    end
    tj(j) = toc;
end
I found that the time per core increased only about 10% when going from 1 to 15 cores. In the left panel, the thick line scales the 1-core time by the number of cores allowed in the parfor loop. You can see that the data points (black circles) lie almost on the perfect-scaling line out to 15 cores. Adding cores 16 to 24 did not improve the total time and reduced the per-core efficiency significantly. The right panel shows the percentage difference between "perfect" scaling and the actual time.

I then ran one of my own calculations, which involves very large sparse arrays. The time for a single pass through the parfor loop is minutes instead of the roughly one second of the previous example, and 80-90 GB of RAM (out of 128 GB) is used to hold the various arrays. Here is the same plot:

Unsurprisingly, the efficiency is not as high as in the first example. And again: adding cores beyond 12 did not speed up the problem further. With 12 cores the parfor loop is 50% slower than extrapolated perfect scaling. The open circle for one worker is the time using a for rather than parfor loop; switching to parfor carried an upfront cost of 15 seconds.
In both examples, I ran into the wall where more cores did not help. Questions:
- What is creating the wall?
- How different is this behavior relative to other hardware? Is Apple silicon "better" or "worse" than other hardware in such tests?
- Could the fact that two 12-core processors are joined together be the cause of the performance hit for pools larger than about 12?
- R2023b is the first release that runs natively on Apple silicon. Could parfor execution be further optimized for this hardware?
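Regarding the "communication overhead" question above: the Parallel Computing Toolbox functions ticBytes and tocBytes report the data transferred between the client and the workers around a parfor loop, which makes the overhead measurable rather than inferred. A minimal sketch, reusing the problem sizes from the test above:

```matlab
% Measure client/worker data transfer during a parfor loop.
% Requires Parallel Computing Toolbox; uses (or starts) the current pool.
p = gcp;
ticBytes(p);
n = 200; A = 500;
a = zeros(n,1);
parfor i = 1:n
    a(i) = max(abs(eig(rand(A))));
end
tocBytes(p)   % prints bytes sent to / received from each worker
```

If the sparse-array calculation transfers tens of gigabytes per pass, that alone could explain much of its gap from perfect scaling.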
7 comments
Ben Tordoff
5 Oct 2023
Hi Michael, thanks for posting such a detailed question. I don't know all the answers, but one thing I do know is that the M2 Ultra has 16 performance cores and 8 efficiency cores. That means that once you use more than 16 workers, you are either competing with yourself for the performance cores or one worker is running on the much slower efficiency cores. I would therefore expect no advantage to going beyond 16 workers. You can see this quite clearly in your first pair of plots. Hopefully others can speak to your remaining observations.
Walter Roberson
5 Oct 2023
For whatever it is worth:
On a 3.6 GHz 10-core Intel Core i9 iMac, if you plot(gradient(gradient(tj))) it is clear that performance gains dive substantially after 5 cores are used, but performance does still improve (a little) up to 10 cores.
Mike Croucher
5 Oct 2023
What kind of parallel pool are you using? Threads or processes?
Mike Croucher
5 Oct 2023
Also, is it possible to get more of an idea on what the sparse matrix benchmark you're using looks like?
Agree with @Ben Tordoff on the performance cores vs efficiency cores thing. This type of architecture is becoming more common -- on Intel and Apple silicon (and maybe more -- not sure off the top of my head). At the moment, I don't understand what performance I should expect from these things at all -- or how to code for them. I'm not even sure how work gets allocated to the processors. Forget MATLAB for a second... I'm thinking about simple OpenMP. I ask for 2 threads. Will I get 2 performance cores, 2 efficiency cores, 1 of each? Does the OS make the decision? Does the programmer?
I have so. many. questions! I guess I've got some reading to do....
Walter Roberson
5 Oct 2023
If you were using parpool('threads'), then the parfor would give an error about not being able to control the number of workers in a thread-based environment.
If you were using backgroundPool(), then parfor() would go ahead and open a normal pool, because the background pool is not being passed to parfor().
We can therefore deduce that the user is using a normal process-based parpool.
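For completeness, the pool type can be chosen and inspected explicitly rather than deduced; a sketch (the worker count of 12 is illustrative, and on releases before R2022b the process profile is named 'local' rather than 'Processes'):

```matlab
% Start an explicit process-based pool and inspect what we got.
delete(gcp('nocreate'))          % close any existing pool first
p = parpool('Processes', 12);    % process-based workers ('local' pre-R2022b)
disp(class(p))                   % reports the pool type
disp(p.NumWorkers)               % confirms the worker count
```

Starting the pool once up front also keeps the pool-startup cost out of the tic/toc timings in the benchmark loop.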
Michael Brown
6 Oct 2023
Infinite_king
12 Oct 2023
Edited: Infinite_king, 13 Oct 2023
"What is creating the wall"
As @Ben Tordoff pointed out, the M2 Ultra has 16 performance cores. So the noticeable drop in efficiency is probably due to work landing on the efficiency cores once all available performance cores are exhausted.
On top of this, the efficiency of a parallel program depends on a few more factors, such as:
- Communication overhead (as you pointed out).
- Memory access patterns (maintaining memory consistency, cache hits).
- Load on the processor (the number of other processes running on the CPU at the time of job submission).
In general, you may notice a significant drop in efficiency, or deviation from ideal scaling, even before reaching the maximum number of cores in the processor.
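That deviation can be put on a single curve from the timing vector tj produced by the original test. A minimal sketch (assumes tj has already been filled in by the benchmark loop above):

```matlab
% Compute speedup and parallel efficiency relative to the 1-worker time.
% tj(j) is the elapsed time when the parfor loop was capped at j workers.
workers    = (1:numel(tj))';
speedup    = tj(1) ./ tj(:);       % perfect scaling would give speedup == workers
efficiency = speedup ./ workers;   % 1.0 means perfect scaling
plot(workers, efficiency, 'o-')
xlabel('Number of workers')
ylabel('Parallel efficiency')
```

Plotted this way, the "wall" shows up as the worker count where the efficiency curve breaks sharply downward.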
"How different is this behavior relative to other hardware? Is the apple silicon "better" or "worse" that other hardware in such tests?"
On an AMD Ryzen 5 Pro with 6 cores, we can see the following behaviour. With this 6-core machine, the drop in efficiency as cores are added is not that significant.


With an AMD EPYC 74F3 24-core processor (using only 8 cores in a VM), we can see that the behaviour is a little different.


If I get a chance, I will run this test on a 24-core machine and let you know the behaviour.
Hope this is helpful.
Accepted Answer
More Answers (0)