apple silicon and parfor efficiency
Michael Brown
on 4 Oct 2023
Answered: Mrutyunjaya Hiremath
on 24 Oct 2023
Macs with Apple silicon have fast unified memory that is closely tied to the CPU. The Ultra chip places two 12-core CPU chips together. One would think that parallel processing using parfor should show excellent scaling as the number of cores is increased. The drop in efficiency associated with "communication overhead" has been frequently discussed in this community. My question is how the M2 Ultra stacks up and whether there are differences in efficiency related to hardware. I found a test provided by MathWorks, doubled the size of the problem, and ran it with 1 to 24 cores:
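max_num_cores = 24;
tj = zeros(max_num_cores, 1);      % elapsed time for each worker limit
n = 200;                           % number of parfor iterations
A = 500;                           % size of each random test matrix
for j = 1:max_num_cores
    tic
    a = zeros(n, 1);               % n-by-1 vector (zeros(n) would allocate n-by-n)
    b = zeros(n, 1);
    parfor (i = 1:n, j)            % run on at most j workers
        a(i) = max(abs(eig(rand(A))));
        b(i) = max(abs(eig(rand(A))));
    end
    tj(j) = toc;
end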
I found that the time per core had increased by only 10% when going from one to 15 cores. In the panel on the left, the thick line scales the 1-core time by the number of cores allowed in the parfor loop. You can see that the data points (black circles) are almost on the perfect scaling line out to 15. Adding cores 16 to 24 did not improve the total time and reduced the efficiency per core significantly. The right-side panel shows the percentage difference between "perfect" scaling and the actual time.
I then ran one of my calculations that involves very large sparse arrays. The time for a single run through the parfor is minutes instead of the second in the previous example, and 80-90 GB of RAM (out of 128 GB) is involved in holding the various arrays. Here is the same plot:
Not surprisingly, the efficiency is not as high as in the first example. Also, adding cores beyond 12 did not speed up the problem further. With 12 cores the parfor loop is 50% slower than the extrapolated perfect scaling. The open circle for one worker is the time using a for loop rather than a parfor loop; here there was an upfront cost of 15 seconds just to make the change.
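The quantities behind those two panels can be computed from tj roughly as follows. This is a minimal sketch: the names t_ideal, speedup, efficiency, and pct_over_ideal are introduced here for illustration, and the timings are made-up placeholders, not measurements from the M2 Ultra.

```matlab
% PLACEHOLDER timings (not measured data); substitute tj from the benchmark loop.
tj = [120; 61; 41; 31.5];                 % wall-clock seconds for 1..4 workers
workers = (1:numel(tj)).';

t_ideal = tj(1) ./ workers;               % perfect scaling: 1-worker time / worker count
speedup = tj(1) ./ tj;                    % how many times faster than one worker
efficiency = speedup ./ workers;          % 1.0 would be perfect scaling
pct_over_ideal = 100 * (tj - t_ideal) ./ t_ideal;   % percent slower than perfect scaling

disp(table(workers, tj, t_ideal, speedup, efficiency, pct_over_ideal));
```

Plotting t_ideal as a line and tj as markers against workers would give something like the left panel, and pct_over_ideal against workers something like the right one.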
In both examples, I ran into the wall where more cores did not help. Questions:
- What is creating the wall?
- How different is this behavior relative to other hardware? Is Apple silicon "better" or "worse" than other hardware in such tests?
- Could the fact that there are two 12 core processors joined together be the cause of having a performance hit for pools larger than about 12?
- R2023b is the first native version for Apple silicon. Could there be further optimization of parfor execution for this hardware?
7 comments
Infinite_king
on 12 Oct 2023
Edited: Infinite_king
on 13 Oct 2023
"What is creating the wall"
As @Ben Tordoff pointed out, the M2 Ultra has 16 performance cores. So the noticeable drop in efficiency is probably due to the efficiency cores being used once all available performance cores are busy.
On top of this, the efficiency of a parallel program depends on a few more factors, such as:
- Communication overhead (as you pointed out).
- Memory usage pattern (maintenance of memory consistency, cache hits).
- The context of the processor (the number of other processes that are running on the CPU at the time of job submission).
In general you may notice a high drop in efficiency or deviation from ideal scaling even before reaching the maximum number of cores in that processing unit.
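One way to test the performance-core explanation is to cap the pool at the performance-core count and compare timings against a full 24-worker pool. The sketch below assumes Parallel Computing Toolbox is installed; nPerf is a hypothetical variable holding the 16-core figure quoted above. Capping the pool does not pin workers to performance cores, since scheduling is left to macOS, so this is only an indirect check.

```matlab
nPerf = 16;                             % assumption: performance-core count quoted above
c = parcluster;                         % default cluster profile
nWorkers = min(nPerf, c.NumWorkers);    % never request more workers than the profile allows

pool = gcp('nocreate');                 % current pool, or empty if none is running
if isempty(pool) || pool.NumWorkers ~= nWorkers
    delete(pool);                       % no-op when there is no pool
    pool = parpool(c, nWorkers);        % start a pool of exactly nWorkers workers
end
fprintf('Pool running with %d workers\n', pool.NumWorkers);
```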
"How different is this behavior relative to other hardware? Is the apple silicon "better" or "worse" than other hardware in such tests?"
On a 6-core AMD Ryzen 5 Pro, the drop in efficiency as the number of cores increases is not that significant.
With the 24-core AMD EPYC 74F3 processor (using only 8 cores in a VM), the behaviour is a little different.
If I get a chance, I will run this test on a 24-core machine and let you know the behaviour.
Hope this is helpful.
Accepted Answer
Mrutyunjaya Hiremath
on 24 Oct 2023
Your observations and analysis are very insightful, and you've touched upon several important points regarding parallel computing, especially in the context of modern heterogeneous computing architectures like Apple's M1 and M2 chipsets.
Let's address your questions:
1. What is creating the wall?
The "wall" or saturation point in parallel computing can arise due to several reasons:
- Memory Bandwidth: Even with fast unified memory, there's a limit to how much data can be accessed simultaneously. As more cores are added, contention for memory can become a bottleneck.
- Inter-Core Communication: Depending on the nature of the problem, cores might need to communicate results or synchronize. This overhead can grow with more cores.
- Thermal and Power Limits: As more cores are active, the chip can hit power or thermal limits, causing it to reduce the operating frequency of the cores, leading to reduced performance.
- Complexity of Task: If the task has a significant portion that cannot be parallelized (Amdahl's Law), then the benefit of adding more cores diminishes.
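To get a feel for the Amdahl's Law point, the sketch below inverts the formula S = 1/((1 - p) + p/N) using the figure from the question, where the sparse-array run on 12 workers was 50% slower than perfect scaling (a speedup of 12/1.5 = 8). It assumes, purely for illustration, that a serial fraction is the only source of the shortfall, which is unlikely to hold exactly; amdahl, p, and observedSpeedup are names introduced here.

```matlab
% Amdahl's Law: speedup on N workers when a fraction p of the work is parallelizable.
amdahl = @(p, N) 1 ./ ((1 - p) + p ./ N);

N = 12;                                    % workers in the sparse-array example
observedSpeedup = N / 1.5;                 % "50% slower than perfect scaling" -> 8x

% Invert the formula: 1/S = 1 - p*(1 - 1/N)
p = (1 - 1/observedSpeedup) / (1 - 1/N);   % implied parallel fraction

fprintf('Implied parallel fraction: %.3f\n', p);
fprintf('Predicted speedup at 24 workers: %.2f\n', amdahl(p, 24));
fprintf('Upper bound on speedup (N -> Inf): %.1f\n', 1 / (1 - p));
```

Under that assumption the implied parallel fraction is about 0.95, and the model predicts a gradual flattening rather than a hard stop at 12 workers, which suggests the observed wall has other contributors as well.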
2. How different is this behavior relative to other hardware? Is the Apple silicon "better" or "worse" than other hardware in such tests?
Apple Silicon, especially the M-series chips, is designed for efficiency and offers impressive performance per watt. However, its behavior compared to other hardware (e.g., Intel, AMD, NVIDIA for GPU computations) will depend on the specific task and the nature of the parallelism. In some tasks, Apple Silicon might excel due to its unified memory and high-efficiency cores, while in others, traditional high-performance cores or GPUs might have an advantage. Direct comparisons require benchmarking on a case-by-case basis.
3. Could the fact that there are two 12 core processors joined together be the cause of having a performance hit for pools larger than about 12?
Yes, it could be a factor. When cores are on separate chips or chiplets, the communication between them can be slower than communication within the same chip. This can introduce overheads and diminish the benefits of adding more cores beyond a certain point.
4. 2023b is the first native version for Apple silicon. Could there be further optimization on parfor execution for this hardware?
Absolutely. As with any first release, there's potential for further optimizations as MathWorks gains more experience with Apple Silicon and as Apple releases more updates and tools for developers. The behavior and performance you observe in the initial release might improve in subsequent versions.
Suggestions:
- Use MATLAB's profiler to gain insights into where the time is being spent.
- For memory-intensive tasks, monitor memory usage and bandwidth. Tools like Apple's Instruments can provide detailed insights.
- For tasks that might involve inter-core or inter-chiplet communication, try partitioning the data or task differently to minimize communication overhead.
- Stay updated with MATLAB's releases and check release notes for performance improvements related to Apple Silicon.
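For the communication-overhead suggestion, one option is to measure how much data a parfor loop actually moves to and from the workers using ticBytes and tocBytes from Parallel Computing Toolbox. The sketch below reuses the eigenvalue test from the question with a smaller n so it runs quickly; for a real workload, the loop body would be replaced by the sparse-array calculation.

```matlab
n = 50;                        % fewer iterations than the original test, to keep this quick
A = 500;
a = zeros(n, 1);

pool = gcp;                    % start or reuse a pool (requires Parallel Computing Toolbox)
ticBytes(pool);                % start counting bytes transferred to/from workers
tic
parfor i = 1:n
    a(i) = max(abs(eig(rand(A))));
end
elapsed = toc;
bytes = tocBytes(pool);        % one row per worker: [sent to worker, received from worker]

fprintf('Elapsed: %.2f s\n', elapsed);
fprintf('Sent to workers: %.3f MB, received: %.3f MB\n', ...
    sum(bytes(:, 1)) / 2^20, sum(bytes(:, 2)) / 2^20);
```

If the bytes sent grow with the number of workers, as they would when large arrays are broadcast to every worker, that transfer is a plausible contributor to an upfront cost of the kind described in the question.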
Lastly, sharing your findings with MathWorks can be beneficial. They might provide specific insights or optimizations for your use case, or your feedback might help them improve future versions of MATLAB for Apple Silicon.