Effectiveness of implicit and explicit parallelization of fft2
Matt
on 17 Mar 2024
Commented: Damian Pietrus on 21 Mar 2024
Hi,
I have many images on which I want to perform FFTs, and I want to accelerate this image processing with multithreading. I am wondering why explicit parallelization with parfor, as shown in the following example, is not much more efficient than the implicit multithreading of the MATLAB function fft2:
clear
%% parameters & variable declaration
N_slice = 40;
N_img = 80;
a = rand(1024,1024,N_slice);      % 3-D stack multiplied with every image spectrum
img = cell(1,N_img);
out = cell(1,N_img);
for ii = 1:N_img
    img{ii} = rand(1024,1024);    % synthetic test images
end
%% multithreading inside the loop
maxNumCompThreads(4);
% 39 s with 4 threads
% 88 s with 1 thread
% parallelization efficiency: 88/(39*4) = 56%
tic
for ii = 1:N_img
    out{ii} = sum(ifft2(a.*fft2(img{ii})),'all');
end
toc
%% parfor version
% 33 s with parpool(4)
% parallelization efficiency: 88/(33*4) = 67%
maxNumCompThreads(1);  % avoid the relative inefficiency of fft2's own multithreading
pool = parpool('Processes',4,'SpmdEnabled',false);
b = parallel.pool.Constant(a);    % copy the stack to each worker once, not per iteration
tic
parfor ii = 1:N_img
    % out{ii} = sum(ifft2(b.Value.*fft2(img{ii})),'all');
    out{ii} = fun(b.Value,img{ii});
end
toc
function out = fun(x,y)
    % I read somewhere that parpool might be faster calling a function
    % rather than a script body - it changed nothing here.
    out = sum(ifft2(x.*fft2(y)),'all');
end
fft2 supports multithreading by default, and on my PC with 4 cores I get a 56% multithreading efficiency (defined as T1/(N*T2), where T1 is the time to do a task with one thread and T2 the time with N threads).
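For concreteness, here are the two efficiencies computed directly from the timings quoted in the code comments above (a quick sketch, using only those numbers):
% Timings from the comments above (seconds)
T1 = 88;   % single thread, plain for-loop
T4 = 39;   % 4 threads, implicit fft2 multithreading
Tp = 33;   % 4 process workers, parfor
N = 4;
eff_implicit = T1/(N*T4)   % ~0.56
eff_parfor   = T1/(N*Tp)   % ~0.67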
I was expecting a larger multithreading efficiency from creating a pool of 4 workers and making them work on my images in parallel. Because there is so little data transfer and so much computation time, I was expecting a multithreading efficiency close to one. But I only get around 67%, as you can see in my example. Why is that?
Matthieu
0 comments
Accepted Answer
Damian Pietrus
on 19 Mar 2024
Hello Matt,
As mentioned already, there could be a few reasons for this. When you start a pool of 4 workers with the "Processes" profile, you start 4 separate MATLAB workers, each with its own process and memory space. parfor carries some overhead: MATLAB not only has to transfer data to each worker, it also has to break the problem up in the first place. fft2's built-in multithreading is already well optimized and avoids some of this overhead.
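One way to see that overhead directly is to measure how much data actually moves between the client and the workers with ticBytes/tocBytes (a minimal sketch reusing a, img, N_img, out, and the process pool from your example):
pool = gcp;                        % the running process pool
b = parallel.pool.Constant(a);
ticBytes(pool)
parfor ii = 1:N_img
    out{ii} = sum(ifft2(b.Value.*fft2(img{ii})),'all');
end
tocBytes(pool)                     % bytes sent to / received from each worker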
One thing I'd recommend as a comparison point is to use the "Threads" cluster profile instead of "Processes". Here, parallel workers run as threads within the main MATLAB process and share memory, so I'm curious to see if/how that impacts your overall results.
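Something like this, assuming R2022b or newer (earlier releases don't accept a size argument for thread pools); only the pool type changes relative to your example:
delete(gcp('nocreate'));           % close any existing pool
pool = parpool('Threads',4);       % thread-based workers share memory with the client
b = parallel.pool.Constant(a);     % essentially free here: no serialization to processes
tic
parfor ii = 1:N_img
    out{ii} = sum(ifft2(b.Value.*fft2(img{ii})),'all');
end
toc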
2 comments
Damian Pietrus
on 21 Mar 2024
Based on the timing of your calculations, I think the difference in efficiency comes down to the overhead of sharing data between workers. Threaded workers can directly access the same memory space, while process workers require a bit more overhead to get the right data copied to the right process.
More Answers (1)
Pratyush
on 18 Mar 2024
Hi Matt,
The discrepancy you observe in multithreading efficiency when using MATLAB's "parfor" for FFT operations on images, around 67% instead of the expected near-perfect efficiency, can be attributed to several factors:
- Setting up parallel pools, transferring data, and scheduling tasks introduce overhead that reduces overall efficiency.
- MATLAB's "fft2" is already optimized for multithreading, making explicit parallelization less beneficial.
- Due to Amdahl's Law, as more workers are added, the speedup tends to plateau because of the non-parallelizable portions of the task and growing overhead (see the worked sketch after this list).
- Competition for CPU time and memory bandwidth among workers and the main MATLAB process can hinder performance.
- While "parfor" is efficient for many tasks, its overhead can impact tasks that are already optimized, like FFT operations.
The gist is that achieving near-linear scaling in parallel computing is challenging, especially for operations that are already well optimized. The ~67% efficiency you observe is relatively good, considering these factors. Improving parallel efficiency might involve minimizing data transfer, experimenting with the number of workers (see the sketch below), and ensuring optimal hardware and MATLAB configuration.
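For that last suggestion, one simple way to experiment with the number of workers is to time the same loop at several pool sizes (a minimal sketch reusing a, img, N_img, out, and fun from the question):
for nw = [1 2 4]
    delete(gcp('nocreate'));                 % close the previous pool, if any
    parpool('Processes',nw);
    b = parallel.pool.Constant(a);
    tic
    parfor ii = 1:N_img
        out{ii} = fun(b.Value,img{ii});
    end
    fprintf('%d workers: %.1f s\n',nw,toc);
end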