Speed of the abs function

24 visualizaciones (últimos 30 días)
Arthur LOPEZ
Arthur LOPEZ el 5 de Dic. de 2023
Editada: James Tursa el 6 de Dic. de 2023
Hello everybody!
I had a question about how matlab computed the absolute value function.
Indeed, i was working on a school project and had to optimize a code that returned the smallest euclidian distance from the origin of a 3-D cloud of points. Simple enough, my first thought was to use the basic :
d = x.^2 + y.^2 + z.^2;
dmin = sqrt(min(d));
But then, thinking that squaring was a expensive operation, especially on a data set that can be bilions of points big, i tried to implement a "pre-check". My idea was the following: the 1-norm will be way cheaper to compute and so i will be able to compute a first "estimate" of the smallest distance very quickly. I will then use this to only compute the 2-norm of the points that are "close enough" to the origin on a way smaller data set.
so i did this:
dEstimate = abs(x) + abs(y) + abs(z);
dMinEstimate = min(dEstimate);
threshold = dMinEstimate*sqrt(3); %the closest point in the 2-norm can be at most sqrt(3) times further than the closest in the 1-norm.
potentialMin = find(dEstimate < threshold);
d = x(potentialMin).^2 + y(potentialMin).^2 + y(potentialMin).^2
dmin = sqrt(min(d));
But to my surprise, when using the profile viewer to compare the codes, this is what i got:
As you can see, the results are averaged over 100 calls in order to reduce processing noise. Keep in mind that the data is one milion points here.
How can the square operation time be close or even faster than the absolute value?
I would assume that the absolute value is computed by setting to 0 the sign bit in the IEEE 754 double convention, which is a very simple operation while the squaring is a much more complex process.
Is matlab this optimised? How can this be?
Thanks in advance for your answer !
Arthur

Respuesta aceptada

James Tursa
James Tursa el 5 de Dic. de 2023
Editada: James Tursa el 6 de Dic. de 2023
Although I applaud your thinking about this problem, this level of optimization is hard to achieve with a language such as MATLAB because there is a lot going on in the background that you are not considering beyond just the abs( ) vs squaring comparison. E.g.,
d = x.^2 + y.^2 + z.^2; % passes the x, y, z arrays through the CPU. Allocates d same size as x, y, and z
dmin = sqrt(min(d)); % passes d through the CPU. Then sqrt( ) of a scalar
So, four potentially large arrays get passed through the CPU along with some ops and a potentially large new allocation.
Compared to this:
dEstimate = abs(x) + abs(y) + abs(z); % passes the x, y, z arrays through the CPU. Passes abs( ) of them through CPU also. Allocates dEstimate same size as x, y, and z
dMinEstimate = min(dEstimate); % min( ) passes dEstimate through the CPU
threshold = dMinEstimate*sqrt(3); % Scalar ops
potentialMin = find(dEstimate < threshold); % passes dEstimate through the CPU again, allocates potentialMin
d = x(potentialMin).^2 + y(potentialMin).^2 + y(potentialMin).^2; % New allocations for the subscripting, passing through CPU, etc.
dmin = sqrt(min(d)); % passes d through the CPU, then scalar op
So, lots more going on in the background here with how much data is being fed through the CPU, new memory allocations, etc. I don't know how one would necessarily expect this to be either slower or faster than the original given all that is going on. And depending on how large the arrays are I'm not sure what cache hit rate you might expect either. So results may or may not scale well.
If you were coding in C/C++ then maybe you could look at something like this and try to optimize cache hits and minimize large temporary array allocations etc. For this particular problem, in C/C++ all you need is a simple loop with no large temporary array allocations needed at all. But for a language like MATLAB this type of optimization is not worth it IMO.
  1 comentario
Walter Roberson
Walter Roberson el 5 de Dic. de 2023
If you manage to find tables of the throughput and latency of different instructions for various generations of Intel architectures, then you will see that at various times:
  • FABS has been the same speed as FMUL
  • FABS has been the same speed as FMUL but had less latency
  • FABS has been the same speed and latency as FMUL but can only be started every 2 cycles where FMUL could be started every cycle
  • FABS has been the same speed and latency as FMUL but can only be started every 4 cycles where FMUL could be started every 2 cycles
  • FABS has been 1 cycle, latency 1, startable every cycle, but FMUL was 1 cycle, latency 5, 2 startable every cycle
  • FABS has been 1 cycle, latency 1, but FMUL was 2 cycles
... and other combinations.
So, FABS is anywhere from "always faster than FMUL" to "same speed as FMUL but you can't start as many FABS in a row" to "faster than FMUL but you can do more FMUL at the same time"... depending on exact architecture.

Iniciar sesión para comentar.

Más respuestas (2)

Chunru
Chunru el 5 de Dic. de 2023
" squaring was a expensive operation"
This is ture for VERY OLD cpu or some special purpose CPUs. In modern general purpose CPU, such as Intel one, the difference among square/multiplication/abs/addition is no longer significant (subjective here). For example, a CPU can perform 4 ADD/cycle or 4 MUL/cycle (floating point). The abs operation may not be much simpler than square.
If you consider the FP addition, abs, multiplier (of similar speed), with the overhead such memory access, loop control, the two approaches you used above may have ver similar performance. (It is also noted additional overhead for testing and comparison).
So in modern general purpose CPU, it may not worth to do the "optimization" mannually. (Some compilers may do optimisation better).

Walter Roberson
Walter Roberson el 5 de Dic. de 2023
We have to watch out for measurement error due to Execution Engine processing
format long g
N = 10000;
x = randn(N, 1); y = randn(N,1); z = randn(N,1);
M = 10;
t1a = zeros(1,M);
t1b = zeros(1,M);
t2a = zeros(1,M);
t2b = zeros(1,M);
for K = 1 : M; start = tic; d = x.^2 + y.^2 + z.^2; t1a(K) = toc(start); end
for K = 1 : M; start = tic; d = abs(x) + abs(y) + abs(z); t2a(K) = toc(start); end
for K = 1 : M; start = tic; d = x.^2 + y.^2 + z.^2; t1b(K) = toc(start); end
for K = 1 : M; start = tic; d = abs(x) + abs(y) + abs(z); t2b(K) = toc(start); end
t1a
t1a = 1×10
1.0e+00 * 0.000268 0.000168 1.8e-05 1.5e-05 1.8e-05 1.5e-05 6e-06 5e-06 5e-06 5e-06
t1b
t1b = 1×10
1.0e+00 * 0.00019 1.7e-05 1.3e-05 1.3e-05 1.8e-05 1.4e-05 5e-06 5e-06 5e-06 5e-06
t2a
t2a = 1×10
1.0e+00 * 0.000346 4.2e-05 0.000113 2.6e-05 9e-06 7e-06 7e-06 6e-06 6e-06 6e-06
t2b
t2b = 1×10
1.0e+00 * 0.000192 3.1e-05 2.5e-05 4.3e-05 8e-06 7e-06 6e-06 6e-06 6e-06 6e-06
plot([t1a; t1b; t2a; t2b].')
legend({'squares#1', 'squares#2', 'abs#1', 'abs#2'})
plot([t1b; t2b].')
legend({'squares#2', 'abs#2'})
For reasons currently unknown, the abs() version consistently is slower at execution 3 or 4 compared to 2, but by the end of the run the timings for the two versions are virtually identical.

Categorías

Más información sobre Simulink en Help Center y File Exchange.

Productos


Versión

R2021a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by