Does vectorization preserve the scaling of runtime?
Suppose I have a calculation in which several for loops are nested inside each other. For example:
for n = 1:N; for j = 1:J; for m = 1:M*J;
In this case the runtime scales linearly with N and M, but grows quadratically with J, because the innermost loop runs M*J times for each of the J iterations of the middle loop. If we can vectorize the calculation somehow, vectorization can speed it up. But does the scaling remain the same?
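For concreteness, a minimal sketch of such a loop nest and a vectorized counterpart (the sizes and the sin() body are arbitrary placeholders):
N = 10; J = 50; M = 20;   % placeholder sizes
A = zeros(N, J);
for n = 1:N
  for j = 1:J
    for m = 1:M*J   % inner loop runs M*J times => N*J*(M*J) iterations in total
      A(n, j) = A(n, j) + sin(n*j*m);
    end
  end
end
% Vectorized: the loops vanish from the code, but the same N*M*J^2 elementary
% operations are implied, plus three temporary N-by-J-by-(M*J) arrays:
[nn, jj, mm] = ndgrid(1:N, 1:J, 1:M*J);
B = sum(sin(nn .* jj .* mm), 3);
max(abs(A(:) - B(:)))   % ~0 up to rounding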
0 comments
Accepted Answer
Jan on 11 May 2017
Edited: Jan on 11 May 2017
Even with vectorization the runtime will depend on the number of calculations. Sometimes vectorized code requires large temporary arrays, which do not fit into the processor cache or the RAM. Then the slower RAM or the even slower hard disk is used, which can degrade the speed massively. But as long as the same kind of memory can be used, the order of the execution time is the same for loops and vectorized code, roughly.
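As an illustration, a sketch along these lines (the sizes are assumptions and the effect is machine-dependent):
small = rand(1, 1e4);   % ~80 KB, the temporary fits into the cache
large = rand(1, 1e7);   % ~80 MB, the temporary lives in RAM
t1 = timeit(@() sum(small .* small));   % timeit averages over many runs
t2 = timeit(@() sum(large .* large));
fprintf('ns per element: %.2f vs %.2f\n', 1e9*t1/1e4, 1e9*t2/1e7)
% "large .* large" creates an 80 MB temporary array. Once such temporaries no
% longer fit into the cache, the time per element rises, although the number
% of operations grows only linearly.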
But sometimes vectorized code is multi-threaded, e.g. in the sum command. Then, above a certain input size, the number of free cores matters as well. While summing 2e5 elements takes about twice as long as summing 1e5 elements, this does not hold true for 10e5 versus 5e5 elements.
3 comments
Walter Roberson on 19 May 2017
For sufficiently large arrays, prod() would be handed over to one of the fast mathematical libraries such as MKL or BLAS. Those libraries are highly optimized: they can take advantage of multiple cores and of advanced hardware instructions such as the various SIMD (Single Instruction, Multiple Data) extensions, which might allow higher performance rates than would be obvious. Your exact primary and secondary cache sizes can turn out to matter, your exact CPU can matter, the exact array size can matter...
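One can at least inspect which libraries and how many computational threads a given MATLAB session uses; both calls below are documented MATLAB functions:
version('-blas')     % reports the BLAS build, typically an MKL version string
version('-lapack')   % reports the LAPACK build
maxNumCompThreads    % current maximum number of computational threads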
Jan on 19 May 2017
Edited: Jan on 19 May 2017
The code I've posted before did not show anything relevant.
Neither BLAS nor MKL provides a function for computing the product of the elements of an array.
SUM and PROD use multiple cores when the input has more than 88999 elements:
x = rand(1, 88999);   % just below the multi-threading threshold
tic; for i = 1:50000; v = sum(x); clear('v'); end, toc
tic; for i = 1:50000; v = prod(x); clear('v'); end, toc
x = rand(1, 89000);   % one element more: multi-threaded
tic; for i = 1:50000; v = sum(x); clear('v'); end, toc
tic; for i = 1:50000; v = prod(x); clear('v'); end, toc
% Matlab 2016b, Core2Duo
Elapsed time is 6.926055 seconds.
Elapsed time is 10.872182 seconds.
Elapsed time is 4.241430 seconds.
Elapsed time is 6.319897 seconds.
See also the Performance tab of the task manager: for the smaller problem only 1 core is used, but 2 (or all?) for the larger one. On my 2-core machine the code is not 50% faster but only about 40%, because starting a thread is rather expensive. That SUM and PROD share the same limit does not look "highly optimized", because the break-even point should be reached earlier for the more expensive product. In contrast, the optimized BLAS libraries like ATLAS or MKL do take the cache sizes and the cost of starting threads into account.
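One way to test the threading explanation directly is to pin MATLAB to a single computational thread and repeat the measurement; maxNumCompThreads is a documented (though discouraged) function, and the expected outcome is a guess, not a measurement:
nOld = maxNumCompThreads(1);   % force single-threaded execution
x = rand(1, 89000);
tic; for i = 1:50000; v = sum(x); end, toc   % the speed-up should disappear
maxNumCompThreads(nOld);   % restore the previous setting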
In older Matlab versions (R2009a) the multi-threading was implemented cheaply: the outputs of the threads were collected in the order in which the threads finished. Rounding errors could then cause sum to return different results depending on random factors.
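The underlying reason is that floating-point addition is not associative, so the order in which partial sums are combined changes the result. A tiny sketch:
x = rand(1, 1e6);
d = sum(x) - sum(x(end:-1:1));   % same elements, opposite accumulation order
fprintf('difference: %g\n', d)   % usually a few eps, not necessarily 0
% A multi-threaded sum that merges the partial results in the order the
% threads finish can therefore produce run-to-run differences in the last bits.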
By the way: Look at the timings under R2009a on the same machine:
Elapsed time is 6.628261 seconds.
Elapsed time is 10.488722 seconds.
Elapsed time is 9.719854 seconds.
Elapsed time is 11.743312 seconds.
As in R2016b, the 88999-element input causes 50% processor load and the 89000-element input 100%, but the code was not faster; it was substantially slower. Scary.
More Answers (0)