PTX kernel time to run

Question

Hello, I am using R2010b and CUDA Toolkit 3.1 with a GeForce GT 425M. While optimizing my CUDA code, I observed that calling the kernel with feval in MATLAB takes a roughly constant ~2 ms, measured with

    tic; feval(k,...); toc

the kernel code:

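    #define C_WIDTH 1024
    #define C_HEIGHT 768
    __global__ void timetest1(float* holo) {
        // Global index of this thread in the 1-D launch
        int mindex = blockIdx.x * blockDim.x + threadIdx.x;
        int size = C_WIDTH * C_HEIGHT;
        if (mindex >= size)
            return;
        // Square in float: the int product would overflow for large indices
        holo[mindex] = (float)mindex * (float)mindex;
    }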

Even if I comment out the write to global memory (the holo[mindex]=... line), the call still takes ~2 ms.

Does anybody know the origin of this lag? It would be great to somehow eliminate it.

Thanks,

Gaszton

PS: my MATLAB code for calling the kernel:

    clear
    import parallel.gpu.GPUArray

    xsize=1024; ysize=768;
    vectorsize=xsize*ysize;
    threadpblock=1024;
    k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
    k.ThreadBlockSize=[threadpblock,1,1];
    k.GridSize=[ceil(vectorsize/threadpblock),1];

    dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');

    tic
    dholo=feval(k,dholo);
    time=toc;

    disp(['ms time= ' num2str(time*1000)])

    clear

Answer 1

Edric Ellis on 16 May 2011

Firstly, can I suggest that, if possible, you upgrade to R2011a, as we made quite a few performance improvements in that release.

Secondly, I think the main bottleneck in your code as written is that, outside a function, an important optimisation called "in-place optimisation" cannot take place. If you place your code inside a function, then "dholo" will not be copied. For reference, I made a function like this:

    function tmp
    import parallel.gpu.GPUArray
    xsize=1024; ysize=768;
    vectorsize=xsize*ysize;
    threadpblock=512; % I have a C1060
    k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
    k.ThreadBlockSize=[threadpblock,1,1];
    k.GridSize=[ceil(vectorsize/threadpblock),1];
    dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');
    tic
    for ii = 1:1000
        % Same variable as input and output, so the call can work in place
        dholo=feval(k,dholo);
    end
    time=toc;
    % Total seconds for 1000 calls is numerically the time per call in ms
    disp(['ms time= ' num2str(time)])

With this, the overhead on my C1060 was down to 0.05 ms per call.
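To separate the steady-state cost per call from any one-time cost on the first launch, the same measurement can be wrapped in a small helper. This is only a minimal sketch, not something from the thread: the names `timeKernelSketch`, `numCalls` and `msPerCall` are made up for illustration, and it assumes the `TimeTest.ptx` and `TimeTest.cu` files from the question are on the path. It also assumes that kernel launches may be queued asynchronously, so it copies the result back with `gather` before stopping the timer. That adds one device-to-host copy to the total, which is amortised over `numCalls`.

```matlab
function msPerCall = timeKernelSketch(numCalls)
% Hypothetical helper (not from the thread): average feval time per call.
% Assumes TimeTest.ptx and TimeTest.cu from the question are on the path.
xsize = 1024; ysize = 768;
vectorsize = xsize*ysize;
threadpblock = 512;   % 512 threads per block, as in the function above

k = parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize = [threadpblock, 1, 1];
k.GridSize = [ceil(vectorsize/threadpblock), 1];

dholo = parallel.gpu.GPUArray.zeros(vectorsize, 1, 'single');

% Warm-up call, kept outside the timed region so that any one-time
% cost of the first launch is not counted.
dholo = feval(k, dholo);

tic
for ii = 1:numCalls
    % Same variable as input and output, as in the function above.
    dholo = feval(k, dholo);
end
% Assumption: launches may still be queued at this point. Copying the
% result to the host means the timer stops only after they have finished.
hostCopy = gather(dholo);
msPerCall = 1000*toc/numCalls;
end
```

Calling `timeKernelSketch(1000)` mirrors the loop count used above. In newer releases the `parallel.gpu.GPUArray` class name has been replaced by `gpuArray`, so the allocation line would need adjusting there.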

Gaszton on 16 May 2011

Thank you for your help!

I am a PhD student at the Biological Research Centre of the Hungarian Academy of Sciences in Hungary.

We have a network licence (with a limited number of MATLAB instances that can run in parallel).

We usually buy a MATLAB update every 1-2 years, but I don't really have any influence over that.

Thank you again,

Gaszton
