Setup DimGrid and DimBlock in CUDA


I am doing matrix multiplication in CUDA. The following setup works:

int tile = 8;
dim3 dimgrid((numccolumns - 1)/tile + 1, (numcrows - 1)/tile + 1, 1);
dim3 dimblock(tile, tile, 1);

But if I use one block for the whole image, it returns zero. What is the reason for that? I assume one block can contain the whole image (my input is 64x64).

dim3 dimgrid(1, 1, 1);
dim3 dimblock(numccolumns, numcrows, 1);

This is how I call the kernel in the main function:

matrixmultiply<<<dimgrid, dimblock>>>(devicea, deviceb, devicec,
                                      numarows, numacolumns,
                                      numbrows, numbcolumns,
                                      numcrows, numccolumns);

And the kernel:

__global__ void matrixmultiply(float * a, float * b, float * c,
                               int numarows, int numacolumns,
                               int numbrows, int numbcolumns,
                               int numcrows, int numccolumns) {
    //@@ Insert code to implement matrix multiplication here
    // Each thread computes one element of c.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads that fall outside the output matrix.
    if ((row < numcrows) && (col < numccolumns))
    {
        float value = 0.0f;
        for (int i = 0; i < numacolumns; i++)
            value += a[row * numacolumns + i] * b[i * numbcolumns + col];
        c[row * numccolumns + col] = value;
    }
}

"But if I use one block for the whole image, it returns zero. What is the reason for that?"

A CUDA threadblock is limited to a maximum of 1024 threads (refer to "Maximum number of threads per block"). For a multidimensional threadblock, this means the product of the dimensions must be less than or equal to 1024 (for cc2.x and newer GPUs).
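The limit can also be read at run time instead of being looked up. The following is a minimal sketch, assuming device 0 and the 64x64 size from the question; the names `blockx`, `blocky`, and `threads` are illustrative and not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumption: query device 0; pick another index on a multi-GPU system.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Block shape from the question: one thread per element of a 64x64 matrix.
    int blockx = 64, blocky = 64;
    int threads = blockx * blocky;  // 4096 threads requested in one block

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("requested %d threads: %s\n", threads,
           threads <= prop.maxThreadsPerBlock ? "fits in one block"
                                              : "too many for one block");
    return 0;
}
```

Note that each dimension has its own limit (`maxThreadsDim`) in addition to the limit on the product, so both need to hold for a launch to be accepted.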

So for a 64x64 image, this will not work:

dim3 dimblock(numccolumns, numcrows, 1); 

since numccolumns * numcrows = 64 * 64 = 4096, which is greater than 1024. The launch fails, so the kernel never runs and c is never written.

If you do proper CUDA error checking in your code, you'll get an indication of this (that the kernel launch is failing due to an invalid kernel configuration parameter).
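As a rough illustration of such error checking, here is a minimal sketch. It assumes a hypothetical empty kernel `dummykernel` standing in for matrixmultiply, with the 64x64 block shape hard-coded; only the checking pattern around the launch is the point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, standing in for matrixmultiply.
__global__ void dummykernel() {}

int main() {
    dim3 dimgrid(1, 1, 1);
    dim3 dimblock(64, 64, 1);  // 4096 threads: exceeds the 1024-thread limit

    dummykernel<<<dimgrid, dimblock>>>();

    // Launch-configuration errors are reported right after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Errors raised while the kernel executes surface at a synchronizing call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("kernel completed without errors\n");
    return 0;
}
```

With this block shape the first check should trip and report a configuration error. Changing dimblock to something like (8, 8, 1) with a matching grid should let both checks pass.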

