Setup DimGrid and DimBlock in CUDA
I am doing matrix multiplication in CUDA. The following setup works:

    int tile = 8;
    dim3 dimgrid((numccolumns - 1)/tile + 1, (numcrows - 1)/tile + 1, 1);
    dim3 dimblock(tile, tile, 1);

But if I use one block for the whole image, it returns zero. What is the reason for that? Assume that one block can contain the whole image (the input is 64x64).

    dim3 dimgrid(1, 1, 1);
    dim3 dimblock(numccolumns, numcrows, 1);

This is how I call the kernel in the main function:

    matrixmultiply<<<dimgrid, dimblock>>>(devicea, deviceb, devicec,
                                          numarows, numacolumns,
                                          numbrows, numbcolumns,
                                          numcrows, numccolumns);

And the kernel:

    __global__ void matrixmultiply(float * a, float * b, float * c,
                                   int numarows, int numacolumns,
                                   int numbrows, int numbcolumns,
                                   int numcrows, int numccolumns) {
        //@@ Insert code to implement matrix multiplication here
        // Each thread computes one element of c.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        // Skip threads that fall outside the output matrix.
        if ((row < numcrows) && (col < numccolumns))
        {
            float value = 0.0;
            for (int i = 0; i < numacolumns; i++)
                value += a[row * numacolumns + i] * b[i * numbcolumns + col];
            c[row * numccolumns + col] = value;
        }
    }

"But if I use one block for the whole image, it returns zero. What is the reason for that?"
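A CUDA threadblock is limited to a maximum of 1024 threads (refer to "Maximum number of threads per block" in the compute capability specifications of the CUDA programming guide). For a multidimensional threadblock, this means the product of the dimensions must be less than or equal to 1024 (for cc2.x and newer GPUs).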
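
The limit can also be read from the device at run time rather than looked up. The following is a minimal sketch, assuming device 0 is the GPU in use; `blockFits` is a hypothetical helper introduced here for illustration and is not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true when a block of the given shape
// respects both the per-dimension limits and the total thread limit.
bool blockFits(const cudaDeviceProp &prop, dim3 block) {
    unsigned long long total =
        (unsigned long long)block.x * block.y * block.z;
    return total <= (unsigned long long)prop.maxThreadsPerBlock &&
           block.x <= (unsigned int)prop.maxThreadsDim[0] &&
           block.y <= (unsigned int)prop.maxThreadsDim[1] &&
           block.z <= (unsigned int)prop.maxThreadsDim[2];
}

int main() {
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU the kernel will run on.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    printf("8x8 block fits:   %s\n",
           blockFits(prop, dim3(8, 8, 1)) ? "yes" : "no");
    printf("64x64 block fits: %s\n",
           blockFits(prop, dim3(64, 64, 1)) ? "yes" : "no");
    return 0;
}
```

On a device whose limit is 1024 threads per block, the 8x8 shape passes the check and the 64x64 shape does not.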
For a 64x64 image, this will not work:

    dim3 dimblock(numccolumns, numcrows, 1);

since numccolumns * numcrows (64 * 64 = 4096) is greater than 1024.
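
One way to stay within the limit is to keep the block small and let the grid cover the matrix, as the working setup in the question already does. The sketch below only computes the launch configuration; `gridFor` is a hypothetical helper name, and the tile size of 16 is an arbitrary choice for illustration (any tile with tile * tile <= 1024 satisfies the total-thread limit).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of tile x tile blocks needed to cover a
// cols x rows matrix, using ceiling division in each dimension.
dim3 gridFor(int cols, int rows, int tile) {
    return dim3((cols - 1) / tile + 1, (rows - 1) / tile + 1, 1);
}

int main() {
    const int numccolumns = 64;  // output matrix width from the question
    const int numcrows = 64;     // output matrix height from the question
    const int tile = 16;         // assumed tile size: 16 * 16 = 256 threads

    dim3 dimblock(tile, tile, 1);
    dim3 dimgrid = gridFor(numccolumns, numcrows, tile);

    printf("grid = %u x %u, threads per block = %u\n",
           dimgrid.x, dimgrid.y, dimblock.x * dimblock.y * dimblock.z);
    // prints "grid = 4 x 4, threads per block = 256"
    return 0;
}
```

The bounds check already present in the kernel (row < numcrows and col < numccolumns) handles matrices whose dimensions are not a multiple of the tile size.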
If you do proper CUDA error checking in your code, you'll get an indication of this (that the kernel launch is failing due to an invalid kernel configuration parameter).
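
A minimal sketch of such a check follows. It launches an empty placeholder kernel (`noop`, not the kernel from the question) with the oversized 64x64 block and inspects the result of `cudaGetLastError()`; `tryLaunch` is a hypothetical helper name used only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: the launch configuration is what is being tested.
__global__ void noop() {}

// Hypothetical helper: launches noop with the given configuration and
// returns the launch status reported by the runtime.
cudaError_t tryLaunch(dim3 grid, dim3 block) {
    noop<<<grid, block>>>();
    // Launch-time errors (such as a bad configuration) show up here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    // Errors raised while the kernel runs show up at synchronization.
    return cudaDeviceSynchronize();
}

int main() {
    dim3 dimgrid(1, 1, 1);
    dim3 dimblock(64, 64, 1);  // 4096 threads, above the 1024 limit

    cudaError_t err = tryLaunch(dimgrid, dimblock);
    if (err != cudaSuccess) {
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
    } else {
        printf("kernel launch succeeded\n");
    }
    return 0;
}
```

With the 64x64 block, the launch is expected to fail with `cudaErrorInvalidConfiguration`, so the kernel never runs and the output buffer is never written, which would explain the zeros seen in the question.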