Setup DimGrid and DimBlock in CUDA
I am doing matrix multiplication in CUDA. The following setup works:
int tile = 8; dim3 dimgrid((numccolumns - 1)/tile + 1, (numcrows - 1)/tile + 1, 1); dim3 dimblock(tile, tile, 1);
But if I use one block for the whole image, it returns zero. What is the reason for that? I assumed that one block can contain the whole image (the input is 64x64).
dim3 dimgrid(1,1,1); dim3 dimblock(numccolumns, numcrows, 1);
This is how I call the kernel in the main function:
matrixmultiply<<<dimgrid, dimblock>>>(devicea, deviceb, devicec, numarows, numacolumns, numbrows, numbcolumns, numcrows, numccolumns);
And the kernel:
__global__ void matrixmultiply(float * a, float * b, float * c,
                               int numarows, int numacolumns,
                               int numbrows, int numbcolumns,
                               int numcrows, int numccolumns) {
    //@@ Insert code to implement matrix multiplication here
    // Each thread computes one element of c.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against threads that fall outside the output matrix.
    if ((row < numcrows) && (col < numccolumns)) {
        float value = 0.0f;
        for (int i = 0; i < numacolumns; i++)
            value += a[row * numacolumns + i] * b[i * numbcolumns + col];
        c[row * numccolumns + col] = value;
    }
}
"But if I use one block for the whole image, it returns zero. What is the reason for that?"
A CUDA threadblock is limited to a maximum of 1024 threads (refer to "Maximum number of threads per block"). For a multidimensional threadblock, this means the product of the dimensions must be less than or equal to 1024 (for cc2.x and newer GPUs).
For a 64x64 image, this will not work:
dim3 dimblock(numccolumns, numcrows, 1);
since numccolumns * numcrows (64 * 64 = 4096) is greater than 1024.
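If you want to confirm the limit on your own GPU, you can query it at runtime. The following is only a minimal sketch: it assumes device 0, and the helper names fits_in_one_block and ceil_div are made up for illustration (they are not part of the CUDA API). It compares the requested block size against the device limit and falls back to the tiled configuration from the question when the block is too large.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if a cols x rows block stays within the
// per-block thread limit. It only checks the total thread count.
static bool fits_in_one_block(int cols, int rows, int max_threads_per_block) {
    return (long long)cols * rows <= max_threads_per_block;
}

// Hypothetical helper: number of tiles of size `tile` needed to cover n
// elements (same arithmetic as (n - 1)/tile + 1 in the question).
static int ceil_div(int n, int tile) {
    return (n - 1) / tile + 1;
}

int main() {
    cudaDeviceProp prop;
    // Device 0 is an assumption; adjust for multi-GPU systems.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int numccolumns = 64, numcrows = 64;  // the 64x64 case from the question
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    printf("requested threads  = %d\n", numccolumns * numcrows);

    if (!fits_in_one_block(numccolumns, numcrows, prop.maxThreadsPerBlock)) {
        int tile = 8;  // tile size taken from the working setup in the question
        printf("block too large; use a %d x %d grid of %d x %d blocks instead\n",
               ceil_div(numccolumns, tile), ceil_div(numcrows, tile), tile, tile);
    }
    return 0;
}
```

With tile = 8, the 64x64 output is covered by an 8x8 grid of 8x8 blocks, which is 64 threads per block and well under the limit.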
If you do proper CUDA error checking in your code, you'll get an indication of this (that the kernel launch is failing due to an invalid kernel configuration parameter).
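As a rough sketch of what that error checking might look like: the macro name CHECK_CUDA, the helper try_launch, and the kernel empty_kernel below are hypothetical names chosen for this example, and the empty kernel stands in for matrixmultiply. The key point is to call cudaGetLastError() right after the launch, because a kernel launch does not return a status itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical macro: evaluates a cudaError_t expression and reports failures.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
        }                                                              \
    } while (0)

// Stand-in kernel (an assumption for this sketch); does no work.
__global__ void empty_kernel() {}

// Hypothetical helper: launches with the given configuration and returns
// the first error seen, either at launch time or during execution.
static cudaError_t try_launch(dim3 grid, dim3 block) {
    empty_kernel<<<grid, block>>>();
    cudaError_t err = cudaGetLastError();   // reports launch-configuration errors
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();      // reports errors raised while running
    }
    return err;
}

int main() {
    // 64 x 64 = 4096 threads in one block: exceeds the 1024-thread limit.
    CHECK_CUDA(try_launch(dim3(1, 1, 1), dim3(64, 64, 1)));
    // 8 x 8 = 64 threads per block: the tiled setup from the question.
    CHECK_CUDA(try_launch(dim3(8, 8, 1), dim3(8, 8, 1)));
    return 0;
}
```

On a GPU with the 1024-thread limit, the first launch should be the one that reports an error, while the tiled launch should pass silently. Without a check like this, the kernel simply never runs and the output buffer keeps whatever it held before, which is consistent with the zeros seen in the question.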