Grid-Stride Loops

In summary, a grid-stride loop does not assume that the thread grid is large enough to cover the entire data array; instead, the kernel loops over the array one grid-size at a time. As a result, the same kernel scales to any problem size and any launch configuration:

__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
  int stride = blockDim.x * gridDim.x;                // total number of threads in the grid
  for (int i = index; i < n; i += stride)             // step forward by one whole grid
    y[i] = x[i] + y[i];
}
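To see why the pattern covers every element even when the grid is much smaller than the array, the same index arithmetic can be simulated on the CPU. This is a sketch, not CUDA code: the function name and the loop over simulated blocks/threads are mine, with variables named after the CUDA built-ins they stand in for.

```cpp
#include <vector>

// CPU sketch of the grid-stride pattern (names mirror CUDA built-ins).
// Even a simulated grid far smaller than n covers every element,
// because each simulated thread keeps striding through the array.
void grid_stride_add(int n, const float* x, float* y,
                     int blockDim, int gridDim) {
    int stride = blockDim * gridDim;
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)
            for (int i = blockIdx * blockDim + threadIdx; i < n; i += stride)
                y[i] = x[i] + y[i];  // same body as the CUDA kernel above
}
```

With blockDim = 32 and gridDim = 4 there are only 128 simulated threads, yet an array of 1000 elements is still fully processed, each element exactly once.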

Or, equivalently, with the index arithmetic folded into the for statement:

__global__
void saxpy(int n, float a, float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = a * x[i] + y[i];
    }
}
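Because the loop tolerates any grid size, the host can still size the grid to cover the data in one pass with a ceiling division, or cap it at a fixed block count (for example, a multiple of the SM count) and let the loop pick up the remainder. A minimal sketch of that host-side arithmetic; the function names are mine:

```cpp
#include <algorithm>

// Ceiling division: enough blocks of blockSize to cover n elements
// in a single pass of the grid.
int blocks_to_cover(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// With a grid-stride kernel the grid may instead be capped at a
// tuned maximum; any leftover elements are handled by extra loop
// iterations rather than extra blocks.
int capped_blocks(int n, int blockSize, int maxBlocks) {
    return std::min(blocks_to_cover(n, blockSize), maxBlocks);
}
```

Either result can be passed as the first launch parameter (e.g. `add<<<numBlocks, blockSize>>>(n, x, y)`); correctness no longer depends on the choice, only performance does.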

References:
An Even Easier Introduction to CUDA;
CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops.
