"compute-bound" & "memory-bound" kernels
A "compute-bound" kernel spends most of its time on arithmetic rather than on memory accesses.
"Memory-bound" kernels fall into two kinds:
a) "bandwidth-bound": data transfer between the device and its global memory comes close to the peak memory bandwidth;
b) "latency-bound": the kernel stalls waiting for individual memory fetches to complete, even though the memory bandwidth is far from saturated.
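As a rough illustration of how the three categories above can be told apart from profiler numbers, here is a minimal sketch. It assumes you already have two utilization fractions from a profiler (achieved compute throughput and achieved memory bandwidth, each relative to the hardware peak). The function name `classify_kernel` and the 0.6 cutoff are hypothetical choices for this example, not values taken from any tool, and real kernels can also stall on instruction latency, which this sketch does not separate out.

```python
def classify_kernel(compute_util, memory_util, threshold=0.6):
    """Label a kernel from two utilization fractions in [0, 1].

    compute_util: achieved compute throughput / peak compute throughput
    memory_util:  achieved memory bandwidth / peak memory bandwidth
    threshold:    hypothetical cutoff for "close to the peak" (assumption)
    """
    for value in (compute_util, memory_util, threshold):
        if not 0.0 <= value <= 1.0:
            raise ValueError("utilizations and threshold must be in [0, 1]")

    # Arithmetic units are the busiest resource and are near their peak.
    if compute_util >= threshold and compute_util >= memory_util:
        return "compute-bound"

    # Memory traffic is near the peak bandwidth.
    if memory_util >= threshold:
        return "bandwidth-bound"

    # Neither resource is saturated: the kernel is mostly waiting.
    return "latency-bound"


if __name__ == "__main__":
    # Made-up utilization pairs, purely for illustration.
    for compute, memory in [(0.9, 0.3), (0.2, 0.85), (0.2, 0.25)]:
        print(compute, memory, classify_kernel(compute, memory))
```

The ordering of the checks matters: when both resources are above the cutoff, the kernel is labeled by whichever is more heavily used.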
Please refer to the following diagram:
References:
Stack Overflow;
What the Profiler is Telling You: Optimizing GPU Kernels.