Synchronization
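A kernel launch is asynchronous with respect to the host: the launch call returns before the kernel has finished. You may therefore need to synchronize explicitly (e.g., cudaDeviceSynchronize) or rely on a call that synchronizes implicitly (e.g., cudaMemcpy). Take care with the Memcpy and Memset families of functions, because their synchronization behavior depends on the variant and on the kind of memory involved; for details, see "API synchronization behavior" in the CUDA Runtime API documentation. You can also refer to the image below: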
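As a rough illustration of the two options mentioned above, here is a minimal sketch. The kernel addOne, the CHECK macro, and the array size are hypothetical names and values chosen for this example only. It assumes everything runs on the default stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this example: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    CHECK(cudaMalloc(&dev, n * sizeof(int)));
    CHECK(cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice));

    // The launch returns control to the host right away;
    // the kernel may still be running after this line.
    addOne<<<(n + 127) / 128, 128>>>(dev, n);
    CHECK(cudaGetLastError());  // catches launch-configuration errors

    // Option 1 (explicit): block the host until all previously
    // issued device work has completed.
    CHECK(cudaDeviceSynchronize());

    // Option 2 (implicit): a device-to-host cudaMemcpy on the default
    // stream runs after the kernel and blocks the host until the copy
    // is done, so this call alone would also be enough to read the
    // results safely.
    CHECK(cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost));

    printf("host[0] = %d, host[%d] = %d\n", host[0], n - 1, host[n - 1]);

    CHECK(cudaFree(dev));
    return 0;
}
```

In this sketch the cudaDeviceSynchronize call is redundant because the copy that follows already waits for the kernel. It is kept only to show both forms side by side. With the asynchronous variants (e.g., cudaMemcpyAsync) or non-default streams the host is not blocked in the same way, so check the documented behavior for each call.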