
Host-device transfers are slow! With PCIe 2.0 x16 they can reach up to 4 GiB/s. Use cudaMallocHost().
==21569== NVPROF is profiling process 21569, command.
Check the grid and block sizes.
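As a rough illustration of the cudaMallocHost() advice, the sketch below times one host-to-device copy from ordinary pageable memory and one from page-locked memory. The 256 MiB buffer size and the helper names `bandwidth_gib_per_s` and `time_h2d_ms` are assumptions made for this example, not part of the original material, and the figures it prints depend entirely on the machine.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: convert `bytes` moved in `ms` milliseconds into GiB/s.
static double bandwidth_gib_per_s(size_t bytes, float ms) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0) / (ms / 1000.0);
}

// Hypothetical helper: time one host-to-device copy with CUDA events.
static float time_h2d_ms(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    const size_t bytes = (size_t)256 << 20;  // 256 MiB, an arbitrary test size
    void *d_buf = NULL;
    void *h_pinned = NULL;
    void *h_pageable = malloc(bytes);        // ordinary pageable host memory

    if (h_pageable == NULL) return 1;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return 1;
    if (cudaMallocHost(&h_pinned, bytes) != cudaSuccess) return 1;  // page-locked

    // Touch both buffers so the pages exist before timing.
    memset(h_pageable, 1, bytes);
    memset(h_pinned, 1, bytes);

    // Untimed warm-up copy so that one-time setup costs are not measured.
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    float ms_pageable = time_h2d_ms(d_buf, h_pageable, bytes);
    float ms_pinned = time_h2d_ms(d_buf, h_pinned, bytes);

    printf("pageable: %.2f GiB/s\n", bandwidth_gib_per_s(bytes, ms_pageable));
    printf("pinned:   %.2f GiB/s\n", bandwidth_gib_per_s(bytes, ms_pinned));

    cudaFreeHost(h_pinned);  // memory from cudaMallocHost() is freed with cudaFreeHost()
    cudaFree(d_buf);
    free(h_pageable);
    return 0;
}
```

Running the resulting binary under nvprof should also list the two timed copies among the memcpy operations, which is another way to compare them.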

SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
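To make the two legend entries concrete, here is a minimal sketch of a kernel that uses both kinds of shared memory. The kernel name `smem_demo`, the wrapper `run_smem_demo`, and the sizes are hypothetical choices for this example. The static array is sized at compile time, while the dynamic one gets its size from the third launch parameter.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256   // threads per block (assumed for this example)
#define BLOCKS  4     // blocks in the grid (assumed for this example)

// Hypothetical kernel that uses static and dynamic shared memory together.
__global__ void smem_demo(float *data) {
    __shared__ float s_static[THREADS];   // static: size known at compile time
    extern __shared__ float s_dynamic[];  // dynamic: size set at kernel launch
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    s_static[t] = data[i];
    s_dynamic[t] = 2.0f * data[i];
    __syncthreads();  // all shared writes must finish before any thread reads

    // Mirrored element of the block plus twice the thread's own element.
    data[i] = s_static[blockDim.x - 1 - t] + s_dynamic[t];
}

// Runs the kernel and returns the number of elements that differ from the
// value computed on the host (-1 on a CUDA or allocation error).
int run_smem_demo(void) {
    const int n = BLOCKS * THREADS;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    float *d = NULL;
    int mismatches = 0;

    if (h == NULL) return -1;
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { free(h); return -1; }
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    smem_demo<<<BLOCKS, THREADS, THREADS * sizeof(float)>>>(d);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(d); free(h); return -1; }
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    for (int b = 0; b < BLOCKS; ++b) {
        for (int t = 0; t < THREADS; ++t) {
            int i = b * THREADS + t;
            float expected = (float)(b * THREADS + THREADS - 1 - t) + 2.0f * (float)i;
            if (h[i] != expected) ++mismatches;
        }
    }
    cudaFree(d);
    free(h);
    return mismatches;
}

int main(void) {
    printf("mismatches: %d\n", run_smem_demo());
    return 0;
}
```

In a GPU trace from nvprof (for example with `--print-gpu-trace`), the SSMem and DSMem columns for this kernel should each correspond to the 256 floats per block requested above, one declared in the kernel and one requested at launch.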
#Cudalaunch nvprof driver
This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.

$ nvcc -g -arch=sm_52 --ptxas-options=-v --compiler-options "-O3 -mcmodel=medium" ma4.cu
NVIDIA (R) CUDA Debugger
7.5 release
(cuda-gdb) l
16          c = (float)threadIdx.x+blockIdx.x
17      }
18
19      __global__ void ma4(void)
(cuda-gdb) break ma4()
(cuda-gdb) run
Breakpoint 1, 0x00000002009f50c8 in ma4()> ()
(cuda-gdb) step
Single stepping until exit from function _Z3ma4v, which has no line number information.
