Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
(Working through the revamped CUDA programming notes in parallel with PMPP)
Compute Unified Device Architecture (CUDA), 2006-
Much higher instruction throughput & memory bandwidth than a CPU for a similar price & power envelope
Chip resource comparison
Libraries: cuBLAS, cuFFT, cuDNN, CUTLASS; Warp, Triton
Programming model
kernel is a function invoked for execution on GPUs
CPU <-> GPU is generally a PCIe or NVLink interconnect
hardware model
grid
thread block clusters: optional grouping for compute capability 9.0+
warp
GPU Memory
Unified memory
CUDA Platform
PTX (parallel thread execution)
compute_80
NVRTC does compilation at runtime
Kernels
__global__ modifier to make a kernel, allow it to be invoked from a kernel launch
void return type
cudaLaunchKernelEx
triple chevron notation, somekernel<<<grid, thread block>>>
dim3 is used for 2 or 3d grids: .x, .y, .z
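A minimal sketch tying the launch syntax to the index arithmetic (the kernel and array names here are my own, not from the notes):

```cuda
#include <cuda_runtime.h>

// __global__: entry point, launched from the host; must return void.
__global__ void add(const float *a, const float *b, float *out, int n) {
    // Global work index: offset within the block plus the block's offset.
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    if (workIndex < n) out[workIndex] = a[workIndex] + b[workIndex];
}

int main() {
    int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    int block = 256;
    int grid = (n + block - 1) / block;  // ceil_div by hand
    add<<<grid, block>>>(a, b, out, n);  // triple chevron launch
    cudaDeviceSynchronize();
    return 0;
}
```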
threadIdx, blockDim, blockIdx, gridDim
workIndex = threadIdx.x + blockDim.x * blockIdx.x
cuda::ceil_div
Memory
cudaMallocManaged for unified memory; some Linux systems manage this automatically
__managed__
cudaMallocHost(ptr, size), cudaMalloc(ptr, size)
cudaMemcpy copies data between devices: cudaMemcpy(dst, src, size, direction)
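A sketch of the explicit allocate-and-copy path (the variable names are mine):

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h;                  // pinned host memory: faster DMA transfers
    cudaMallocHost(&h, bytes);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;                  // plain device allocation
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    // ... kernel work on d ...
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    // cudaMemcpyDefault would infer the direction from the pointers instead

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```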
cudaMemcpyDefault figures out which copy to make
Synchronization
cudaDeviceSynchronize waits for all work
Misc: had to use nvcc -arch=<> to compile successfully, relied on Claude
__syncthreads() synchronizes threads within a block
cooperative groups for broader synchronization
CUDA context -- primary context for the device, initialized at first runtime function that needs an active context
cudaInitDevice, cudaSetDevice initialize runtime and primary context
cudaDeviceReset destroys primary context
Error checking
cudaError_t: always check and manage return value
#define CUDA_CHECK(expr_to_check) do { \
    cudaError_t result = expr_to_check; \
    if (result != cudaSuccess) { \
      fprintf(stderr, "CUDA Runtime Error: %s:%i:%d = %s\n", \
              __FILE__, __LINE__, result, cudaGetErrorString(result)); \
    } \
  } while(0)
CUDA_CHECK(cudaGetLastError()); for checking async launches
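Putting the macro to work around a launch (a sketch; badKernel is a stand-in name, and the CUDA_CHECK macro defined above is assumed to be in scope):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void badKernel() {}

int main() {
    CUDA_CHECK(cudaSetDevice(0));        // sync API: check the return value directly
    badKernel<<<1, 2048>>>();            // 2048 > 1024 threads/block: invalid config
    CUDA_CHECK(cudaGetLastError());      // catches the bad launch configuration
    CUDA_CHECK(cudaDeviceSynchronize()); // surfaces errors from async execution
    return 0;
}
```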
CUDA_LOG_FILE prints errors with details more explicitly, I'm going to always set this
stdout or stderr
device + host functions
__global__ indicates entry point for a kernel
__device__ -- compile for GPU, callable from other device or global functions
__device__, __constant__, __managed__, __shared__
__CUDA_ARCH__ to check if compiling for GPU inside a function
thread block clusters
__cluster_dims__ annotation to launch to a cluster
kernel writing
memory spaces
| memory type | scope | lifetime | location | notes |
|---|---|---|---|---|
| global | grid | application | device | primary memory, careful about data races |
| constant | grid | application | device | __constant__ specifier outside any function, typically 64KB |
| shared | block | kernel | SM | uses same resource as L1 cache, user scratchpad; get device properties for size; cudaFuncSetCacheConfig to customize allocation; static: __shared__ float sharedArray[1337]; dynamic: extern __shared__ float sharedArray[] with fn<<<grid, block, sharedmembytes>>>; must be manually partitioned & aligned for multiple arrays |
| local | thread | kernel | device | physically in global space; consecutive 32-bit words are accessed by consecutive thread ids; accesses are coalesced if threads access same relative addrs |
| register | thread | kernel | SM | managed by compiler; regsPerMultiprocessor, regsPerBlock |
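A sketch of the dynamic shared-memory form from the table, together with __syncthreads (a block-wide array reversal; the names are mine):

```cuda
#include <cuda_runtime.h>

// extern [] form: the size comes from the third launch parameter.
__global__ void reverseBlock(float *data) {
    extern __shared__ float scratch[];
    int t = threadIdx.x;
    scratch[t] = data[t];
    __syncthreads();  // all writes to scratch visible before any reads
    data[t] = scratch[blockDim.x - 1 - t];
}

int main() {
    const int block = 256;
    float *d;
    cudaMalloc(&d, block * sizeof(float));
    // third chevron argument = dynamic shared memory bytes per block
    reverseBlock<<<1, block, block * sizeof(float)>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```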
cache
l2CacheSize property
texture/surface memory -- for graphics
distributed shared memory
use cooperative groups header for using clusters
Memory Performance