Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
(Working through the revamped CUDA programming notes in parallel with PMPP)
Compute Unified Device Architecture (CUDA), 2006-
Much higher instruction throughput & memory bandwidth than a CPU for a similar price & power envelope
Chip resource comparison
Libraries: cuBLAS, cuFFT, cuDNN, CUTLASS; Warp, Triton
Programming model
kernel is a function invoked for execution on GPUs
CPU <-> GPU is generally a PCIe or NVLink interconnect
hardware model
grid
thread block clusters: optional grouping for compute capability 9.0+
warp
GPU Memory
Unified memory
CUDA Platform
PTX (parallel thread execution)
compute_80
NVRTC does compilation at runtime
Kernels
__global__ modifier to make a kernel, allow it to be invoked from a kernel launch
void return type
cudaLaunchKernelEx
triple chevron notation, somekernel<<<grid, thread block>>>
dim3 is used for 2 or 3d grids: .x, .y, .z
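A minimal sketch tying the launch syntax to the index arithmetic (the kernel and array names here are my own, not from the notes):

```cuda
#include <cuda_runtime.h>

// __global__: entry point, launched from the host; must return void.
__global__ void add(const float *a, const float *b, float *out, int n) {
    // Global work index: offset within the block plus the block's offset.
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    if (workIndex < n) out[workIndex] = a[workIndex] + b[workIndex];
}

int main() {
    int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    int block = 256;
    int grid = (n + block - 1) / block;  // ceil_div by hand
    add<<<grid, block>>>(a, b, out, n);  // triple chevron launch
    cudaDeviceSynchronize();
    return 0;
}
```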
threadIdx, blockDim, blockIdx, gridDim
workIndex = threadIdx.x + blockDim.x * blockIdx.x
cuda::ceil_div
Memory
cudaMallocManaged for unified memory; some Linux systems manage this automatically
__managed__
cudaMallocHost(ptr, size), cudaMalloc(ptr, size)
cudaMemcpy copies data between devices: cudaMemcpy(dst, src, size, direction)
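A sketch of the explicit allocate-and-copy path (the variable names are mine):

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h;                  // pinned host memory: faster DMA transfers
    cudaMallocHost(&h, bytes);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;                  // plain device allocation
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    // ... kernel work on d ...
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    // cudaMemcpyDefault would infer the direction from the pointers instead

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```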
cudaMemcpyDefault figures out which copy to make
Synchronization
cudaDeviceSynchronize waits for all work
Misc: had to use nvcc -arch=<> to compile successfully, relied on Claude
__syncthreads() synchronizes threads within a block
cooperative groups for broader synchronization
CUDA context -- primary context for the device, initialized at first runtime function that needs an active context
cudaInitDevice, cudaSetDevice initialize runtime and primary context
cudaDeviceReset destroys primary context
Error checking
cudaError_t: always check and manage return value
#define CUDA_CHECK(expr_to_check) do { \
    cudaError_t result = expr_to_check; \
    if (result != cudaSuccess) { \
      fprintf(stderr, "CUDA Runtime Error: %s:%i:%d = %s\n", \
              __FILE__, __LINE__, result, cudaGetErrorString(result)); \
    } \
  } while(0)
CUDA_CHECK(cudaGetLastError()); for checking async launches
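Putting the macro to work around a launch (a sketch; badKernel is a stand-in name, and the CUDA_CHECK macro defined above is assumed to be in scope):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void badKernel() {}

int main() {
    CUDA_CHECK(cudaSetDevice(0));        // sync API: check the return value directly
    badKernel<<<1, 2048>>>();            // 2048 > 1024 threads/block: invalid config
    CUDA_CHECK(cudaGetLastError());      // catches the bad launch configuration
    CUDA_CHECK(cudaDeviceSynchronize()); // surfaces errors from async execution
    return 0;
}
```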
CUDA_LOG_FILE prints errors with details more explicitly, I'm going to always set this
stdout or stderr
device + host functions
__global__ indicates entry point for a kernel
__device__ -- compile for GPU, callable from other device or global functions
__device__, __constant__, __managed__, __shared__
__CUDA_ARCH__ to check if compiling for GPU inside a function
thread block clusters
__cluster_dims__ annotation to launch to a cluster
kernel writing
memory spaces
| memory type | scope | lifetime | location | notes |
|---|---|---|---|---|
| global | grid | application | device | primary memory, careful about data races |
| constant | grid | application | device | __constant__ specifier outside any function, typically 64KB |
| shared | block | kernel | SM | uses same resource as L1 cache, user scratchpad; get device properties for size; cudaFuncSetCacheConfig to customize allocation; static: __shared__ float sharedArray[1337]; dynamic: extern __shared__ float sharedArray[] with fn<<<grid, block, sharedmembytes>>>; must be manually partitioned & aligned for multiple arrays |
| local | thread | kernel | device | physically in global space; consecutive 32-bit words are accessed by consecutive thread ids; accesses are coalesced if threads access same relative addrs |
| register | thread | kernel | SM | managed by compiler; regsPerMultiprocessor, regsPerBlock |
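A sketch of the dynamic shared-memory form from the table, together with __syncthreads (a block-wide array reversal; the names are mine):

```cuda
#include <cuda_runtime.h>

// extern [] form: the size comes from the third launch parameter.
__global__ void reverseBlock(float *data) {
    extern __shared__ float scratch[];
    int t = threadIdx.x;
    scratch[t] = data[t];
    __syncthreads();  // all writes to scratch visible before any reads
    data[t] = scratch[blockDim.x - 1 - t];
}

int main() {
    const int block = 256;
    float *d;
    cudaMalloc(&d, block * sizeof(float));
    // third chevron argument = dynamic shared memory bytes per block
    reverseBlock<<<1, block, block * sizeof(float)>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```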
cache
l2CacheSize property
texture/surface memory -- for graphics
distributed shared memory
use cooperative groups header for using clusters
Memory Performance