Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
2025-12-28
train a 70b parameter model on 15t tokens on 1024 h100s
total flops needed: 6 * 70b * 15t (6 is the magic constant: 2 FLOPs forward + 4 backward per param per token)
an h100 does 1979e12 FLOP/s (/2 because the listed number assumes sparsity)
mfu is .5
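A quick sanity check of the arithmetic above (back-of-envelope, not a schedule):

```python
# Back-of-envelope training time for 70B params / 15T tokens / 1024 H100s.
params = 70e9
tokens = 15e12
total_flops = 6 * params * tokens        # ~6.3e24 FLOPs

peak_per_gpu = 1979e12 / 2               # spec sheet quotes the sparse number
mfu = 0.5
n_gpus = 1024
achieved = n_gpus * peak_per_gpu * mfu   # effective FLOP/s across the cluster

seconds = total_flops / achieved
days = seconds / 86400
print(f"{days:.0f} days")                # roughly 144 days
```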
memory use
digging into transformers [I need to do this explicitly]
memory usage
tensors
float32 / fp32 / single precision
exponent 8 bit, fraction 23 bit, sign 1 bit
gpt3 one matrix in feedforward: 2.3gb
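Where the 2.3gb comes from (assuming GPT-3 175B's d_model of 12288 and the usual 4x FFN expansion):

```python
# One feedforward weight matrix of GPT-3 175B, stored in fp32.
d_model = 12288
d_ff = 4 * d_model                 # 49152
bytes_fp32 = 4
size = d_model * d_ff * bytes_fp32
print(f"{size / 2**30:.2f} GiB")   # ~2.25 GiB
```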
float16 / half precision
exponent 5 bit / 10 bit fraction / 1 bit sign
dynamic range is pretty bad
large models can have instability with under/overflow with this
bfloat16 (2018, Google)
brain float, by google brain
8 bit exponent, 7 bit fraction, sign
same dynamic range as float32
torch.finfo
typically used for computations as it's good enough
optimizer states and params still need float32
fp8 (2022, nvidia)
very crude: e4m3 and e5m2 options
e4m3 range is [-448, 448]; e5m2 range is [-57344, 57344]
supported by h100s
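The dynamic ranges above fall out of the bit layouts; a sketch deriving the max finite value from (exponent bits, mantissa bits) with the standard IEEE-style bias (torch.finfo reports the same numbers):

```python
# Largest finite value for an IEEE-style float with the given bit layout.
def max_finite(exp_bits, man_bits):
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias        # top exponent code reserved for inf/nan
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

print(max_finite(5, 10))   # fp16: 65504.0
print(max_finite(8, 7))    # bf16: ~3.39e38, same range as fp32
print(max_finite(5, 2))    # e5m2: 57344.0
# e4m3 breaks this convention: it reuses the all-ones exponent for
# ordinary values (only one NaN pattern), so its max is 448, not 240.
```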
float32 is too expensive
generally use mixed precision
want higher precision for anything that's accumulated over time
test with torch.cuda.memory_allocated
tensors are pointers into memory, plus strides for indexing into the matrix
be aware of tensor views
untyped_storage().data_ptr()
is_contiguous()
transpose: not contiguous anymore, cannot take more views
making it contiguous() will force a copy
elementwise operations create new tensors
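The view/contiguity behavior is the same in numpy, which makes for an easy sketch (same ideas carry over to torch tensors):

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
t = a.T                                # transpose is a view: no copy
assert np.shares_memory(t, a)
assert not t.flags["C_CONTIGUOUS"]     # no longer contiguous

c = np.ascontiguousarray(t)            # like torch's .contiguous(): forces a copy
assert c.flags["C_CONTIGUOUS"]
assert not np.shares_memory(c, a)

b = a + 1                              # elementwise ops allocate new memory
assert not np.shares_memory(b, a)
```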
triu is good for causal attention mask
matmul
when multiplying matrices with mismatched shapes, broadcasting just iterates over the missing dimensions
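A small numpy sketch of that broadcasting behavior (torch matmul follows the same rules):

```python
import numpy as np

# Batched matmul broadcasts over the leading (missing) dimensions.
x = np.ones((8, 16, 32))    # (batch, n, d)
w = np.ones((32, 4))        # (d, k): no batch dim
y = x @ w                   # w is reused for every batch element
assert y.shape == (8, 16, 4)
```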
einops
jaxtyping specifies dimensions in types: Float[torch.Tensor, "batch seq head hidden"] -- just documentation
reduce, rearrange
flops
intuition
n points, each with d dims
map to a k dim vector
matmul: every i,j,k triple: one multiplication and one addition
2 times product of all dimensions
crude estimate for order of magnitude
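The triple-counting rule as one line of code (crude, as noted, but good for orders of magnitude):

```python
# FLOPs for (n, d) @ (d, k): one multiply + one add per (i, j, k) triple.
def matmul_flops(n, d, k):
    return 2 * n * d * k

# e.g. projecting 1024 points of dim 512 down to 64 dims:
print(matmul_flops(1024, 512, 64))   # 67108864
```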
hardware is designed for large matrix multiplication
generally only consider regimes where models are dominant
wall clock time
MFU: model flops utilization -- actual flop per sec / promised flop per second
0.5 is quite good
dominated by matmul
ignores all communication / overhead
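MFU from observed throughput, as a sketch (the 1.2e6 tokens/sec here is a made-up number for illustration, not a measurement):

```python
# MFU = achieved FLOP/s / promised FLOP/s, using 6 * params FLOPs per token.
def mfu(tokens_per_sec, params, n_gpus, peak_flops_per_gpu):
    achieved = 6 * params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)

# hypothetical throughput for the 70B / 1024 H100 setup above
print(round(mfu(1.2e6, 70e9, 1024, 989.5e12), 2))   # ~0.5
```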
gradients (backward pass): 4 * total parameters FLOPs per token
forward is 2 * params, which is why an NN's total is roughly 6 * total params per token
this is the bulk of the computation for many models
works for most standard models
initialization
randomness
torch.manual_seed, np.random.seed, random.seed
data loading
np.memmap mapped to a file, loading on demand
optimizer
optimizer memory
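A common per-parameter accounting for mixed-precision training (assumes AdamW with bf16 compute and fp32 master weights; activations come on top of this):

```python
# bf16 params (2) + bf16 grads (2) + fp32 master copy (4) + Adam m (4) + Adam v (4)
bytes_per_param = 2 + 2 + 4 + 4 + 4      # = 16 bytes
params = 70e9
total = params * bytes_per_param
print(f"{total / 1e12:.2f} TB of state")  # ~1.12 TB, before activations
```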
checkpointing