Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
Upgrading the CUDA programming model
Make it simpler & more accessible to program GPUs
Allow more languages: particularly Python
Stable abstraction for tensor cores; day-0 compatibility with future GPU architectures
SIMT: blocks of warps, warps of threads, with code running in individual threads
Tile-based model
SIMT: every thread gets some elements of the block
TILE: just dispatch each tile as a single array; the system handles mapping it to threads
SIMT: have to explicitly think about DRAM access, compute strategy, warp specialization
CUDA Tile is responsible for making blocks and dividing data into tiles
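A rough way to picture the difference, in plain NumPy on the CPU (this is only an illustration of the two mental models, not the cuTile API; all names here are mine):

```python
import numpy as np

# SIMT mental model: you write the per-thread body and do the index math
# yourself, explicitly looping over blocks and the threads inside them.
def simt_add(a, b, block_dim, grid_dim):
    out = np.empty_like(a)
    for block in range(grid_dim):           # blocks in the grid
        for thread in range(block_dim):     # threads in a block
            i = block * block_dim + thread  # explicit index computation
            if i < a.size:                  # explicit bounds guard
                out[i] = a[i] + b[i]
    return out

# Tile mental model: each kernel instance sees a whole tile as one array;
# how the tile maps onto threads is the system's problem, not yours.
def tile_add(a, b, tile_size):
    out = np.empty_like(a)
    for t in range(0, a.size, tile_size):
        ta = a[t:t + tile_size]             # "load" a tile
        tb = b[t:t + tile_size]
        out[t:t + tile_size] = ta + tb      # one whole-tile operation
    return out
```

Same result either way; the tile version just has no per-element index arithmetic to get wrong.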
(Discussion) -- hints can become weird
the system is full of performance cliffs
NVIDIA can do a better job at providing predictability without raising the level of abstraction
created a new CUDA Tile IR for targeting NVIDIA GPUs
Tile IR provides a stable, portable way to target tensor cores
open sourced as an MLIR dialect
positioned like PTX as another part of CUDA -- inside the platform
drivers can JIT it
lots of iteration on this with lessons learned from previous attempts
cuTile -> Tile IR
Triton -> Tile IR
programming model is different from Triton's
could also use Numba
cuTile -- Python instantiation of the tile model
programming model shift is the important bit
think of tiles as local registers
updating them rebinds the name, doesn't publish to memory
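The "tiles as local registers" idea, sketched in plain Python (illustrative only; the function and parameter names are assumptions, not cuTile's API): rebinding the tile name is like updating a register, and nothing reaches memory until an explicit store.

```python
import numpy as np

def scale_then_store(gmem, tile_index, tile_size, alpha):
    """Illustrative tile-style kernel body (hypothetical names, not cuTile)."""
    lo = tile_index * tile_size
    t = gmem[lo:lo + tile_size].copy()  # "load": tile is now a local value
    t = t * alpha                       # rebinds t; memory is untouched
    t = t + 1.0                         # still only local, register-like values
    gmem[lo:lo + tile_size] = t         # only the explicit store publishes
```

If the kernel returned before the final line, global memory would be unchanged no matter how many times `t` was rebound.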
can have SIMT and tile kernels running simultaneously
design: ability to annotate certain device functions and re-export them
PTX has a few % overhead compared to SASS
cuTile has tradeoffs
by steady state next year, it will support Ampere, Hopper, etc.
cuTile GitHub repo to follow along, I think
can see the IR generated from cuTile for everything
seems very new; will have to spend some time working with it once I have easy access to hardware that can use it
also coming to Rust
bytecode can be exported and included in fatbin files, language-free