Hands-On Large Language Models
This has been a better and much more comprehensive read than I expected.
Notes:
Tokenizers
- A comparison of multiple generations of tokenizers and the design choices behind them, all the way from BERT to GPT-4 to LLaMA (increasing vocab sizes, better handling of code, custom special tokens, etc.); see the side-by-side sketch after this list.
- The primary design decisions for a tokenizer: its parameters (vocab size, handling of capitalization), the dataset it's trained on, and the algorithm (BPE, WordPiece, SentencePiece).
- LLMs can also produce contextualized embeddings, useful for recommendation, summarization, ranking, etc.; a ranking sketch follows below.
- word2vec uses skip-gram with negative sampling, a form of contrastive training (sketch at the end of these notes).
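
To make the tokenizer comparison concrete, here's a minimal sketch that runs the same string through two representative tokenizers via Hugging Face's transformers. The checkpoint names are my choice, not the book's; GPT-4's tokenizer ships separately in the tiktoken package rather than on the Hub.

```python
from transformers import AutoTokenizer

text = "Tokenizers treat CAPITALS and code like `x += 1` very differently."

# Two representative generations; GPT-4's tokenizer lives in the
# separate tiktoken package, so it isn't loaded here.
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size = {tok.vocab_size}")
    print("  tokens:", tok.tokenize(text))
```

Running this shows the design choices directly: the uncased BERT WordPiece tokenizer lowercases everything and marks subwords with ##, while GPT-2's byte-level BPE preserves case and whitespace.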
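
As a sketch of the ranking use case: embed a query and some documents, then sort by cosine similarity. This uses sentence-transformers with a popular default checkpoint; both the library and the model name are my assumptions, not something the book mandates.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

query = "how do tokenizers split text?"
docs = [
    "BPE merges frequent character pairs into new vocabulary entries.",
    "word2vec learns static embeddings with skip-gram training.",
    "The weather today is sunny with a light breeze.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```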
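
And for the word2vec bullet, gensim exposes exactly those two knobs: sg=1 selects skip-gram, and negative=k turns on negative sampling (the contrastive part: each observed word-context pair is trained against k sampled noise words). A toy sketch; the corpus here is obviously made up.

```python
from gensim.models import Word2Vec

# Tiny made-up corpus, repeated so the toy vocab gets enough updates.
sentences = [
    ["tokenizers", "split", "text", "into", "tokens"],
    ["embeddings", "map", "tokens", "to", "vectors"],
    ["word2vec", "learns", "vectors", "from", "context"],
] * 100

model = Word2Vec(
    sentences,
    sg=1,          # skip-gram: predict context words from the center word
    negative=5,    # negative sampling: 5 noise words per positive pair
    vector_size=50,
    window=2,
    min_count=1,
    epochs=10,
)

print(model.wv.most_similar("tokens", topn=3))
```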
— Kunal