Hands-On Large Language Models
This has been a better and much more comprehensive read than I expected.
Notes:
Tokenizers
- A comparison of multiple generations of tokenizers and the design choices behind them, all the way from BERT to GPT-4 to LLaMA (increasing vocab sizes, better handling of code, custom special tokens, etc.); see the side-by-side sketch after this list.
- The primary design decisions for a tokenizer: its parameters (vocab size, handling of capitalization), the dataset it's trained on, and the algorithm (BPE, WordPiece, SentencePiece).
- LLMs can also produce contextualized embeddings, useful for recommendation, summarization, ranking, etc.; a ranking sketch follows below.
- word2vec uses skip-gram with negative sampling, a form of contrastive training (sketch at the end of these notes).
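
To make the tokenizer comparison concrete, here's a minimal sketch that runs the same string through two representative tokenizers via Hugging Face's transformers. The checkpoint names are my choice, not the book's; GPT-4's tokenizer ships separately in the tiktoken package rather than on the Hub.

```python
from transformers import AutoTokenizer

text = "Tokenizers treat CAPITALS and code like `x += 1` very differently."

# Two representative generations; GPT-4's tokenizer lives in the
# separate tiktoken package, so it isn't loaded here.
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size = {tok.vocab_size}")
    print("  tokens:", tok.tokenize(text))
```

Running this shows the design choices directly: the uncased BERT WordPiece tokenizer lowercases everything and marks subwords with ##, while GPT-2's byte-level BPE preserves case and whitespace.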
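
As a sketch of the ranking use case: embed a query and some documents, then sort by cosine similarity. This uses sentence-transformers with a popular default checkpoint; both the library and the model name are my assumptions, not something the book mandates.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

query = "how do tokenizers split text?"
docs = [
    "BPE merges frequent character pairs into new vocabulary entries.",
    "word2vec learns static embeddings with skip-gram training.",
    "The weather today is sunny with a light breeze.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```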
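
And for the word2vec bullet, gensim exposes exactly those two knobs: sg=1 selects skip-gram, and negative=k turns on negative sampling (the contrastive part: each observed word-context pair is trained against k sampled noise words). A toy sketch; the corpus here is obviously made up.

```python
from gensim.models import Word2Vec

# Tiny made-up corpus, repeated so the toy vocab gets enough updates.
sentences = [
    ["tokenizers", "split", "text", "into", "tokens"],
    ["embeddings", "map", "tokens", "to", "vectors"],
    ["word2vec", "learns", "vectors", "from", "context"],
] * 100

model = Word2Vec(
    sentences,
    sg=1,          # skip-gram: predict context words from the center word
    negative=5,    # negative sampling: 5 noise words per positive pair
    vector_size=50,
    window=2,
    min_count=1,
    epochs=10,
)

print(model.wv.most_similar("tokens", topn=3))
```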
— Kunal