Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
Assignment 1
Notes
- 154,998 characters across 168 scripts in Unicode 16.0
ord = char -> int; chr = int -> char
- UTF-8 covers 90% of web pages
- subword tokenization:
- tokenizing raw bytes leads to massive sequences, increasing the computation needed at every step of the model
- word tokenizers lead to a lot of out-of-vocabulary words showing up in training
- trade off larger vocab for better compression
- training a bpe tokenizer
- vocab from bytestring token to integer; start with 256 -- one for each byte
- pre-tokenize the corpus
- reduce computation cost for merging
- eliminate duplicate tokens that only differ in punctuation
- gpt2's regex:
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
- do not consider pairs crossing pre-token boundaries for efficiency
- ties are broken by choosing the lexicographically greater pair
- implementation
- parallelize pre-tokenization
- remove special tokens before pre-tokenization
- incrementally update counts after merge
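The training loop in the notes above can be sketched end-to-end. This is a toy version under stated simplifications: whitespace pre-tokenization stands in for GPT-2's regex (which needs the third-party `regex` module for `\p{L}`/`\p{N}`), and pair counts are recomputed from scratch each merge rather than updated incrementally.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: vocab starts as raw bytes; each step merges the
    most frequent adjacent pair, breaking ties by the lexicographically
    greater pair."""
    # Pre-tokenize on whitespace (stand-in for GPT-2's regex).
    words = Counter(text.split())
    # Represent each pre-token as a tuple of single-byte tokens.
    corpus = {tuple(bytes([b]) for b in w.encode("utf-8")): c
              for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count pairs only within pre-tokens, never across boundaries.
        pairs = Counter()
        for toks, c in corpus.items():
            for pair in zip(toks, toks[1:]):
                pairs[pair] += c
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically greater pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Rebuild the corpus with the chosen pair fused into one token.
        new_corpus = Counter()
        for toks, c in corpus.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus[tuple(out)] += c
        corpus = new_corpus
    return merges

merges = train_bpe("low low low lower lowest", 2)
# first merge: (b'o', b'w') wins the 5-5 tie with (b'l', b'o')
```

The real assignment avoids the full recount by updating only the pair counts touched by each merge; the recount here keeps the sketch short.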
Solutions
Set up
- using git subtrees to get all the source
- keeping my solutions separate outside this repo, and installing into it so that the adapters can work
- unicode1
- a) \x00 = NUL or the null character
- b) __repr__ shows the escaped code point ('\x00'); print shows nothing / empty output
- c) Doesn't show up at all in the string when printed; still recorded in the actual string by Python.
>>> str(chr(0))
'\x00'
>>> print(chr(0))
- unicode2
- a) Preferring to train on UTF-8 encoded bytes because
- most dominant encoding
- 1-4 bytes per character in UTF-8
- UTF-16 can be more expensive and larger
- b) will break with multibyte characters, like "智" = b'\xe6\x99\xba'
- c) same as above