Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
Assignment 1
Notes
- 154,998 characters across 168 scripts in Unicode 16.0
ord = char -> int; chr = int -> char
- UTF-8 covers 90% of web pages
- subword tokenization:
- tokenizing raw bytes leads to massive sequences, increasing the computation needed at every step of the model
- word tokenizers lead to a lot of out-of-vocabulary words showing up in training
- trade off larger vocab for better compression
- training a bpe tokenizer
- vocab from bytestring token to integer; start with 256 -- one for each byte
- pre-tokenize the corpus
- reduce computation cost for merging
- eliminate duplicate tokens that only differ in punctuation
- gpt2's regex:
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
- do not consider pairs crossing pre-token boundaries for efficiency
- ties are broken by choosing the lexicographically greater pair
- implementation
- parallelize pre-tokenization
- remove special tokens before pre-tokenization
- incrementally update counts after merge
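The training loop in the notes above can be sketched end-to-end. This is a toy version under stated simplifications: whitespace pre-tokenization stands in for GPT-2's regex (which needs the third-party `regex` module for `\p{L}`/`\p{N}`), and pair counts are recomputed from scratch each merge rather than updated incrementally.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: vocab starts as raw bytes; each step merges the
    most frequent adjacent pair, breaking ties by the lexicographically
    greater pair."""
    # Pre-tokenize on whitespace (stand-in for GPT-2's regex).
    words = Counter(text.split())
    # Represent each pre-token as a tuple of single-byte tokens.
    corpus = {tuple(bytes([b]) for b in w.encode("utf-8")): c
              for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count pairs only within pre-tokens, never across boundaries.
        pairs = Counter()
        for toks, c in corpus.items():
            for pair in zip(toks, toks[1:]):
                pairs[pair] += c
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically greater pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Rebuild the corpus with the chosen pair fused into one token.
        new_corpus = Counter()
        for toks, c in corpus.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus[tuple(out)] += c
        corpus = new_corpus
    return merges

merges = train_bpe("low low low lower lowest", 2)
# first merge: (b'o', b'w') wins the 5-5 tie with (b'l', b'o')
```

The real assignment avoids the full recount by updating only the pair counts touched by each merge; the recount here keeps the sketch short.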
Solutions
Set up
- using git subtrees to get all the source
- keeping my solutions separate outside this repo, and installing into it so that the adapters can work
- unicode1
- a) \x00 = NUL or the null character
- b) __repr__ shows the escaped code point ('\x00'); print shows nothing / empty output
- c) Doesn't show up at all in the string when printed; still recorded in the actual string by Python.
>>> str(chr(0))
'\x00'
>>> print(chr(0))
- unicode2
- a) Preferring to train on UTF-8 encoded bytes because
- most dominant encoding
- 1-4 bytes per character in UTF-8
- UTF-16 can be more expensive and larger
- b) will break with multibyte characters, like "智" = b'\xe6\x99\xba'
- c) same as above