Writing an LLM from scratch, part 2
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and planning to post every day (or at least, every day I read some of it -- Christmas day I suspect I'll not be posting) with notes on what I found interesting.
I had been planning to do a chapter a day, but that is looking optimistic for such a dense book! So today, I've read the first half or so of Chapter 2, "Working with text data". This gives an overview of the pre-processing that happens to text before it hits the LLM, goes on to describe a simple tokenization system (complete with source code), and then briefly covers the byte pair encoding method that we'll actually be using for the LLM.
The overview
The core idea here is that LLMs cannot actually process text directly -- not even words. I think it's pretty common knowledge that the starting point for inputting text to one is to tokenize it (there was a lot of that during my adventures with fine-tuning), but it turns out that that is only the first step.
The process is:
- Get the raw text, which is a series of characters (the purist in me wants to either say "bytes" here, or to add a preliminary step which is "work out what encoding was used for the text and convert it into something consistent across documents").
- Tokenize that, so that you get a series of token IDs -- integers, one per word or sub-word.
- For each token, generate an embedding.
...and then it's the embeddings that are actually sent to the LLM.
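To make those three steps concrete, here's a minimal sketch of the pipeline in Python -- the toy vocabulary and the random embedding matrix are my own inventions, just to show the shapes involved, not anything from the book:
import numpy as np

# Step 1: the raw text
text = "the cat sat"

# Step 2: tokenize it into token IDs, here with a toy word-level vocabulary
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[word] for word in text.split()]  # [0, 1, 2]

# Step 3: look up an embedding vector for each token ID. Real models learn
# this matrix; random values are enough to show what actually reaches the LLM.
embedding_dim = 8
embedding_matrix = np.random.rand(len(vocab), embedding_dim)
embeddings = embedding_matrix[token_ids]  # shape (3, 8) -- one vector per token
print(embeddings.shape)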
Introduction to word embeddings
The embeddings in question are just one-per-token, not the big embeddings of a whole sentence or document that were being used in the encoder-decoder transformer presented in the first chapter. But they're still just high-dimensional vectors that represent in some abstract way the meaning of the tokens in question.
Raschka gives as an example the Word2vec embedding, which was quite big news a while back because you could do arithmetic with the vectors it produced and get reasonable results. For example, using w2v as an imaginary function that takes a word and returns the embedding vector, and treating + and - as element-wise operators on vectors, we can see stuff like this:
w2v("king") - w2v("man") + w2v("woman") ~= w2v("queen")
or:
w2v("Paris") - w2v("France") + w2v("Germany") ~= w2v("Berlin")
...but more problematically:
w2v("doctor") - w2v("man") + w2v("woman") ~= w2v("nurse")
...which made it clear that keeping training material bias-free was going to be important.
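If you want to play with that kind of vector arithmetic yourself, gensim's pre-trained Word2vec vectors make it easy -- this is my own sketch, not something from the book, and loading the model involves a large download the first time:
import gensim.downloader

# Pre-trained Google News Word2vec vectors (a large download on first use)
w2v = gensim.downloader.load("word2vec-google-news-300")

# "king" - "man" + "woman": the nearest remaining vector should be "queen"
print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=1))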
Anyway, he notes that the embeddings used for LLMs are generally not pre-created ones like Word2vec but are instead generated by embedding engines that are trained alongside the LLM itself. One surprise for me was how many dimensions these have -- he says 768 dimensions for GPT-2 but a massive 12,288 dimensions for GPT-3!
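In PyTorch terms -- and this is my framing, not the book's code at this point -- that trained-alongside-the-model embedding is just an nn.Embedding layer whose weights get updated by backprop like any other parameter, and the GPT-2 numbers give a sense of the scale:
import torch

vocab_size = 50257   # GPT-2's BPE vocabulary size
embedding_dim = 768  # GPT-2 (small); GPT-3 reportedly uses 12,288

token_embedding = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([15496, 11, 995])  # some arbitrary token IDs
vectors = token_embedding(token_ids)
print(vectors.shape)  # torch.Size([3, 768])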
He also briefly touches on the existence of embedding models for audio and video, and while he doesn't mention it explicitly, that suggests to me how the multi-modal LLMs that we have now might work. After all, if an LLM just takes embeddings as an input, then so long as they're compatible in some sense, there's no reason why it might not reason about audio embeddings or image embeddings just like it does about word embeddings. That's not much more than speculation on my part, though.
Something else that occurred to me while reading this is that if an LLM is working specifically with embeddings for its calculations, it seems plausible that the immediate output might be an embedding for the next token rather than the token itself. We'd then somehow need to map from that embedding back to a token, which would be hard -- finding the nearest token to an arbitrary embedding in a space with thousands of dimensions is an expensive operation -- so perhaps that embedding -> LLM -> embedding symmetry is just a mistake on my part, and the LLM really does just produce a next token (or rather, a set of tokens with associated probabilities). We'll see!
Tokenization: writing your own
This part had less new stuff for me -- the concept of tokenization, converting words (or parts of words) into numbers, is something that I think anyone reading this will be familiar with, and the implementation that Raschka works through is a pretty simple and clear one.
For the first version, we split text into words and punctuation, e.g.
"Hello, how are you?" -> ["Hello", ",", "how", "are", "you", "?"]
(He notes that whitespace can be kept if required for the use case, e.g. Python code.)
Once you've done that, assign a unique integer for each unique word/punctuation mark, and build a mapping from one to the other.
With that, you can build a simple class with an encode method to map text to lists of token IDs and a decode method to convert such a list back to text.
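A rough sketch of what such a class might look like -- my own version of the idea, not the book's exact code:
import re

class SimpleTokenizer:
    def __init__(self, text):
        # Split on punctuation and whitespace, keeping the punctuation as tokens
        tokens = [t.strip() for t in re.split(r'([,.?_!"()\']|--|\s)', text) if t.strip()]
        self.str_to_id = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def encode(self, text):
        tokens = [t.strip() for t in re.split(r'([,.?_!"()\']|--|\s)', text) if t.strip()]
        return [self.str_to_id[tok] for tok in tokens]

    def decode(self, ids):
        text = " ".join(self.id_to_str[i] for i in ids)
        # Undo the space that the join put before each punctuation mark
        return re.sub(r'\s+([,.?!"()\'])', r"\1", text)
Encoding and then decoding should round-trip text like the example above, give or take some whitespace.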
However, if it encounters a word it hasn't seen before, it will break. So for the second version, we enhance it to handle unknown words. He does this by adding the string "<|unk|>" to the vocabulary before generating the word-to-ID mapping, so it gets an ID of its own, and then modifying the encoder so that if it encounters a word that it doesn't recognise, it outputs that token. He also adds an "<|endoftext|>" token so that different documents in the same input stream can be separated.
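The change for unknown words is small: give the special strings IDs of their own, then fall back to "<|unk|>" in encode. Sketching it against my class above (again, my own version, not the book's exact code):
class SimpleTokenizerV2(SimpleTokenizer):
    def __init__(self, text):
        super().__init__(text)
        # Add the special strings to the vocabulary so they get IDs too
        for special in ("<|endoftext|>", "<|unk|>"):
            new_id = len(self.str_to_id)
            self.str_to_id[special] = new_id
            self.id_to_str[new_id] = special

    def encode(self, text):
        tokens = [t.strip() for t in re.split(r'([,.?_!"()\']|--|\s)', text) if t.strip()]
        unk_id = self.str_to_id["<|unk|>"]
        # Any word not in the vocabulary becomes the <|unk|> token
        return [self.str_to_id.get(tok, unk_id) for tok in tokens]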
This is a good solution, and works well, but the impression I've got from reading around previously is that it's not current best practice (as in, as of GPT-4). I imagine it's the system used here because all Raschka is trying to achieve with this code is an example of how tokenization works -- that is, the code is a simple implementation for pedagogical purposes.
But I'll explain what I understand to be the problem with it anyway, mostly as a way to make sure I really do understand it properly :-)
If "<|endoftext|>"
appears as a string in a vocabulary for a naive encoder that maps from words to token
IDs, then someone could put that exact string into a prompt for an LLM application,
and get that token. I'm pretty sure I remember some early prompt injections/jailbreaks
for the original launch of ChatGPT that used tricks that appeared to work along those lines.
There are ways to avoid that (see the next section for a good example), but it has the feel of something that someone sufficiently talented at jailbreaking might be able to work around. Whatever part of my brain sets off alarms at potential SQL injections in website code is certainly getting triggered when I see it.
From what I've gathered (and I may be completely wrong on this), the latest tokenizers don't have any specific string mapping to the special tokens -- that is, they're just token IDs. There would presumably be some way to output them in the decode method, but there would be no way to generate one by using specially formatted text in the encode method. So, for example, the only way to generate the token ID reserved for unknown words would be to provide an unknown word.
Anyway, all of this is moot, because the tokenizer we'll be using for the LLM itself is not the one above, but something more sophisticated.
Tokenization: byte pair encoding
The problem with using tokenization with a fixed vocabulary like the one above is that either:
- You wind up having to scan your whole training set to find all of the unique words (if you're training on a scrape of the whole Internet you're going to find all kinds of things like lksfdklkajfdfklj). You get a massive vocabulary with a bunch of almost-never-used words, and yet, after training, you'll still have problems with inputs that contain words that you didn't see in training.
- Or you just accept that there will be unknown words during training too, which limits what your LLM can learn.
We can avoid that problem with a more sophisticated tokenizer. Raschka gives some sample code using OpenAI's tiktoken library, which uses byte pair encoding to do tokenization. Byte pair encoding is a system where the tokenizer has learned its own set of tokens through a training process -- that is:
- It started off with tokens for all of the letters, numbers and punctuation marks
- It was then shown a lot of data, spotted common combinations of existing tokens, and created new tokens for them
Unknown words can be represented by the tokens that it does have -- in the worst case, it just needs to spell them out letter-by-letter.
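To make that "spotted common combinations and created new tokens" step a bit more concrete, here's a toy version of a single BPE merge step -- nothing like tiktoken's actual implementation, just the idea:
from collections import Counter

# A tiny "corpus", already split into character-level tokens
corpus = [list("lower"), list("lowest"), list("low")]

# Count adjacent pairs of tokens across the corpus
pair_counts = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

# The most frequent pair becomes a new token...
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair)  # e.g. ('l', 'o')

# ...and every occurrence of it gets merged; repeat until you hit
# the target vocabulary size.
def merge(word, pair):
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

corpus = [merge(word, best_pair) for word in corpus]
print(corpus)  # [['lo', 'w', 'e', 'r'], ['lo', 'w', 'e', 's', 't'], ['lo', 'w']]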
The sample code he shows gives an example:
In [64]: import tiktoken
In [65]: tokenizer = tiktoken.get_encoding("gpt2")
In [66]: text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
In [67]: integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
In [68]: tokenizer.decode(integers)
Out[68]: 'Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.'
I found it particularly interesting to see what the specific tokens were:
In [69]: [tokenizer.decode([ii]) for ii in integers]
Out[69]:
['Hello',
',',
' do',
' you',
' like',
' tea',
'?',
' ',
'<|endoftext|>',
' In',
' the',
' sun',
'lit',
' terr',
'aces',
' of',
' some',
'unknown',
'Place',
'.']
So you can see that "someunknownPlace" was broken up into the separate tokens for "some", "unknown", and "Place".
That also highlights something I'd gathered previously from reading elsewhere -- modern tokenisers often include leading spaces as part of the token itself, so, for example, " do" and "do" are different tokens with different IDs:
In [70]: print(tokenizer.encode("do do"))
[4598, 466]
...and when decoded look different:
In [71]: [tokenizer.decode([ii]) for ii in tokenizer.encode("do do")]
Out[71]: ['do', ' do']
There's also that allowed_special parameter in the encode call, which is a precaution against the kind of jailbreaking issue I mentioned earlier. With it, the tokenizer will happily parse the text as a special token:
In [72]: [tokenizer.decode([ii]) for ii in tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})]
Out[72]: ['<|endoftext|>']
Without it, the tokenizer recognizes it and rejects it:
In [73]: [tokenizer.decode([ii]) for ii in tokenizer.encode("<|endoftext|>")]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[73], line 1
----> 1 [tokenizer.decode([ii]) for ii in tokenizer.encode("<|endoftext|>")]
File ~/.virtualenvs/llm-from-scratch/lib/python3.12/site-packages/tiktoken/core.py:117, in Encoding.encode(self, text, allowed_special, disallowed_special)
115 disallowed_special = frozenset(disallowed_special)
116 if match := _special_token_regex(disallowed_special).search(text):
--> 117 raise_disallowed_special_token(match.group())
119 try:
120 return self._core_bpe.encode(text, allowed_special)
File ~/.virtualenvs/llm-from-scratch/lib/python3.12/site-packages/tiktoken/core.py:398, in raise_disallowed_special_token(token)
397 def raise_disallowed_special_token(token: str) -> NoReturn:
--> 398 raise ValueError(
399 f"Encountered text corresponding to disallowed special token {token!r}.\n"
400 "If you want this text to be encoded as a special token, "
401 f"pass it to `allowed_special`, e.g. `allowed_special={{{token!r}, ...}}`.\n"
402 f"If you want this text to be encoded as normal text, disable the check for this token "
403 f"by passing `disallowed_special=(enc.special_tokens_set - {{{token!r}}})`.\n"
404 "To disable this check for all special tokens, pass `disallowed_special=()`.\n"
405 )
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
...or, as per that (excellently detailed) error message, you can get it to be parsed as just a sequence of characters with no special meaning:
In [77]: [tokenizer.decode([ii]) for ii in tokenizer.encode("<|endoftext|>", disallowed_special=(tokenizer.special_tokens_set - {'<|endoftext|>'}))]
Out[77]: ['<', '|', 'end', 'of', 'text', '|', '>']
That last one feels a little risky! In a larger system (like some kind of framework with LLMs talking to each other) it's easy to imagine that stream of tokens being reassembled by naive code and then tokenized as if it were trusted.
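As a concrete (and entirely contrived, my own) example of the kind of thing I mean: if a naive downstream component joins those harmless character-level tokens back into a string and then re-encodes it somewhere that trusts its input, the special token comes back:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# The "defanged" version: <|endoftext|> encoded as ordinary characters
harmless_ids = tokenizer.encode("<|endoftext|>", disallowed_special=())
print([tokenizer.decode([i]) for i in harmless_ids])  # ['<', '|', 'end', 'of', 'text', '|', '>']

# A naive component decodes them back into a string...
reassembled = tokenizer.decode(harmless_ids)

# ...and re-encodes that string with special tokens allowed
resurrected = tokenizer.encode(reassembled, allowed_special="all")
print(resurrected)  # [50256] -- the real <|endoftext|> token again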
Summary
Anyway, I felt that this was quite enough for today. Interestingly, I think I've spent about twice as long typing up these notes as I did reading it, but perhaps that's a good balance. I'm certainly sure that I'll remember what I've read much better as a result of the writeup!
The next bit, which I expect to read tomorrow (though it's Christmas Eve, so I may not manage) is on data sampling, followed by creating the embeddings. I was expecting the latter to be pretty complicated, but on a quick scan ahead I see that all we're doing at this stage is generating random embeddings, which makes sense -- the actual values will be learned when we get on to training.
Nits and oddities
Only one thing today -- it really is me nit-picking, but on p28 the output of the command print(tokenizer.decode(ids)) is rendered as
'" It\' s the last to be painted, you know," Mrs. Gisburn said with pardonable pride.'
It should be
" It' s the last to be painted, you know," Mrs. Gisburn said with pardonable pride.
-- my guess is that it was copy-pasted from a CLI session where the command was just tokenizer.decode(ids), so Python provided a repr.
Basically a typo and I feel kind of bad about even mentioning it ;-)