Writing an LLM from scratch, part 1
Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting.
Today, it was what is most likely the easiest bit: the introductory chapter 1, "Understanding large language models".
As you would expect, this is mostly reiterating things that anyone who's even tangentially connected to LLMs will already know, as Raschka needs to set the scene. However, there is a bit of information about some of the underlying technical concepts -- still quite hand-wavy stuff at this stage, as he has to introduce terms for concepts that will require whole chapters to explain later.
Transformers
The core message is that the transformer architecture is what makes LLMs so powerful -- he notes that LLMs don't have to be built on transformers, and transformer-based LLMs don't have to be GPTs, but says that he'll ignore that point for conciseness in the book, which is fair enough -- if he had to say "transformer-based LLM" or "GPT LLM" rather than just "LLM" every time, the book would be much harder to read. Not to mention the fact that "GPT" is closely linked to OpenAI in most people's minds.
Anyway, as you'd expect, the core of this chapter is a high-level overview of what transformers are. He takes an approach I like, which is to start off with the history. That's useful because a lot of the terminology used only really makes sense in the context of where it came from originally.
The first transformers were for machine translation, and worked like this:
- The input text (in, say, English) went into an encoder. This was a large neural net that converted the text into an embedding -- a vector (or set of vectors) in a very-high-dimensional space that represented in some abstract way the meaning of the input.
- This embedding was then fed into a decoder, which would use it to generate tokens, one-by-one, in another language (say, German) that expressed the concepts in the embedding -- and was therefore a translation of the original input. It would do this by generating a first token, then using that plus the embedding to generate the next, then those first two tokens plus the embedding to generate the third, and so on -- so it would always have its previous outputs visible to it to help guide the construction of the output (I've sketched that loop in code just below).
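Putting that loop into code made it clearer in my own head. This is just my sketch, not anything from the book -- `encode` and `next_token` are hypothetical stand-ins for the encoder and decoder networks, and `end_of_text` is whatever token the decoder uses to say it has finished:

```python
# My own sketch of the encoder/decoder loop described above -- not code from
# the book.  `encode` and `next_token` are hypothetical stand-ins for the two
# neural nets.
def translate(source_text, encode, next_token, end_of_text, max_tokens=100):
    embedding = encode(source_text)        # encoder: input text -> embedding
    output_tokens = []
    for _ in range(max_tokens):
        # the decoder always sees the embedding plus everything it has
        # generated so far
        token = next_token(embedding, output_tokens)
        if token == end_of_text:
            break
        output_tokens.append(token)
    return output_tokens                   # the translation, token by token
```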
Both the encoder and the decoder would have access to a self-attention system, which is a thing that allows them to know which other tokens to look at when considering a specific token; that is, if it's looking at "the fat cat sat on the mat" (and assuming one word == one token) it might want to look more closely at "fat" when looking at "cat" (because that's the noun the adjective modifies), or at "cat" and "mat" when looking at "sat" (because one is the subject of the verb and the other is where the sitting happened).
This self-attention mechanism is used by both the encoder (which obviously needs it because it's building something like a mental model of the meaning of the sentence in the form of the embedding) and the decoder (which needs it to construct valid sentences).
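Just to pin down what that means in my head, here's a made-up illustration -- mine, not the book's, and the numbers are completely invented -- of the kind of thing I understand self-attention to produce: for each token, a set of weights over the sentence saying how much to "look at" every token (including itself).

```python
# Invented numbers, purely to illustrate the idea of attention weights.
tokens = ["the", "fat", "cat", "sat", "on", "the", "mat"]

# Hypothetical weights for the token "cat" -- mostly itself and "fat":
weights_for_cat = [0.03, 0.40, 0.45, 0.05, 0.02, 0.02, 0.03]

assert len(weights_for_cat) == len(tokens)
assert abs(sum(weights_for_cat) - 1.0) < 1e-9
```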
One thing that is unclear to me at this point is whether the encoder and the decoder share the same attention mechanism (allowing the encoder to pass "hints" to the decoder about how to build its output beyond what is in the embedding) or not. My guess is "not", because I'd expect it to have been mentioned.
Self-attention sounds like magic, and TBH is the one part of LLMs I'm really looking forward to learning about -- I have pretty much no mental model of how it works right now. But, of course, Raschka only mentions it briefly here -- chapter 3 will cover it in some detail.
Anyway, that is the historical setup for a transformer system. The next thing that people discovered was that the encoder and the decoder were actually useful on their own. I've heard models described as "encoder-only" or "decoder-only" -- GPT-based LLMs being one of the latter -- and this explains where that terminology came from.
An example of an encoder-only model is BERT, which I know is used heavily for classification tasks; indeed, Answer.AI recently announced ModernBERT, an updated version that is targeting exactly that. This makes some kind of intuitive sense to me. If you have something that is designed to take in text input and produce an embedding that represents its meaning (in some sense), then tacking on an extra layer to convert that meaning into "the tone of this text is happy/sad/angry/etc" doesn't feel like a huge lift.
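To make that "extra layer" idea concrete for myself, here's a toy sketch -- mine, not from the book; the dimensions and class names are made up, and a random vector stands in for where the encoder's output would go:

```python
# A toy sketch of a classification head on top of an encoder's embedding.
# Not from the book; the encoder itself is faked with a random vector.
import torch
import torch.nn as nn

embedding_dim = 768                     # e.g. BERT-base's hidden size
classes = ["happy", "sad", "angry"]

classifier_head = nn.Linear(embedding_dim, len(classes))

text_embedding = torch.randn(1, embedding_dim)   # stand-in for the encoder output
logits = classifier_head(text_embedding)
predicted_class = classes[logits.argmax(dim=-1).item()]
print(predicted_class)
```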
What is less obvious to me is how an encoder like that might be used to do what these models are also good at: filling in the blanks in a sentence. Raschka shows one taking
This is an ___ of how concise I ___ be
...and predicting that the missing words are "example" and "can". I know that BERT and its ilk can do that, but I've never known exactly why, and sadly this chapter didn't help fill in that gap. I really really want to research it now, as I write this, but I should stay on track :-)
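(For anyone who does want to scratch that itch without going down the rabbit hole: Hugging Face's transformers library has a ready-made pipeline that shows BERT doing this. This isn't from the book, and I've simplified to a single blank, but something like the following should work:)

```python
# Not from the book: watching BERT fill in a blank via the Hugging Face
# transformers library.  [MASK] is BERT's mask token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("This is an [MASK] of how concise I can be."):
    print(prediction["token_str"], prediction["score"])
# I'd expect "example" to be at or near the top of the list.
```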
GPTs are, as I said above, an example of a decoder-only architecture. This is a bit hazier in my head than the classification use case for encoders. It's clear that something that can take an embedding and produce a series of words would be good for text generation. But GPTs work essentially by next-token prediction, eg. if they're provided with
This is an example of how concise I
...it might come back with "can", then when run again with
This is an example of how concise I can
...it might come back with "be".
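That "append the prediction and run it again" loop is simple enough to write down; this is my own illustration rather than anything from the book, with `next_token` as a hypothetical stand-in for the GPT model:

```python
# My own sketch of next-token prediction -- `next_token` is a hypothetical
# function wrapping the model, which looks at all the tokens so far and
# predicts one more.
def generate(prompt_tokens, next_token, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))   # append the prediction and go again
    return tokens
```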
So the question is, how does a decoder-only model like GPT work without having an embedding from an encoder to decode from?
One other thing that Raschka mentions that confuses me a little is that apparently the original transformer architecture had six encoder and six decoder blocks, and GPT-3 has 96 transformer layers. That doesn't fit very comfortably with my model of how this all works. Both encoders and decoders seem like stand-alone things that accept inputs (tokens/embeddings) and produce outputs (embeddings/tokens). What would you do with multiple layers of them?
So -- how a decoder designed to ingest embeddings can be fed plain text, how decoder blocks can be layered, and how the attention mechanism works -- those seem to me to be the things that this book will uncover for me. But at this early stage, we're really just learning about the problems that need to be solved rather than the solutions.
[Update, two hours after posting. This just popped into my mind: if a decoder is designed to take an embedding and the (initially empty) text so far, then by feeding it an "empty" embedding and some pre-written text, if properly trained it might be able to predict the next word. Is this perhaps how GPTs work?]
Data
One thing that is obvious from the get-go is that training an LLM is expensive -- you hear numbers in the hundreds of millions of dollars for recent OpenAI models. And from my own experiments in fine-tuning LLMs I know that training even a 0.5B model locally on 10,000 examples of about 1,000 tokens each (so, 10M tokens) takes about half an hour, while training an 8B model on an 8x A100 80GiB cloud server that costs ~US$10/hour takes 20 minutes for the same amount of data.
GPT-3 was apparently trained on 300 billion tokens, so it's 300,000,000,000 / 10,000,000 = 30,000 times as much work. Let's naively scale my cost/time numbers up -- obviously this will be a very rough approximation of the real numbers, but should be within an order of magnitude or so (I've put the arithmetic into a snippet after the list):
- The local train would take 15,000 hours, which is about 21 months.
- The larger model would take 10,000 hours, which is a bit over 13 months, and would cost US$100,000.
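(For what it's worth, here's my working as a snippet -- my own numbers, nothing from the book:)

```python
# My back-of-the-envelope scaling, not anything from the book.
my_tokens = 10_000 * 1_000            # 10,000 examples of ~1,000 tokens each
gpt3_tokens = 300_000_000_000         # reported GPT-3 training set size
scale = gpt3_tokens / my_tokens       # 30,000x as much data

local_hours = 0.5 * scale             # 15,000 hours -- about 21 months
cloud_hours = (20 / 60) * scale       # 10,000 hours -- a bit over 13 months
cloud_cost = cloud_hours * 10         # ~US$100,000 at ~US$10/hour

print(f"{local_hours:,.0f} hours local, {cloud_hours:,.0f} hours cloud, ${cloud_cost:,.0f}")
```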
TBH both of those are much lower than I expected -- maybe I slipped up in the maths somewhere? -- but anyway, it's way more than anyone reading this book is likely to want to spend, both in time and money.
Raschka's solution to this -- which seems like the most reasonable compromise he could have reached, given the constraints inherent in a book that wants to teach people how to build an LLM from scratch -- is to work through everything required to start training the LLM, train it for a bit to show that it works, and then point people at a set of pretrained weights that they can download and slot in for further fine-tuning.
While it would be nice to have an LLM and be able to say "that's all mine, I trained it completely from scratch", this does seem to be the most reasonable practical workaround, given that really doing it from scratch would be impossibly expensive.
Not-quite-errata
There were a couple of things in the chapter that weren't -- at least to my mind -- quite correct. I think that they can probably be treated as simplifications that Raschka has put in there for the sake of readability, as I'm 100% sure there's no confusion in his own mind about these things!
- When he first mentions LLMs (before he's said that by LLM, he means GPT) he says that they're focused on next-word prediction, which of course isn't entirely true when you consider BERT and similar. I think this is just a case of a little knowledge being a dangerous thing on my side -- I already knew that BERT could handle predicting masked tokens rather than just next word stuff, so I found that more confusing than someone who was not aware of that. [Update: just for clarity, this nit is specifically about the mention on page 2 -- it's all made clear later.]
- He says that "[i]n contrast to deep learning, traditional machine learning requires manual feature extraction", which seems a bit off to me -- after all, there are other ML systems (eg. kernel methods, or even naive Bayes for spam filtering) that can extract features on their own. That's a bit of a nit, though, and it makes sense to keep things simple.
Summary
So, those were my takeaways from chapter one of the book. There were plenty of other things covered, of course, but they were either scene-setting for the future chapters or background information that I already knew well enough that it didn't surprise me.
I'm really looking forward to chapter 2 tomorrow!