Writing an LLM from scratch, part 4

Posted on 28 December 2024 in Programming, Python, AI

I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.

Today I read through chapter 3, which introduces and explains attention mechanisms -- the core architecture that allows LLMs to "understand" the meaning of text in terms of the relationships between words. This feels like the core of the book; at least, for me, it's the part of the underlying workings of LLMs that I understand the least. I knew it was something to do with the LLM learning which other words to pay attention to when looking at a particular one, but that's pretty much it.

And it's a tough chapter. I finished with what I felt was a good understanding, at a high level, of the calculations that make up self-attention in an LLM -- but not of how self-attention itself works. That is, I understood how to write one, in terms of the mathematical steps to follow, but not why those particular operations are the right ones to perform, or why the code I'd write is the code I'd want.

I think this was because I tried to devour it all in a day, so I'm going to go through it much more slowly, writing up notes on each section each day.

Today, I think, I can at least cover the historical explanation of how attention mechanisms came to be in the first place, because that seems reasonably easy to understand.

Let's go back to the example of machine translation. Before transformers, one attempt at machine translation used recurrent neural networks (RNNs) to do a process that's rather like a cut-down version of the transformers system that I described in the first post in this series:

- An encoder RNN reads the input sentence one word at a time, updating its internal hidden state as it goes.
- Once it has read the whole sentence, that hidden state is taken as a fixed-size embedding representing the meaning of the entire input.
- A decoder RNN is then seeded with that embedding and generates the translation, one word at a time.
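Just to make that shape concrete, here's a minimal PyTorch sketch of that kind of pre-attention encoder/decoder pair -- this isn't code from the book, and all of the names and sizes are made up -- showing the encoder squeezing the whole input into one fixed-size vector that the decoder then has to work from:

```python
import torch
import torch.nn as nn

# A minimal sketch (not from the book) of a pre-attention encoder-decoder RNN.
# All of the sizes here are made-up assumptions; the point is just that the
# encoder squeezes the whole input sequence into one fixed-size vector.

EMBED_DIM = 32    # size of the per-token embeddings
CONTEXT_DIM = 64  # size of the single vector the decoder gets to see
VOCAB_SIZE = 1000

embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
encoder = nn.GRU(EMBED_DIM, CONTEXT_DIM, batch_first=True)
decoder = nn.GRU(EMBED_DIM, CONTEXT_DIM, batch_first=True)
to_vocab = nn.Linear(CONTEXT_DIM, VOCAB_SIZE)

# Encode: the input sentence, however long, ends up as one CONTEXT_DIM vector.
input_tokens = torch.randint(0, VOCAB_SIZE, (1, 12))  # batch of 1, 12 tokens
_, context = encoder(embedding(input_tokens))          # context: (1, 1, CONTEXT_DIM)

# Decode: generate output tokens one at a time, seeded only by that vector.
token = torch.zeros(1, 1, dtype=torch.long)            # pretend start-of-sentence token
hidden = context
for _ in range(5):
    output, hidden = decoder(embedding(token), hidden)
    token = to_vocab(output).argmax(dim=-1)             # greedy pick of the next token
    print(token.item())
```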

Now, there's a pretty obvious problem here: the embedding is a fixed size, but it somehow needs to be able to represent pretty much any text that we want to translate. If you've got a 10-dimensional vector of floats and you're trying to pack the whole Wikipedia page for the Visigothic Kingdom into it, you're going to have problems; the embedding would contain, at best, a gist.

But that wasn't the core issue. Even with single sentences -- long, complex ones, but still reasonable sentences -- these encoder-decoder RNNs had problems.

I must admit, this was pretty hard to get completely straight in my head, and I'm not sure I'm all of the way there yet. But here's a first try.

To get a good concise example to explain the problem, it's a good idea to use German as the target language because it has a habit of piling verbs up at the end of the sentence, which can make even (relatively) simple sentences have long-range dependencies between words that wouldn't exist in equivalently simple English sentences. The problem holds for complex sentences in all languages, though, at least as I understand it.

Let's say that we want to translate this English sentence into German:

The dog, who must have been able to see the cat chasing through the garden.

The German would be:

Der Hund, der die Katze durch den Garten hatte jagen sehen können müssen.

...which literally translates as:

The dog, who the cat through the garden had chasing see can must

Now, somehow the decoder, having emitted "Der Hund, der" needs to keep track of the fact that at some later point at the end of the sentence it will need to add on all of those verbs -- but in the meantime it needs to emit the stuff about the cat and the garden, while continuing to maintain the state about the verbs.

So there are two ways you could do this if you were a human translator:

- You could hold all of those pending verbs in your head while you write out the middle of the sentence, and then unload them at the end.
- Or you could just write the middle of the sentence, and when you get to the end, glance back at the original English to see which verbs you still need to deal with.

The second option sounds much easier. And the breakthrough that moved in that direction was Bahdanau attention, named after Dzmitry Bahdanau, the first author of the paper that introduced it (immediate follow on X :-).

The core idea here was to give the decoder access to the input sequence as well as to the embedding, and to provide it with a system that told it which words in the input to look at -- to pay attention to -- at each output step, I assume with some kind of input from the embedding. The weights that controlled that system were learned along with everything else during the training process for the encoder/decoder pair.
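As a rough illustration of what that "system" might look like -- this is just my own sketch of the general idea, not the exact formulation from the Bahdanau et al. paper, and all of the layer names and sizes are invented -- at each output step you score every encoder output against the decoder's current state, turn those scores into weights with a softmax, and hand the decoder a weighted blend of the encoder outputs rather than just the one fixed embedding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A rough sketch of additive ("Bahdanau-style") attention for one decoding step.
# Names and dimensions are my own assumptions, not the paper's exact formulation.

HIDDEN_DIM = 64
SEQ_LEN = 12

# Learned layers that make up the scoring function.
score_decoder = nn.Linear(HIDDEN_DIM, HIDDEN_DIM, bias=False)
score_encoder = nn.Linear(HIDDEN_DIM, HIDDEN_DIM, bias=False)
score_out = nn.Linear(HIDDEN_DIM, 1, bias=False)

# Stand-ins for the encoder outputs (one vector per input word) and the
# decoder's hidden state at the current output step.
encoder_outputs = torch.randn(SEQ_LEN, HIDDEN_DIM)
decoder_state = torch.randn(HIDDEN_DIM)

# One score per input word: how relevant is it to what we're emitting right now?
scores = score_out(torch.tanh(
    score_decoder(decoder_state) + score_encoder(encoder_outputs)
)).squeeze(-1)                                   # shape: (SEQ_LEN,)

attention_weights = F.softmax(scores, dim=0)     # sums to 1 across the input words
context = attention_weights @ encoder_outputs    # shape: (HIDDEN_DIM,)

# The decoder would now combine `context` with its own state to pick the next word.
print(attention_weights)
```

The three Linear layers are the learned part -- the weights that control which input words get looked at -- and they get trained jointly with the encoder and decoder.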

The next step after this was the drastic simplification that led to the architecture that I tried to summarise in the first post in this series -- transformers, which replaced the complex RNNs used in this encoder-decoder setup with much simpler feed-forward neural networks, leaning on the attention mechanism to do the heavy lifting. And having just googled the paper that introduced transformers, "Attention Is All You Need", I'm delighted that I fully understand the introductory sentences in the abstract:

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

So that's a big win, at least for me!

But Raschka now moves on to talking about self-attention (as opposed to the more generic attention that we've been discussing so far) and I feel I have some kind of mental block I need to work through in order to understand that well enough to put it into words. So I'm going to stop here and sleep on that; hopefully I'll be posting a clear and... well, probably not concise, explanation of how I understand that to work tomorrow.