Writing an LLM from scratch, part 6 -- starting to code self-attention
This is the sixth in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too. This post covers just one subsection of the trickiest chapter in the book -- subsection 3.3.1, "A simple self-attention mechanism without trainable weights". I feel that there's enough in there to make up a post on its own. For me, it certainly gave me one key intuition that I think is a critical part of how everything fits together.
As always, there may be errors in my understanding below -- I've cross-checked and run the whole post through Claude, ChatGPT o1, and DeepSeek r1, so I'm reasonably confident, but caveat lector :-) With all that said, let's go!
Writing an LLM from scratch, part 5 -- more on self-attention
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it. In retrospect, it was kind of adorable that I thought I could get it all done over my Christmas break, given that I managed just the first two-and-a-half chapters! However, now that the start-of-year stuff is out of the way at work, hopefully I can continue. And at least the two-week break since my last post in this series has given things some time to stew.
In the last post I was reading about attention mechanisms and how they work, and was a little thrown by the move from attention to self-attention, and in this blog post I hope to get that all fully sorted so that I can move on to the rest of chapter 3, and then the rest of the book. Rashka himself said on X that this chapter "might be the most technical one (like building the engine of a car) but it gets easier from here!" That's reassuring, and hopefully it means that my blog posts will speed up too once I'm done with it.
But first: on to attention and what it means in the LLM sense.
Writing an LLM from scratch, part 4
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.
Here's a link to the previous post in this series.
Today I read through chapter 3, which introduces and explains attention mechanisms -- the core architecture that allows LLMs to "understand" the meaning of text in terms of the relationships between words. This feels like the core of the book; at least, for me, it's the part of the underlying workings of LLMs that I understand the least. I knew it was something to do with the LLM learning which other words to pay attention to when looking at a particular one, but that's pretty much it.
And it's a tough chapter. I finished with what I felt was a good understanding at a high level of how the calculations that make up self-attention in an LLM work -- but not of how self-attention itself works. That is, I understood how to write one, in terms of the steps to follow mathematically, but not why that specific code would be what I would write or why we would perform those mathematical operations.
I think this was because I tried to devour it all in a day, so I'm going to go through much more slowly, writing up notes on each section each day.
Today, I think, I can at least cover the historical explanation of how attention mechanisms came to be in the first place, because that seems reasonably easy to understand.
Writing an LLM from scratch, part 3
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.
Here's a link to the previous post in this series.
Today I was working through the second half of Chapter 2, "Working with text data", which I'd started just before Christmas. Only two days off, so it was reasonably fresh in my mind :-)
Writing an LLM from scratch, part 2
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and planning to post every day (or at least, every day I read some of it -- Christmas day I suspect I'll not be posting) with notes on what I found interesting.
Here's a link to the previous post in this series.
I had been planning to do a chapter a day, but that is looking optimistic for such a dense book! So today, I've read the first half or so of Chapter 2, "Working with text data". This gives an overview of the pre-processing that happens to text before it hits the LLM, goes on to describe a simple tokenization system (complete with source code), and then briefly covers the byte pair encoding method that we'll actually be using for the LLM.
Writing an LLM from scratch, part 1
Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting.
Today, it was what is most likely the easiest bit; the introductory chapter 1, "Understanding large language models".