Writing an LLM from scratch, part 6b -- a correction

Posted on 28 January 2025 in Programming, Python, AI, LLM from scratch

This is a correction to the sixth in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

I realised while writing the next part that I'd made a mistake -- while trying to get an intuitive understanding of attention mechanisms, I'd forgotten an important point from the end of my third post. When we convert our tokens into embeddings, we generate two for each one: a token embedding, which captures what the token means, and a position embedding, which captures where it appears in the input sequence.

These two are added element-wise to get an input embedding, which is what is fed into the attention mechanism.
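
In case it helps, here's a minimal PyTorch sketch of that step. The vocabulary size, context length, embedding dimension and token IDs below are all made up for illustration rather than taken from the book:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Made-up sizes for illustration -- not values from the book.
vocab_size, context_length, embed_dim = 50257, 7, 8

token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Embedding(context_length, embed_dim)

# Made-up token IDs for "the fat cat sat on the mat"; positions 0 and 5
# share an ID because both tokens are "the".
token_ids = torch.tensor([464, 3735, 3797, 3332, 319, 464, 2603])

tok_embeds = token_embedding(token_ids)                        # shape (7, 8)
pos_embeds = position_embedding(torch.arange(len(token_ids)))  # shape (7, 8)

# The input embeddings are the element-wise sum of the two.
input_embeds = tok_embeds + pos_embeds
print(input_embeds.shape)  # torch.Size([7, 8])
```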

This doesn't actually change very much in my last post, so I've made a few updates there to reflect it. The most important difference, at least to my mind, is that the fake non-trainable attention mechanism I used -- the dot product of the input embeddings rather than of the token embeddings -- is, while still excessively basic, not quite as bad as it was. My old example was that in

the fat cat sat on the mat

...the token embeddings for the two "the"s would be identical, so they'd have super-high attention scores for each other. With the correction, the score is the dot product of the input embeddings instead, and those are no longer identical, because each "the" has a different position embedding. However, the underlying point still holds: they would be attending to each other too closely.
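
Here's roughly what that looks like in code, continuing the hedged sketch from above -- again with entirely made-up sizes and token IDs, and randomly initialised (untrained) embedding layers:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Same made-up setup as in the sketch above.
vocab_size, context_length, embed_dim = 50257, 7, 8
token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Embedding(context_length, embed_dim)

# "the fat cat sat on the mat": positions 0 and 5 are both "the",
# so they share the same (made-up) token ID.
token_ids = torch.tensor([464, 3735, 3797, 3332, 319, 464, 2603])
tok_embeds = token_embedding(token_ids)
input_embeds = tok_embeds + position_embedding(torch.arange(len(token_ids)))

# Score between the two "the"s from the raw token embeddings: the two
# vectors are identical, so this is just the squared norm of one of them.
print((tok_embeds[0] @ tok_embeds[5]).item())

# Score from the input embeddings: the differing position embeddings mean
# the vectors are no longer identical, so the score changes -- but the
# shared token component still tends to keep it high relative to other pairs.
print((input_embeds[0] @ input_embeds[5]).item())
```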

Anyway, if you're reading along, I don't think you need to go back and re-read it (unless you particularly want to!). I'm just posting this here for the record :-)