Basic matrix maths for neural networks: in practice
This is the second post in my short series of tutorials on matrix operations for neural networks, targeted at beginners, and at people who have some practical experience, but who haven't yet dug into the underlying theory. Again, if you're an experienced ML practitioner, you should skip this post -- though if you want to read it anyway, any comments or suggestions for improvements would be much appreciated!
In my last post in the series, I showed how to derive the formulae to run a neural network from the basic principles of matrix maths. I gave two formulae that are generally used in mathematical treatments of NNs -- one with a separate bias matrix:

Z = WX + B

...and one with the bias terms baked into the weights matrix, and the inputs matrix extended with a row of 1s at the bottom:

Z = W'X'
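To see that these really are the same calculation, here's a quick NumPy sketch -- the sizes (two neurons, three inputs, four examples, one example per column as in the mathematical convention) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(2, 3))   # weights: one row per neuron
B = rng.normal(size=(2, 1))   # biases: broadcast across the example columns
X = rng.normal(size=(3, 4))   # inputs: one example per column

Z1 = W @ X + B                # form 1: separate bias matrix

# Form 2: bake the biases into the weights as an extra column,
# and extend the inputs with a row of 1s at the bottom.
W_prime = np.hstack([W, B])                  # shape (2, 4)
X_prime = np.vstack([X, np.ones((1, 4))])    # shape (4, 4)
Z2 = W_prime @ X_prime

assert np.allclose(Z1, Z2)    # identical results
```

The extra row of 1s means each bias gets multiplied by exactly 1 and added in, which is why the two forms agree.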
However, I finished off by saying that in real production implementations, people normally use this instead:

Z = XW&#x1D40; + B
...which you might have seen in production PyTorch code looking like this:
Z = X @ W.T + B
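In NumPy terms (standing in for PyTorch here), the shapes work out like this -- note that X now holds one example per row rather than per column, and W uses the layout that `nn.Linear` stores its weights in:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 3, 2

X = rng.normal(size=(batch, d_in))   # one example per ROW (batch-first)
W = rng.normal(size=(d_out, d_in))   # one row per output neuron
B = rng.normal(size=(d_out,))        # broadcast across the batch

Z = X @ W.T + B
assert Z.shape == (batch, d_out)     # one row of outputs per example
```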
This post explores why that form of the equation works better in practice.
Basic matrix maths for neural networks: the theory
I thought it would be worth writing a post on how matrix multiplication is used to calculate the output of neural networks. We use matrices because they make the maths easier, and because GPUs can work with them efficiently, allowing us to do a whole bunch of calculations in a single step -- so it's really worth having a solid grounding in what the underlying operations are.
If you're an experienced ML practitioner, you should skip this post. But you might find it useful if you're a beginner -- or if, like me until I started working through this, you've coded neural networks and used matrix operations for them, but apart from working through an example or two by hand, you've never thought through the details.
In terms of maths, I'll assume that you know what a vector is, what a matrix is, and have some vague memories of matrix multiplication from your schooldays, but that's it -- everything else I will define.
In terms of neural networks, I'll assume that you are aware of their basic layout and how they work in a general sense -- but there will be diagrams for clarity and I'll define specific terms.
So, with expectations set, let's go!
On the perils of AI-first debugging -- or, why Stack Overflow still matters in 2025
"My AI hype/terror level is directly proportional to my ratio of reading news about it to actually trying to get things done with it."
This post may not age well, as AI-assisted coding is progressing at an absurd rate. But I think this is an important thing to remember right now: current LLMs can not only hallucinate, they can also misweight the evidence available to them, making mistakes when debugging that human developers would not. If you don't allow for this, you can waste quite a lot of time!
Writing an LLM from scratch, part 7 -- wrapping up non-trainable self-attention
This is the seventh post in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting or needed to think hard about, as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too.
This post is a quick one, covering just section 3.3.2, "Computing attention weights for all input tokens". I'm covering it in a post on its own because it gets things in place for what feels like the hardest part to grasp at an intuitive level -- how we actually design a system that can learn how to generate attention weights, which is the subject of the next section, 3.4. My linear algebra is super-rusty, and while going through this one, I needed to relearn some stuff that I think I must have forgotten sometime late last century...
Writing an LLM from scratch, part 6b -- a correction
This is a correction to the sixth in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
I realised while writing the next part that I'd made a mistake -- while trying to get an intuitive understanding of attention mechanisms, I'd forgotten an important point from the end of my third post. When we convert our tokens into embeddings, we generate two for each one:
- A token embedding that represents the meaning of the token in isolation
- A position embedding that represents where it is in the input sequence.
These two are added element-wise to get an input embedding, which is what is fed into the attention mechanism. However, in my last post I'd forgotten completely about the position embedding and had been talking entirely in terms of token embeddings.
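As a toy sketch of that element-wise combination (the sizes and values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, context_len, d = 10, 8, 4

token_embedding = rng.normal(size=(vocab_size, d))      # one vector per token ID
position_embedding = rng.normal(size=(context_len, d))  # one vector per position

token_ids = np.array([3, 7, 3])   # token 3 appears twice

# Element-wise addition gives the input embeddings fed into attention:
input_embeddings = token_embedding[token_ids] + position_embedding[:len(token_ids)]

# The two occurrences of token 3 now differ, because their positions differ:
assert not np.allclose(input_embeddings[0], input_embeddings[2])
```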
Surprisingly, though, this doesn't actually change very much in that post -- so I've made a few updates there to reflect the change. The most important difference, at least to my mind, is that the fake non-trainable attention mechanism used -- the dot product of the input embeddings -- while still excessively basic, is not quite as bad as it was. My old example was that in
the fat cat sat on the mat
...the token embeddings for the two "the"s would be identical, so they'd have super-high attention scores for each other. Using the dot product of the input embeddings instead, they're no longer identical, because they have different position embeddings. However, the underlying point still holds: they would be attending too closely to each other.
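Here's a small NumPy sketch of that situation, with made-up toy embeddings, showing that the position embeddings break the tie between the two "the"s:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
words = "the fat cat sat on the mat".split()

# Identical token embeddings for repeated words...
vocab = {w: rng.normal(size=d) for w in sorted(set(words))}
token_embs = np.stack([vocab[w] for w in words])
# ...plus a distinct position embedding for each slot:
input_embs = token_embs + rng.normal(size=(len(words), d))

# Non-trainable attention scores: every pairwise dot product
scores = input_embs @ input_embs.T

# The two "the"s (positions 0 and 5) are no longer identical vectors
assert not np.allclose(input_embs[0], input_embs[5])
```

With token embeddings alone, rows 0 and 5 of `scores` would be identical; with the position embeddings added, they differ, even though the two "the"s can still score highly against each other.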
Anyway, if you're reading along, I don't think you need to go back and re-read it (unless you particularly want to!). I'm just posting this here for the record :-)
Writing an LLM from scratch, part 6 -- starting to code self-attention
This is the sixth in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too. This post covers just one subsection of the trickiest chapter in the book -- subsection 3.3.1, "A simple self-attention mechanism without trainable weights". I feel that there's enough in there to make up a post on its own. For me, it certainly gave me one key intuition that I think is a critical part of how everything fits together.
As always, there may be errors in my understanding below -- I've cross-checked and run the whole post through Claude, ChatGPT o1, and DeepSeek r1, so I'm reasonably confident, but caveat lector :-) With all that said, let's go!
Do reasoning LLMs need their own Philosophical Language?
A few days ago, I saw a cluster of tweets about OpenAI's o1 randomly switching to Chinese while reasoning -- here's a good example. I think I've seen it switch languages a few times as well. Thinking about it, Chinese -- or any other language written in a non-Latin alphabet -- would be particularly noticeable, because those notes describing what it's thinking about flash by pretty quickly, and you're only really likely to notice something weird if it's immediately visibly different to what you expect. So perhaps it's spending a lot of its time switching from language to language depending on what it's thinking about, and then it translates back to the language of the conversation for the final output.
Why would it do that? Presumably certain topics are covered better in its training set in specific languages -- it will have more on Chinese history in Chinese, Russian history in Russian, and so on. But equally possibly, some languages are easier for it to reason about certain topics in. Tiezhen Wang, a bilingual AI developer, tweeted that he preferred doing maths in Chinese "because each digit is just one syllable, which makes calculations crisp and efficient". Perhaps there's something similar there for LLMs.
That got me thinking about the 17th-century idea of a Philosophical Language. If you've read Neal Stephenson's Baroque Cycle books, you'll maybe remember it from there -- that's certainly where I heard about it. The idea was that natural human languages were not very good for reasoning about things, and the solution would be to create an ideal, consciously-designed language that was more rational. Then philosophers (or scientists as we'd say these days) could work in it and get better results.
There are echoes of that in E' (E-Prime), another one I picked up from fiction (this time from The Illuminatus! Trilogy). It's English without the verb "to be", the idea being that most uses of the word are unnecessarily foggy and would be better replaced. "Mary is a doctor" implies that her job is the important thing about her, whereas "Mary practices medicine" is specific that it's just one aspect of her. What I like about it is that it -- in theory -- gets you a more "Philosophical" language with a really small tweak rather than a complete redesign.
What I'm wondering is, are human languages really the right way for LLMs to be reasoning if we want accurate results quickly? We all know how easy it is to be bamboozled by words, either our own or other people's. Is there some way we could construct a language that would be better?
The baroque philosophers ultimately failed, and modern scientists tend to switch to mathematics when they need to be precise ("physics is a system for translating the Universe into maths so that you can reason about it" -- discuss).
But perhaps by watching which languages o1 is choosing for different kinds of reasoning we could identify pre-existing (grammatical/morphological/etc) structures that just seem to work better for different kinds of tasks, and then use that as a framework to build something on top of. That feels like something that could be done much more easily now than it could in the pre-LLM world.
Or maybe a reasoning language is something that could be learned as part of a training process; perhaps each LLM could develop its own, after pre-training with human languages to get it to understand the underlying concept of "language". Then it might better mirror how LLMs work -- its structures might map more directly to the way transformers process information. It might have ways of representing things that you literally could not describe in human languages.
Think of it as a machine code for LLMs, perhaps. Is it a dumb idea? As always, comments are open :-)
Writing an LLM from scratch, part 5 -- more on self-attention
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it. In retrospect, it was kind of adorable that I thought I could get it all done over my Christmas break, given that I managed just the first two-and-a-half chapters! However, now that the start-of-year stuff is out of the way at work, hopefully I can continue. And at least the two-week break since my last post in this series has given things some time to stew.
In the last post I was reading about attention mechanisms and how they work, and was a little thrown by the move from attention to self-attention; in this blog post I hope to get that all fully sorted so that I can move on to the rest of chapter 3, and then the rest of the book. Raschka himself said on X that this chapter "might be the most technical one (like building the engine of a car) but it gets easier from here!" That's reassuring, and hopefully it means that my blog posts will speed up too once I'm done with it.
But first: on to attention and what it means in the LLM sense.
An AI chatroom (a few steps further)
Still playing hooky from "Build a Large Language Model (from Scratch)" -- I was on our support rota today and felt a little drained afterwards, so decided to finish off my AI chatroom. The codebase is now in a state where I'm reasonably happy with it -- it's not production-grade code by any stretch of the imagination, but the structure is acceptable, and it has the basic functionality I wanted:
- A configurable set of AIs
- Compatibility with the OpenAI API (for OpenAI itself, Grok and DeepSeek) and with Anthropic's (for Claude).
- Persistent history so that you can start a chat and have it survive a restart of the bot.
- Pretty reasonable behaviour from the AIs, with each building on what the others say.
An AI chatroom (beginnings)
So, I know that I decided I would follow a "no side quests" rule while reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", but rules are made to be broken.
I've started building a simple Telegram bot that can be used to chat with multiple AI models at the same time, the goal being to allow them to have limited interaction with each other. I'm not sure if it's going to work well, and it's very much a work-in-progress -- but here's the repo.
More info below the fold.