Basic matrix maths for neural networks: in practice
This is the second post in my short series of tutorials on matrix operations for neural networks, targeted at beginners, and at people who have some practical experience, but who haven't yet dug into the underlying theory. Again, if you're an experienced ML practitioner, you should skip this post -- though if you want to read it anyway, any comments or suggestions for improvements would be much appreciated!
In my last post in the series, I showed how to derive the formulae to run a neural network from the basic principles of matrix maths. I gave two formulae that are generally used in mathematical treatments of NNs -- one with a separate bias matrix:

$$Z = WX + B$$
...and one with the bias terms baked into the weights matrix as an extra column (giving $W'$), and the inputs matrix extended with a row of $1$s at the bottom (giving $X'$):

$$Z = W'X'$$
However, I finished off by saying that in real production implementations, people normally use this instead:

$$Z = XW^T + B$$
...which you might have seen in production PyTorch code looking like this:
Z = X @ W.T + B
This post explores why that form of the equation works better in practice.
There are three important things about this form:
- The bias is added in separately, like the first one of the two equations from last time. Why do we do that rather than the other (slightly simpler-looking) equation without the extra addition, where the biases are baked into the weights matrix?
- The order of the two matrices in the multiplication is reversed -- it's inputs times weights rather than weights times inputs. Why?
- We are transposing the weights matrix; why would we do that rather than storing it in a form where it's already "pre-transposed"?
I'll cover each of these separately.
The bias term
Let's recap the two ways we were handling bias terms in the last post.
The first option was to add it on at the end, kind of like the practical solution does. We used a matrix multiplication to work out the weighted sums of the inputs for all neurons, for all items in a batch (see the previous post if those terms are unfamiliar):

$$WX$$
The output has one row per neuron -- that is, the first row holds the weighted sums for neuron one -- and one column for each item in the batch -- so with our batch size of two, we have two columns.
Then we constructed a matrix with the same shape as $WX$ using the biases. Every column would be the same, because the biases are the same across the batch. Within each column, the values would be the biases for each neuron, in the same order as they appear in $W$. We'd then add this bias matrix $B$ to $WX$:

$$Z = WX + B$$
That gave us our pre-activation values for the batch, ready to be run through the activation function and become outputs.
My objection to this form was largely stylistic -- it just felt "wrong" in some sense to have to adapt the size of the bias matrix based on the number of items in the batch. But when you think about it in practical terms, it's actually even worse. The example above is a small, toy matrix. In reality, you would have many more neurons, so the bias matrix could be quite large -- which means, given that all of the columns are duplicated, you're allocating lots of memory and copying stuff around just so that you can add the numbers together. This sounds -- and would be -- wasteful. So why do we use something like it in practice? There's a good reason, but let's look at the issues with the other way of handling the bias terms first.
The second way to handle the bias terms was essentially to treat them like weights that were multiplied by a dummy input that was always set to $1$. Here's an example (with a batch size of one, just for simplicity's sake):

$$Z = \begin{bmatrix} w_{11} & w_{12} & b_1 \\ w_{21} & w_{22} & b_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix} = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 + b_1 \\ w_{21} x_1 + w_{22} x_2 + b_2 \end{bmatrix}$$
You can see that if we had a larger batch size, where each item in the batch was a column, we'd just add on a row of $1$s at the bottom.
So why not use that in practice? Well, the "just" in that last sentence is misleading. Conceptually it's easy to add a row to a matrix, but in reality, reshaping a matrix can be relatively hard work. If you've done low-level programming in C, for example, think about what would be involved in resizing an array. If you've allocated enough memory for its current size, to resize it you'll need to allocate enough for the new size and then copy everything across. That's not a cheap operation! We can also add onto that the fact that we're doing a whole bunch of multiplications by one as part of our matrix multiplication, which is pretty wasteful.
So there are problems with both solutions. The first copies data around unnecessarily to build a matrix with duplicate data, and the second resizes a matrix, which is expensive, then does a load of unnecessary multiplications by $1$.
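To make that concrete, here's a small PyTorch sketch of both approaches -- the sizes and values are made up purely for illustration. They give the same answer, but each wastes work in the way described above:

```python
import torch

# Made-up toy sizes: 3 neurons, 2 inputs, batch of 2.
W = torch.randn(3, 2)   # one row per neuron, one column per input
X = torch.randn(2, 2)   # one column per batch item (last post's layout)
B = torch.randn(3, 1)   # one bias per neuron

# Option 1: build a full-size bias matrix by duplicating the column
# across the batch -- extra memory and copying, just to do an addition.
B_full = B.repeat(1, X.shape[1])          # shape (3, 2), identical columns
Z1 = W @ X + B_full

# Option 2: bake the biases into the weights and append a row of ones
# to the inputs -- torch.cat allocates new tensors and copies everything.
W_aug = torch.cat([W, B], dim=1)                          # shape (3, 3)
X_aug = torch.cat([X, torch.ones(1, X.shape[1])], dim=0)  # shape (3, 2)
Z2 = W_aug @ X_aug

print(torch.allclose(Z1, Z2))  # True -- same result either way
```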
The solution, and the reason we can actually use something like the first version for a practical implementation, is called broadcasting.
Let's think about matrix addition in normal mathematical terms. It's very simple: you just add together the corresponding elements in the two matrices you're adding:

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}$$
As you might expect, the matrices both need to be exactly the same size (same number of rows, same number of columns) if you're going to do that.
PyTorch and other ML frameworks extend the idea of addition (and, I believe, other element-wise operations). If you add two matrices of the same size, they just do the normal mathematical addition. But, if one of them has just one column (or just one row) and matches the other matrix exactly in the other dimension, then it "broadcasts" the smaller matrix across the larger one. The end result is just like the matrix with the duplicated columns, but without the unnecessary duplication in memory. Let's modify the earlier example to make it concrete -- we keep the bias as a single column $B$ and add it directly to the product $WX$:

$$Z = WX + B$$

That equation is not valid in mathematical terms, due to the matrix size mismatch -- $WX$ has one column per item in the batch, while $B$ has only one -- but as a PyTorch operation it makes sense. While the actual implementation is probably highly optimised and clever, I think it's safe to mentally model this as a kind of implicit for loop. 1
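Here's what that looks like in PyTorch, with some made-up numbers for a three-neuron layer and a batch of two. The explicit loop at the end is just the mental model, not how PyTorch actually implements it:

```python
import torch

WX = torch.tensor([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])   # 3 neurons, batch of 2
B = torch.tensor([[10.0],
                  [20.0],
                  [30.0]])        # one bias per neuron, a single column

# Broadcasting: B is "stretched" across the two columns without being copied.
Z = WX + B
print(Z)
# tensor([[11., 12.],
#         [23., 24.],
#         [35., 36.]])

# Roughly equivalent to an implicit for loop over the batch columns:
Z_loop = WX.clone()
for col in range(WX.shape[1]):
    Z_loop[:, col] += B[:, 0]
print(torch.equal(Z, Z_loop))  # True
```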
So PyTorch and similar frameworks support broadcasting, which means we can add on the bias terms without the duplication required by the first of the two solutions from last time, and without the resizing and the pointless multiplications by $1$ required by the second.
So that explains why we add in the biases separately in the real-world implementation. How about the reversing of the terms in the matrix multiplication, and that transposition?
Matrix ordering
Let's consider the order of the terms first, and leave the transposition for the next section. In the last post, the equation I gave was

$$Z = WX + B$$

...but our practical one is this:

$$Z = XW^T + B$$
One question that you might ask is: why did I put the weights first and write $WX$, rather than putting the inputs first? After all, while we can do the matrix multiplication like this:

$$Z = WX$$

...if we transpose both matrices and reverse the order, we get this:

$$Z^T = X^T W^T$$

It works equally well from a mathematical viewpoint -- instead of multiplying a neurons × inputs matrix by an inputs × batch-size matrix and getting a neurons × batch-size result, we're multiplying a batch-size × inputs one by an inputs × neurons one, and getting a batch-size × neurons result.
The reason I put the weights first in my previous post is just because it's the convention in the more mathematical writing I've seen about neural networks. The early papers apparently used formulae for calculating a single neuron's value that looked like this:

$$z = w_1 x_1 + w_2 x_2 + \dots + b$$
The weights came first, and that became the tradition -- and I put it in that order essentially because I was following that. So now the question becomes, why do practical implementations break with the tradition?
It's ultimately due to a different historical choice. Matrices in PyTorch and other numerical frameworks are stored in what is called row-major format. Let's dive into that. A computer's memory is linear -- integer memory addresses locate bytes. So to store a matrix, you could have a sequence of bytes representing one row, then the bytes for the next row, and so on, and that's called row-major storage. Alternatively, you could store one column, then the next, and so on, making it column-major.
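You can poke at this in PyTorch. Here's a quick sketch -- it just inspects one small tensor, so treat it as an illustration rather than a proof:

```python
import torch

M = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# PyTorch stores this row-major: the underlying 1-d storage is the first
# row followed by the second row.
print(M.flatten())   # tensor([1, 2, 3, 4, 5, 6])
print(M.stride())    # (3, 1): step 3 elements to move down a row,
                     # 1 element to move along a column
```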
Now, if we have data stored in row-major format and we want to add a new column onto the end, we obviously have to do a bunch of shuffling around -- we need to make space at the end of the first row for the new data by moving all of the following rows up, then the same for the second row, then the same for the third row, and so on. If you're storing data in a column-major format, you have the same problem if you're adding on a new row. By contrast, adding a new row onto a row-major matrix, or a new column onto a column-major one is relatively cheap. 2
There are similar issues if you want to partition or split matrices -- it's easy to create an object that is, say, the first few rows of a row-major matrix, or the first few columns of a column-major one, as in each case it's a contiguous block of memory. But it's much harder to do that for columns in row-major or rows in column-major, because you have to represent a series of "chunks" of memory somehow. 3
So, because the numerical frameworks are storing our matrices in a row-major format, it's relatively easy to add on new rows or get ranges of rows, and hard to add on new columns or get ranges of them. That makes it much better to store the items that make up our training set one per row, using the columns for the different inputs in each item. During a given run of a program, we're unlikely to be adding on new extra inputs, but we are likely to be building up a list of items to run through, splitting them up into batches, and so on. Row-major matrices mean that having one batch item per row is an efficient way to do that.
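Here's a small sketch of what that looks like in practice, with a made-up dataset of ten items of four inputs each:

```python
import torch

# Hypothetical dataset: 10 items, each with 4 input values.
items = [torch.randn(4) for _ in range(10)]

# Stack them one item per row, then slice off batches. Both operations
# play nicely with row-major storage: each batch is a contiguous block
# of rows, so slicing doesn't need to copy anything.
data = torch.stack(items)         # shape (10, 4)
batch_size = 2
first_batch = data[0:batch_size]  # shape (2, 4), just a view
print(first_batch.shape)          # torch.Size([2, 4])
```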
Now, of course, we could build up our data in a one-item-per-row format, split it up into batches, and then transpose those batches before feeding them into the neural network, converting back at the end -- that would allow us to keep the traditional format for the calculation while playing to the frameworks' strengths in our pre- and post-processing data pipeline. But given that the traditional ordering was relatively arbitrary, that would be kind of pointless.
So, one row per batch item it is, which means that $X$ is a $b \times i$ matrix, where $b$ is the batch size and $i$ is the number of inputs to the neural network. Now, the number of columns in the first matrix in a multiplication must match the number of rows in the second. The one dimension that $X$ and our weights matrix have in common is $i$, the number of inputs, so that must be the matching "inner" number between them. And because $i$ is the number of columns in $X$, that means that $X$ must be the first term in the matrix multiplication.
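Here's that dimension argument as a quick PyTorch shape check -- the sizes are made up, and the shapes are the only point:

```python
import torch

b, i, n = 2, 3, 4      # batch size, number of inputs, number of neurons (made up)
X = torch.randn(b, i)  # one row per batch item
W = torch.randn(n, i)  # one row per neuron (the layout discussed in the next section)

print((X @ W.T).shape)  # torch.Size([2, 4]): one row per batch item, one column per neuron

# Putting the weights first doesn't work with this layout: W @ X would be a
# (4, 3) @ (2, 3) multiplication, and the inner dimensions (3 and 2) don't match.
```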
Transposing the weights matrix
Given that, the second term in the multiplication needs to be $i \times n$, where $n$ is the number of neurons.
But the weights term in the equation we're looking at is like this:

$$W^T$$

What that means is that for some reason the weights matrix is kept in an $n \times i$ format, and we're transposing it to fit at "runtime". That is, it has one row per neuron, with each column representing the weight that that neuron applies to a particular input. Then, in the evaluation of the layer itself, we transpose it so that it is compatible with our one-row-per-batch-item inputs matrix.
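Incidentally, this is exactly the layout PyTorch's own nn.Linear layer uses: its weight is stored with shape (out_features, in_features) -- one row per neuron -- and its forward pass computes the X @ W.T + B form we're discussing:

```python
import torch

layer = torch.nn.Linear(in_features=3, out_features=4)

print(layer.weight.shape)  # torch.Size([4, 3]) -- one row per neuron
print(layer.bias.shape)    # torch.Size([4])

# The forward pass is effectively X @ W.T + B:
X = torch.randn(2, 3)
manual = X @ layer.weight.T + layer.bias
print(torch.allclose(layer(X), manual))  # True
```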
Why is that? This is something where the results of my research have been... unsatisfying. For example, in this Reddit thread, the two most-upvoted comments are basically saying that it's to make the matrix multiplication work -- which to me seems to be missing the point. The question is not, given that we have a weights matrix that is $n \times i$, why do we have to transpose it to multiply it by $X$ -- that's an obvious consequence of the mathematical rules. The question is, why are we not representing it as $i \times n$ in the first place? 4
I asked various AIs about it, and got some answers about optimisations, but on digging down, they seemed to collapse to "PyTorch etc are optimised this way so it's better to do it this way", which is all very well, but doesn't explain why they were optimised that way, when if the weights matrix were the other way around, they could have been optimised for that instead.
Another point that came up is that matrix transposition in modern numerical frameworks is cheap. This is useful to know -- we're not losing much performance by having that transposition in there -- but we're still losing something, and it would be good to know why!
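You can see why it's cheap: in PyTorch, .T doesn't copy any data at all, it just returns a view onto the same storage with the strides swapped:

```python
import torch

W = torch.randn(4, 3)
Wt = W.T

# Same underlying storage, no copying -- just a different way of indexing it.
print(W.data_ptr() == Wt.data_ptr())  # True
print(W.stride(), Wt.stride())        # (3, 1) (1, 3)
```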
Eventually, I got what I think is a solid answer -- at least, a solid one for our beginner's level of understanding -- from Claude (and I should note that Grok 3 came up with pretty much the same answer). Essentially: it's easier to understand.
When we think about neural networks, we tend to think in terms of the neurons. Let's look at the diagram from the last post:
The neurons are large circles in the middle of the diagram, and the weights, inputs and outputs are much less prominent. That fits well with how we think of this kind of thing.
Now, if we were preparing a tabular description of a set of things, we would quite naturally put the things we're describing in rows and aspects of these things in columns. Imagine preparing a spreadsheet of employees in a company -- you'd have a row per person, with columns for name, role, salary, and so on. Or think about how SQL databases work -- each row is a record, and the columns are the values in that record. 5
Now with the weights matrix, we're conceptually specifying something about the neurons -- for each neuron, we have the weights that it applies to each input. Turning it around and saying that the weights matrix specifies, for each input, what weight it's multiplied by on the way to a particular neuron, feels -- at least to me -- less natural.
So on that basis, a row per neuron makes some kind of sense. In particular, if we were to add or remove neurons, conceptually it might be simpler to remove their rows from the matrix than to remove their columns.
(It's worth noting that the same logic applies pretty well to the input matrix -- each item in a batch being a row has the same conceptual "neatness" in my mind.)
However, I must admit, I'm not entirely happy with it as an explanation -- but right now, it's the best I have. If you know of a better one, please do leave a comment!
Putting it all together
So, we started with a question: in real-world neural network code, why do we use

$$Z = XW^T + B$$

...rather than:

$$Z = WX + B$$

or

$$Z = W'X'$$

...?
We have a good answer for why we add on the bias rather than trying to bake it into the weights matrix -- broadcasting operations in ML frameworks make this the most efficient way to do things.
We have a reasonable answer for why we swapped around $X$ and $W$ -- having one row per batch item is the best way to store and manipulate our data outside the neural network, and so keeping it in that format during the calculations keeps things consistent, making it easier to understand what's going on -- and if we're doing it that way, the rules of matrix multiplication force us to put the data matrix first.
And finally, we have a less-satisfying answer for the last question, why we transpose the weights matrix $W$ during the calculation of the neural network rather than keeping it in a "pre-transposed" form all along -- the untransposed form is more "natural" and easier to reason about.
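As a final recap, here's a minimal sketch of a single layer's forward pass using everything above. The sizes are made up and layer_forward is just an illustrative helper, not code from any real library:

```python
import torch

def layer_forward(X, W, B):
    # One layer's pre-activations: X is (batch, inputs), W is (neurons, inputs),
    # B is (neurons,). Returns (batch, neurons).
    return X @ W.T + B  # broadcasting handles the bias

# Made-up sizes: batch of 2, 3 inputs, 4 neurons.
X = torch.randn(2, 3)
W = torch.randn(4, 3)
B = torch.randn(4)

Z = layer_forward(X, W, B)
print(Z.shape)  # torch.Size([2, 4])
```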
Thanks for reading, and any thoughts are welcome!
1. I'm focusing on the ability to broadcast a "1-dimensional" matrix across a 2-dimensional one here, but broadcasting extends to higher dimensions too -- you can broadcast an $n \times m$ matrix across a $k \times n \times m$ 3d tensor (which is fairly easy to visualise), and it extends to even higher (and impossible to visualise) dimensions. Pretty nifty. ↩
2. Adding a new row onto a row-major matrix can (as I said earlier when discussing adding a row of $1$s to the end of the input matrix) involve reallocating memory and copying stuff around. However, compared to adding on a new column, it's a much cheaper operation -- just one reallocation and a copy rather than potentially many. ↩
3. And this is before you get on to questions of caching, where there is a big advantage in keeping the set of data you're working with close together in memory so that it can be loaded into the cache in one go. ↩
4. There are some comments further down the thread that hint that perhaps there are better reasons, which as best I can tell are related to this use of matrix multiplication being an instance of a larger class of linear algebra problems, where the $n \times i$ representation makes more sense in the larger scheme of things. Unfortunately my maths is not yet good enough to understand whether this is correct, or merely a larger-scale version of the same explanation. ↩
5. I imagine this is culturally dependent -- I wonder whether classical Chinese neural networks would be the other way around? Classical Chinese was, after all, written top-to-bottom, right-to-left. ↩