Basic matrix maths for neural networks
I thought it would be worth writing a post on how matrix multiplication is used to calculate the output of neural networks. We use matrices because they make the maths easier, and because GPUs can work with them efficiently, allowing us to do a whole bunch of calculations in a single step -- so it's really worth having a solid grounding in what the underlying operations are.
If you're an experienced ML practitioner, you should skip this post. But you might find it useful if you're a beginner -- or if, like me until I started working through this, you've coded neural networks and used matrix operations for them, but apart from working through an example or two by hand, you've never thought through the details.
In terms of maths, I'll assume that you know what a vector is, what a matrix is, and have some vague memories of matrix multiplication from your schooldays, but that's it -- everything else I will define.
In terms of neural networks, I'll assume that you are aware of their basic layout and how they work in a general sense -- but there will be diagrams for clarity and I'll define specific terms.
So, with expectations set, let's go!
Basic matrix maths
Let's start with some definitions. The dot product is an operation that works on two vectors of the same length. It simply means that you multiply the corresponding elements, then add up the results of those multiplications:

$$\vec a \cdot \vec b = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$

Or, more concretely:

$$(1, 2, 3) \cdot (4, 5, 6) = 1 \times 4 + 2 \times 5 + 3 \times 6 = 4 + 10 + 18 = 32$$
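If you'd like to check that on a machine, here's a minimal NumPy sketch of the same calculation, using the numbers from the example above:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Multiply the corresponding elements, then add up the results...
manual = (a * b).sum()

# ...which is exactly what np.dot computes.
assert manual == np.dot(a, b) == 32
```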
Simple enough. Now let's define matrix multiplication. If you have a matrix $A$ which has $m$ rows and $n$ columns -- that is, it's an $m \times n$ matrix -- you can multiply it by any other matrix $B$, so long as $B$ has $n$ rows -- that is, it's $n \times p$, where $p$ can be anything. To put it another way, the number of columns in the first matrix in a matrix multiplication has to equal the number of rows in the second one.1
The result of the matrix multiplication is a new matrix, with $m$ rows and $p$ columns -- that is, it has the same number of rows as the first one, and the same number of columns as the second one.
How do we fill in the values in that result matrix? Let's start with a formal definition, and then move on to an example. Formally, if $C = AB$, the value of $c_{i,j}$ -- that is, the element at row $i$, column $j$ in the output matrix -- is the dot product of row $i$ in the first matrix, taken as a vector, with column $j$ in the second matrix, also considered as a vector.2
This example should help make it a bit clearer:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix} \begin{pmatrix} 7 & 8 & 9 \\ 10 & 11 & 12 \end{pmatrix} = \begin{pmatrix} 27 & 30 & 33 \\ 61 & 68 & 75 \\ 95 & 106 & 117 \end{pmatrix}$$

Looking at the element at $(2, 2)$ in the result -- second row, second column, right in the middle -- you can see that it's the dot product of the second row in the first matrix and the second column in the second:

$$(3, 4) \cdot (8, 11) = 3 \times 8 + 4 \times 11 = 24 + 44 = 68$$
There are lots of different ways of visualising this process -- I rather like this animation. But the formal definition using the dot product is the important one, at least for our purposes in this post.
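You can also check the example above with NumPy, whose @ operator implements exactly this definition of matrix multiplication:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])        # 3x2
B = np.array([[7, 8, 9],
              [10, 11, 12]])  # 2x3

C = A @ B  # 3x3 result
print(C)

# The middle element is the dot product of A's second row and B's
# second column, as per the formal definition.
assert C[1, 1] == np.dot(A[1, :], B[:, 1]) == 68
```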
Now, let's also note one extra thing that will become important later, a minimal case of matrix multiplication: if we take a $1 \times n$ matrix and an $n \times 1$ matrix, and multiply them, we get a $1 \times 1$ matrix that contains the dot product of the two original matrices regarded as vectors:

$$\begin{pmatrix} 1 & 2 & 3 \end{pmatrix} \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 1 \times 4 + 2 \times 5 + 3 \times 6 \end{pmatrix} = \begin{pmatrix} 32 \end{pmatrix}$$
Hopefully that's pretty obvious as a consequence of the definitions above. So on that note, let's move on to neural networks, and how to use these operations to calculate their results using matrices.
Neural networks
Here's a simple one-layer neural network:
On the left, we have three inputs, $x_1$, $x_2$, and $x_3$. They feed into two neurons, $n_1$ and $n_2$; each input goes to both neurons. On their way to the neurons, the inputs are multiplied by weights. The weight in question is written next to the connecting arrow -- for example, on its way to neuron $n_1$, input $x_2$ is multiplied by the weight $w_{1,2}$. You can see that each weight has a subscript which is the neuron number, then the input it relates to.
Each of the neurons adds together all of the weighted input values that it has received, adds on a bias ($b_1$ for $n_1$, $b_2$ for $n_2$), runs the resulting number through an activation function, and that provides the output for that neuron -- $o_1$ for $n_1$, $o_2$ for $n_2$.
For this post, I'm going to disregard the activation function, because that doesn't have any effect on what we use matrices for -- it's just a simple function taking one number and returning another, and is applied to the output value after all of the matrix-y stuff. I'm also going to put aside the bias for now, though we'll come back to that later.
Matrix maths for one neuron
So let's take a look at what is happening at neuron $n_1$. It's receiving the inputs, each one multiplied by its associated weight, and adding those numbers up -- this is sometimes called the weighted sum or the net input for that neuron. Now, multiplying things together and adding them up is a dot product! So let's write it out like that:

$$(w_{1,1}, w_{1,2}, w_{1,3}) \cdot (x_1, x_2, x_3) = w_{1,1} x_1 + w_{1,2} x_2 + w_{1,3} x_3$$

Let's convert that to one of those minimal matrix multiplications that we did above:

$$\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} w_{1,1} x_1 + w_{1,2} x_2 + w_{1,3} x_3 \end{pmatrix}$$

So we can see that this simple matrix calculation has worked out our weighted sum for $n_1$.
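Here's that minimal multiplication as a NumPy sketch -- the weights and inputs are just made-up numbers for illustration:

```python
import numpy as np

w1 = np.array([[0.2, -0.5, 0.1]])  # 1x3: hypothetical weights for neuron n1
x = np.array([[3.0],
              [1.0],
              [2.0]])              # 3x1: hypothetical inputs

z1 = w1 @ x  # 1x1 matrix holding n1's weighted sum
assert np.allclose(z1, 0.3)  # 0.2*3.0 + -0.5*1.0 + 0.1*2.0
```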
Multiple neurons
Now that we have it in matrix form, we can expand it. Remember that an $m \times n$ matrix multiplied by an $n \times p$ one produces an $m \times p$ one, where each element is the dot product of the matching row in the first matrix taken as a vector and the matching column in the second, likewise as a vector. So if we add a second row to the first matrix in the above, we'll get a second row in our result:

$$\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} w_{1,1} x_1 + w_{1,2} x_2 + w_{1,3} x_3 \\ w_{2,1} x_1 + w_{2,2} x_2 + w_{2,3} x_3 \end{pmatrix}$$

That's pretty cool! By adding a second row to the first matrix containing the weights for the second neuron, we've been able to calculate the weighted sums for both $n_1$ and $n_2$ in a single matrix multiplication.
You can see that if we had more neurons, we could just tack their weights on as extra rows in the first matrix. If we had more inputs, we would just add on extra rows for the values in the second matrix -- and, of course, we'd need weights for them, which would go into extra columns in the first matrix.
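In code, growing the network really is just a matter of growing the matrices. Here's a sketch of the two-neuron version, reusing the hypothetical numbers from above plus a made-up second row of weights for $n_2$:

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],    # hypothetical weights for n1
              [0.7, 0.3, -0.4]])   # hypothetical weights for n2
x = np.array([[3.0],
              [1.0],
              [2.0]])

z = W @ x  # 2x1 column: the weighted sums for n1 and n2
assert np.allclose(z, [[0.3],
                       [1.6]])
```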
Another nifty thing that might not be immediately obvious is that the result matrix has the outputs stacked up as a column, just like the inputs were. If we added the bias and applied the activation function to those, we'd have the outputs in exactly the same format as the inputs. That's useful, because if we had a multi-layer neural network, where the output of the first layer was the input for the second, and so on, then we could just do the calculations for the first and then feed the results directly into the calculations for the second.
But obviously we need to sort out the bias and activation functions first.
Batches
Before we do that, though: we've seen what we can do by adding on rows to the first matrix so that it changes from being the weights for just one neuron to being the weights for all of the neurons in the layer. What happens if we add on extra columns to the inputs matrix? We get the ability to do batches -- that is, to run a bunch of separate sets of inputs through the neural network in one go:

$$\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} x_{1,a} & x_{1,b} \\ x_{2,a} & x_{2,b} \\ x_{3,a} & x_{3,b} \end{pmatrix} = \begin{pmatrix} w_{1,1} x_{1,a} + w_{1,2} x_{2,a} + w_{1,3} x_{3,a} & w_{1,1} x_{1,b} + w_{1,2} x_{2,b} + w_{1,3} x_{3,b} \\ w_{2,1} x_{1,a} + w_{2,2} x_{2,a} + w_{2,3} x_{3,a} & w_{2,1} x_{1,b} + w_{2,2} x_{2,b} + w_{2,3} x_{3,b} \end{pmatrix}$$
That's getting a bit busy, but hopefully you can see that by providing two sets of inputs, one column for each, in our second matrix, we have -- with a single matrix multiplication -- evaluated the neural network for both inputs, and got our weighted sums.
Let's simplify the whole thing. We can call our matrix of all of the weights -- one row per neuron -- $W$, the matrix of our inputs -- one column per item in our batch -- $X$, and this intermediate weighted sum output we can call $Z$. The whole operation of this part of the evaluation of the neural network is just this:

$$Z = WX$$
That's it -- the bulk of the calculations for a neural network layer of arbitrary size, including batch processing, wrapped up in one simple equation!
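Here's $Z = WX$ as a NumPy sketch, with a hypothetical batch of two inputs:

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.4]])  # 2x3: one row of weights per neuron
X = np.array([[3.0, 1.0],
              [1.0, 0.0],
              [2.0, 0.0]])        # 3x2: one column per item in the batch

Z = W @ X  # 2x2: one column of weighted sums per item in the batch
assert np.allclose(Z, [[0.3, 0.2],
                       [1.6, 0.7]])
```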
But we still have that bias term. What can we do with that?
Bias: version 1
Well, the bias values are just a set of numbers that we need to add on to the weighted sums that we've already calculated, so we could just do that directly. Let's take the last example, but replace the worked-out dot products with $z$s for sanity's sake:

$$\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} x_{1,a} & x_{1,b} \\ x_{2,a} & x_{2,b} \\ x_{3,a} & x_{3,b} \end{pmatrix} = \begin{pmatrix} z_{1,a} & z_{1,b} \\ z_{2,a} & z_{2,b} \end{pmatrix}$$

That is, the weighted sums for our $a$ input are $z_{1,a}$ and $z_{2,a}$, and the weighted sums for the $b$ input are $z_{1,b}$ and $z_{2,b}$.
We can create a matrix $B$ of the same size as the outputs containing the bias terms as columns, repeated once for each example in the batch, and add that to our $Z$ matrix:

$$\begin{pmatrix} z_{1,a} & z_{1,b} \\ z_{2,a} & z_{2,b} \end{pmatrix} + \begin{pmatrix} b_1 & b_1 \\ b_2 & b_2 \end{pmatrix} = \begin{pmatrix} z_{1,a} + b_1 & z_{1,b} + b_1 \\ z_{2,a} + b_2 & z_{2,b} + b_2 \end{pmatrix}$$

...and that gives us our pre-activation values -- that is, the weighted sums plus the biases -- ready to be run through the activation function. We can write out the whole thing like this (using $\hat{Z}$ to represent the pre-activation values):

$$\hat{Z} = WX + B$$
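A sketch of this version in NumPy, continuing with the hypothetical numbers from before -- note how $B$ has to be built to match the batch size:

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.4]])
X = np.array([[3.0, 1.0],
              [1.0, 0.0],
              [2.0, 0.0]])
b = np.array([[0.5],
              [-1.0]])  # one (made-up) bias per neuron, as a column

# Repeat the bias column once per item in the batch, so that B has the
# same shape as W @ X.
B = np.tile(b, (1, X.shape[1]))  # 2x2

Z_hat = W @ X + B  # the pre-activation values
assert np.allclose(Z_hat, [[0.8, 0.7],
                           [0.6, -0.3]])
```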
This is actually quite close to how it's normally done in practice (with a couple of wrinkles that I'll get into later). But, at least to me, it feels slightly ugly. After decades of computer programming, having the size of $B$ have to dynamically change to adjust to the batch size feels kind of wrong -- it has the same kind of smell as an issue with the typing of a parameter, in some way I can't quite put my finger on.3
Bias: version 2
There's another method, which feels more elegant to me, that is often used in more mathematical descriptions. Let's go back to our simple example from earlier, with a batch size of one:

$$\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}$$

The bias term is kind of like a weight. It goes into the neuron just like the products of the weights and the inputs do, and -- when we're training a neural network -- it's a trainable parameter just like the weights. The only difference is that it's not multiplied by an input. Well, multiplying a number by one is the same as not multiplying it by anything, so what happens if we extend the input matrix with some $1$s and just treat the bias terms as weights?

$$\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} & b_1 \\ w_{2,1} & w_{2,2} & w_{2,3} & b_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ 1 \end{pmatrix} = \begin{pmatrix} w_{1,1} x_1 + w_{1,2} x_2 + w_{1,3} x_3 + b_1 \\ w_{2,1} x_1 + w_{2,2} x_2 + w_{2,3} x_3 + b_2 \end{pmatrix} = \begin{pmatrix} z_1 + b_1 \\ z_2 + b_2 \end{pmatrix}$$
We've done the whole pre-activation calculation in a single step! And we can do the same trick with batches of multiple inputs.
That, to me, is a very mathematically elegant way to do it. If we call the version of $X$ that has been adjusted to have a row of $1$s at the bottom $X'$, and the weights adjusted to have the bias terms in a new column on the right $W'$, our pre-activation function calculations come down to this:

$$\hat{Z} = W' X'$$
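And here's the same calculation with the augmented matrices, sketched in NumPy with the same hypothetical numbers:

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.4]])
X = np.array([[3.0, 1.0],
              [1.0, 0.0],
              [2.0, 0.0]])
b = np.array([[0.5],
              [-1.0]])

# W' gets the biases as an extra column on the right; X' gets a
# matching row of 1s at the bottom.
W_prime = np.hstack([W, b])                          # 2x4
X_prime = np.vstack([X, np.ones((1, X.shape[1]))])   # 4x2

Z_hat = W_prime @ X_prime  # identical to W @ X + B from the last version
assert np.allclose(Z_hat, [[0.8, 0.7],
                           [0.6, -0.3]])
```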
For some reason, "hacking" the input and weight matrices to add on an extra row/column feels less inelegant to me than changing the size of the bias matrix to match the batch size. So I personally prefer this formulation. It also puts all of the learnable parameters for the neural network in one variable, which feels nice and clean.
Wrapping up
So, that's how we can use matrix maths to simplify the calculation of a neural network -- even when we're calculating batches of inputs in one go -- down to a one-line formula.
As I said, though, in real-world ML code, we actually use a formula that is similar to these, but differs in important ways. It normally looks like this:

$$Z = X W^T + B$$

You can see that we're using the calculation I found less elegant, but we've swapped around $X$ and $W$, and $W$ is also transposed (that's what the superscript $T$ is for). Why? I'll post about that next.
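In the meantime, here's a sketch of that layout with the same hypothetical numbers; note that each item in the batch is now a row of $X$ rather than a column:

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.4]])  # still one row of weights per neuron
b = np.array([0.5, -1.0])         # one bias per neuron

X = np.array([[3.0, 1.0, 2.0],
              [1.0, 0.0, 0.0]])   # 2x3: one *row* per item in the batch

# The weights are transposed to make the shapes line up, and NumPy's
# broadcasting adds b to every row of the result.
Z = X @ W.T + b
assert np.allclose(Z, [[0.8, 0.6],
                       [0.7, -0.3]])  # one row of pre-activations per item
```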
1. This, of course, means that matrix multiplication is different to normal multiplication because it's not commutative -- that is, $AB \neq BA$. Indeed, $BA$ might not even be meaningful, if the number of columns in $B$ is not equal to the number of rows in $A$. ↩
2. If you think about it, that explains why the number of columns in the first needs to match the number of rows in the second -- these are the lengths of the vectors of which we're taking the dot product, and that operation has to be between two vectors of the same length. ↩
3. Interestingly, the way it's done in practice, which I would expect to set off the same "type error" spidey-sense, seems OK to me. Not sure why. ↩