Messing around with fine-tuning LLMs, part 6 -- measuring memory usage more systematically

Posted on 10 July 2024 in AI, Python, Fine-tuning LLMS, TIL deep dives |

My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that cannot be trained on just one GPU.

I've been building up to this goal gradually; so far, I've:

The experiments I did last time around were aimed at finding out why, when DeepSpeed's estimate_zero3_model_states_mem_needs_all_live function said I would need just under 18 GiB of VRAM per GPU to train the 8B model without offloading anything, in reality I'd needed 40 GiB per GPU and still had to offload the optimizer.

At the end of the experiments, I'd found:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

This time around I wanted to take a more systematic look at the effects of the sequence length and of that environment variable on memory usage and training speed. I'd previously been assuming that VRAM usage would vary linearly with sequence length, but I had no evidence for that. And while it looked like training speed decreased with increasing sequence length, I didn't have any hard numbers. Time to fix that hole in my knowledge!

The first step: do some careful measurements of those numbers on the 0.5B model locally. That's what this post is about -- the next one will be for the 8B model running on Lambda Labs.
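
If you want to do the same kind of measurement yourself, the basic loop is simple enough. Here's a minimal sketch -- not the code from the post itself -- that uses PyTorch's peak-memory counters; build_trainer is a placeholder for your own helper that returns a transformers Trainer configured for the 0.5B model with the given maximum sequence length.

import torch

# `build_trainer` is a placeholder -- substitute your own function that returns a
# transformers Trainer set up for the 0.5B model at the given sequence length.
for seq_len in (256, 512, 1024, 2048):
    trainer = build_trainer(max_seq_length=seq_len)

    torch.cuda.reset_peak_memory_stats()
    result = trainer.train()

    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"seq_len={seq_len}: peak {peak_gib:.2f} GiB, "
          f"{result.metrics['train_samples_per_second']:.2f} samples/s")

    # free as much as possible before the next run
    del trainer
    torch.cuda.empty_cache()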

[ Read more ]


Messing around with fine-tuning LLMs, part 5 -- exploring memory usage

Posted on 5 July 2024 in AI, Python, Fine-tuning LLMS, TIL deep dives |

My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that cannot be trained on just one GPU.

I've been building up to this goal gradually; so far, I've:

This time around, I wanted to find out why I had to offload the optimizer, because it didn't seem like it should be necessary. Hugging Face helpfully document a DeepSpeed function that you can call to estimate the VRAM requirements for training a model with ZeRO, and when I ran it against the 8B model, I got this:

(fine-tune) ubuntu@130-61-28-84:~/fine-tune-2024-04$ python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-8B"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)'
[2024-05-17 23:19:31,667] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
Loading checkpoint shards: 100%|============================================================================================================| 4/4 [00:02<00:00,  1.61it/s]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 7504M total params, 525M largest layer params.
  per CPU  |  per GPU |   Options
  188.72GB |   1.96GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  335.50GB |   1.96GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  167.75GB |   3.70GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  335.50GB |   3.70GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   23.48GB |  17.68GB | offload_param=none, offload_optimizer=none, zero_init=1
  335.50GB |  17.68GB | offload_param=none, offload_optimizer=none, zero_init=0

It was saying that I only needed 17.68 GiB of VRAM per GPU with no optimizer offload -- but I'd had to offload it even though I had 40 GiB per GPU. Why was that? What was I doing wrong? The docs that mention that function also say:

these are just the memory requirements for the parameters, optimizer states and gradients, and you'll need a bit more for the CUDA kernels and activations

...but 22 GiB extra is more than "a bit more". I must have been misunderstanding something.
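
As an aside, the per-GPU numbers in that table are easy to reproduce. As far as I can tell from the estimator's source, the no-offload figure is just 18 bytes per parameter for the ZeRO-3 model states (params, gradients and Adam optimizer state), sharded across the GPUs, plus room to gather the largest layer in full -- treat this as my reading of the code rather than gospel:

total_params = 7504e6         # from the estimator's output above
largest_layer_params = 525e6
num_gpus = 8

largest_layer_bytes = 4 * largest_layer_params       # room to gather the largest layer, as the estimator counts it
sharded_state_bytes = 18 * total_params / num_gpus   # params + grads + optimizer state, sharded across GPUs

per_gpu_gib = (largest_layer_bytes + sharded_state_bytes) / 2**30
print(f"{per_gpu_gib:.2f}")   # 17.68 -- matches the "no offload" rows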

Digging into this took an embarrassing amount of time -- I started work on it shortly after publishing my last post in this series, so that's been more than a month! It's particularly embarrassing because the reason I shouldn't have trusted the number reported by that script was staring me in the face from the start, and involved something I'd discovered in my first explorations into this stuff.

Still, I learned a lot over the course of these investigations, so I think it's worth showing at least some of the journey. The post below is a distilled version of my lab notes and is a little rambling, but you might find it interesting if you're also digging into memory usage during LLM training as a beginner. If not, and you're looking for more carefully planned experiments and results, hopefully the next post in this series will have more of those :-)

Let's get going.

[ Read more ]


Messing around with fine-tuning LLMs, part 4 -- training cross-GPU

Posted on 21 May 2024 in AI, Python, Fine-tuning LLMS, TIL deep dives |

My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that literally cannot be trained on just one GPU. I've been building up to this goal gradually; so far, I've:

In that last step, I'd found a very useful page in the Hugging Face documentation. It split multi-GPU situations into three categories:

  1. Your model fits onto a GPU.
  2. Your model doesn't fit onto a GPU (but the layers taken individually do).
  3. The largest layer in your model is so big that it doesn't fit onto a GPU.

I'd interpreted that first point as "you can load the model onto just one GPU" -- that is, you can run inference on it because all of the parameters fit there (with some overhead for the data, activations, etc.). However, my experience showed that it meant "you can train the model on one GPU", which takes significantly more VRAM than inference does. The approaches they suggested for that category were all about loading the full model onto every GPU and training on multiple batches simultaneously -- good for speeding things up, but no help if you want multiple GPUs simply because you can't train the model on one GPU alone.

So my goal this time was to change my training strategy to use a technique that allowed the training of the entire model to be split across GPUs. Here's what I did.
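
To give a flavour of what that looks like in practice -- this is an illustrative sketch rather than the exact setup from the post -- with Hugging Face's Trainer you can hand a DeepSpeed ZeRO stage 3 config to TrainingArguments and then launch the script across the GPUs with the deepspeed launcher. The config details (offloading, dtype, "auto" values) and the script name are assumptions here, not necessarily what the post ends up using.

from transformers import TrainingArguments

# Illustrative ZeRO stage 3 config -- shards params, gradients and optimizer
# state across GPUs, and (here) offloads the optimizer state to CPU RAM.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    deepspeed=ds_config,  # the Trainer initialises DeepSpeed from this dict
)
# ...build your Trainer with these args as usual, then launch with something like:
#   deepspeed --num_gpus=8 finetune.py      (finetune.py is a placeholder name)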

[ Read more ]


Messing around with fine-tuning LLMs, part 3 -- moar GPUs

Posted on 15 May 2024 in AI, Python, Fine-tuning LLMS, TIL deep dives |

Having fine-tuned a 0.5B model on my own machine, I wanted to try the same kind of tuning, but with an 8B model. I would need to do that on a cloud provider, because my own machine with its RTX 3090 would definitely not be up to the job; I'd already tried out Lambda Labs and found that it worked pretty well.

Importantly, I was going to need to train with multiple GPUs in order to do this. The Lambda Labs options open to me last time around were:

The last of those looked like it should be ample for my needs, which I'd estimated at 160 GiB of VRAM (vs the 320 GiB on that monster machine). But before trying to use a multi-GPU setup to fine-tune a model I'd never worked with before, I figured it would be a good idea to try with the model I'd been using to date. The plan was:

Then once that was all done, I'd hopefully be in a good position to try the fine-tune for the bigger model.

Here's how it went.

[ Read more ]


Messing around with fine-tuning LLMs, part 2 -- to the cloud!

Posted on 28 April 2024 in AI, Python, Fine-tuning LLMS, TIL deep dives |

Having fine-tuned a 0.5B model on my own machine, I wanted to try the same kind of tuning, but with an 8B model. My experiments suggested that the VRAM required for the tuning was roughly linear in two meta-parameters -- the length of the samples and the batch size -- and I'd found resources online suggesting that it was also linear in the number of parameters.

The 16x scale going from 0.5B parameters to 8B would suggest that I would need 16x24GiB to run this fine-tune, which would be 384GiB. However, the chart I'd seen before suggested I could do it with a bit more than 160GiB -- that being the number they gave for a 7B parameter model.
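
Spelling out that back-of-the-envelope arithmetic -- just the numbers from the paragraph above, nothing clever:

vram_for_half_b = 24    # GiB used for the 0.5B fine-tune on the RTX 3090
scale = 8 / 0.5         # 16x more parameters

print(vram_for_half_b * scale)   # 384 GiB, if VRAM really were linear in parameter count
# ...versus the ~160 GiB the chart gave for a 7B model.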

What I clearly needed to do was find a decent-looking cloud GPU platform where I could start with a smaller machine and easily switch over to a larger one if it wasn't sufficient. Here are my first steps, running one of my existing fine-tune notebooks on a cloud provider.

[ Read more ]


Messing around with fine-tuning LLMs

Posted on 27 April 2024 in AI, Python, Fine-tuning LLMS, TIL deep dives |

Fine-tuning an LLM is how you take a base model and turn it into something that can actually do something useful. Base models are LLMs that have been trained on vast amounts of text to predict the next word, and they're really interesting to play with, but you can't really have a conversation with one. When you ask them to complete some text, they don't know whether you want it completed as part of a novel, a technical article, or an unhinged tweetstorm. (The obvious joke about which type of people the same applies to is left as an exercise for the reader.)

Chat-like AIs like ChatGPT become possible when a base model has been fine-tuned on lots of texts representing transcripts (real or fake) of conversations, so that it specialises in looking at texts like this:

Human: Hello!

Bot: Hello, I'm a helpful bot.  What can I do for you today?

Human: What's the capital city of France?

Bot:

...and can work out that the next word should be something like "The", then "capital", and so on, completing the sentence as "The capital of France is Paris. Is there anything else I can help you with?"
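
As a concrete, purely illustrative example of what that looks like in code: you can feed exactly that kind of transcript to a chat-tuned model as a plain text-generation prompt and let it continue from "Bot:". The model name below is a placeholder, not the model used in this series.

from transformers import pipeline

# Placeholder model name -- substitute any chat-fine-tuned causal LM you have to hand.
generator = pipeline("text-generation", model="your-chat-tuned-model")

prompt = (
    "Human: Hello!\n\n"
    "Bot: Hello, I'm a helpful bot.  What can I do for you today?\n\n"
    "Human: What's the capital city of France?\n\n"
    "Bot:"
)

completion = generator(prompt, max_new_tokens=30, return_full_text=False)
print(completion[0]["generated_text"])
# e.g. " The capital of France is Paris. Is there anything else I can help you with?"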

Getting a solid intuition for how this all works felt like an interesting thing to do, and here are my lab notes on the first steps.

[ Read more ]


Building an AI chatbot for beginners: part 2

Posted on 4 April 2023 in AI, Python |

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Welcome to the second part of my tutorial on how to build a chatbot using OpenAI's interface to their Large Language Models (LLMs)! You can read the introduction here, and the first part here. As a reminder, I'm writing this not because I'm an expert, but because I'm learning how to do it myself, and writing about it helps me learn faster. Caveat lector :-)

In this post, we'll give the bot some memory of the conversation so far.

At the end of the first part, we had a program that would accept input from a user, combine it with some static text to make a prompt that an LLM would complete in the character of a chatbot (stopping at the point that the chatbot should stop, and not trying to carry on the conversation), then send it to OpenAI's API specifying an LLM model, and print out the result.
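
For anyone who hasn't read part 1: the core of that program boils down to something like the sketch below. It uses the openai Python library as it was at the time (pre-1.0), and the model name, prompt text and stop sequence are illustrative rather than exactly what the post uses.

import openai

openai.api_key = "sk-..."  # your API key

PROMPT_TEMPLATE = (
    "The following is a conversation between a human and a friendly, helpful bot.\n\n"
    "Human: {user_input}\n\n"
    "Bot:"
)

user_input = input("You: ")
response = openai.Completion.create(
    model="text-davinci-003",   # a completion model of that era
    prompt=PROMPT_TEMPLATE.format(user_input=user_input),
    max_tokens=150,
    stop=["Human:"],            # stop before it starts writing the human's next line
)
print("Bot:", response["choices"][0]["text"].strip())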

[ Read more ]


Building an AI chatbot for beginners: part 1

Posted on 19 March 2023 in AI, Python |

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Welcome to the first part of my tutorial on how to build a chatbot using OpenAI's interface to their Large Language Models (LLMs)! You can read the introduction here.

If you're reading this and want to get the best out of it, I strongly recommend that you run the code on your own machine as you go along: trust me, it will stick in your mind much better if you do that.

The goal in this post is to write a basic bot script that accepts user input, and just bounces it off an OpenAI LLM to generate a response.

[ Read more ]


Building an AI chatbot for beginners: part 0

Posted on 19 March 2023 in AI, Python |

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Like a lot of people, I've been blown away by the capabilities of Large Language Model (LLM) based systems over the last few months. I'm using ChatGPT regularly for all kinds of things, from generating basic code to debugging errors to writing emails.

I wanted to understand more about how these tools worked, and feel strongly that there's no better way to learn something than by doing it. Building an LLM is, at least right now, super-expensive -- in the millions of dollars (although maybe that will be coming down fast?). It also requires a lot of deep knowledge to get to something interesting. Perhaps something to try in the future, but not right now.

However, using LLMs to create something interesting -- that's much easier, especially because OpenAI have a powerful API, which provides ways to do all kinds of stuff. Most relevantly, they provide access to a Completion API. That, as I understand it, is the lowest-level way of interacting with an LLM, so building something out of it is probably the best bang for the buck for learning.

Over the last few weeks I've put together a bunch of things I found interesting, and have learned a lot. But it occurred to me that an even better way to learn stuff than by building it is to build it, and then explain it to someone else, even if that person is an abstract persona for "someone out there on the Internet". So: time for an LLM chatbot tutorial!

[ Read more ]


Python code to generate Let's Encrypt certificates

Posted on 16 November 2018 in Python, Cryptography, TIL deep dives |

I spent today writing some Python code to request certificates from Let's Encrypt. I couldn't find much in the way of simple sample code out there, so I thought it would be worth sharing some. It uses the acme Python package, which is part of the certbot project.

It's worth noting that none of this is useful stuff if you just want to get a Let's Encrypt certificate for your website; scripts like certbot and dehydrated are what you need for that. This code and the explanation below are for people who are building their own systems to manage Let's Encrypt certs (perhaps for a number of websites) or who want a reasonably simple example showing a little more of what happens under the hood.
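
The post itself has the full details, but to give a flavour of what the setup looks like with the acme package, here's a condensed sketch of the first steps (account key, client, account registration and order creation), closely following the package's own examples rather than the post's exact code. The staging directory URL, email address and domain are placeholders, and the http-01 challenge handling and order finalisation are left out.

import josepy as jose
from acme import client, messages
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

DIRECTORY_URL = "https://acme-staging-v02.api.letsencrypt.org/directory"  # staging, not production

# Account key, wrapped for the acme library
account_key = jose.JWKRSA(key=rsa.generate_private_key(public_exponent=65537, key_size=2048))

# Build the ACME client from the CA's directory
net = client.ClientNetwork(account_key, user_agent="my-acme-experiments")
directory = messages.Directory.from_json(net.get(DIRECTORY_URL).json())
acme_client = client.ClientV2(directory, net=net)

# Register an account (agreeing to the terms of service)
acme_client.new_account(
    messages.NewRegistration.from_data(email="you@example.com", terms_of_service_agreed=True)
)

# Key and CSR for the certificate itself
cert_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "www.example.com")]))
    .add_extension(x509.SubjectAlternativeName([x509.DNSName("www.example.com")]), critical=False)
    .sign(cert_key, hashes.SHA256())
)
order = acme_client.new_order(csr.public_bytes(serialization.Encoding.PEM))
# ...then answer the order's http-01 challenges and call poll_and_finalize() -- see the post.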

[ Read more ]