Messing around with fine-tuning LLMs, part 2 -- to the cloud!

Posted on 28 April 2024 in Programming, Python, AI, Fine-tuning LLMs

Having fine-tuned a 0.5B model on my own machine, I wanted to try the same kind of tuning, but with an 8B model. My experiments suggested that the VRAM required for the tuning was roughly linear in two meta-parameters -- the length of the samples and the batch size -- and I'd found resources online suggesting that it was also roughly linear in the number of parameters.

The 16x scale going from 0.5B parameters to 8B would suggest that I would need 16x24GiB to run this fine-tune, which would be 384GiB. However, the chart I'd seen before suggested I could do it with a bit more than 160GiB -- that being the number they gave for a 7B parameter model.
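Just to sanity-check that arithmetic, here's the back-of-envelope calculation in Python. It's very much a sketch: it assumes the roughly-linear scaling actually holds, which is exactly the thing I'm not sure about.

# Back-of-envelope VRAM estimates, assuming roughly linear scaling with
# parameter count -- treat these as guesses, not facts.
local_vram_gib = 24      # what the 0.5B fine-tune fitted into locally
local_params_b = 0.5
target_params_b = 8.0

naive = local_vram_gib * (target_params_b / local_params_b)
print(f"Naive extrapolation from my 0.5B run: {naive:.0f}GiB")      # 384GiB

chart_7b_gib = 160       # the figure the chart gave for a 7B model
from_chart = chart_7b_gib * (target_params_b / 7)
print(f"Scaling the chart's 7B figure to 8B: {from_chart:.0f}GiB")  # ~183GiB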

What I clearly needed to do was find a decent-looking cloud GPU platform where I could start with a smaller machine and easily switch over to a larger one if it wasn't sufficient. Here are my first steps, running one of my existing fine-tune notebooks on a cloud provider.

Choosing a provider

The first thing I needed was a list of providers. I'm very used to AWS, but they have a reputation for being very expensive for GPUs, so I decided to exclude them. The options I looked at were:

- Paperspace
- Lambda Labs
- Vast.ai
- Vultr

What follows isn't a review of these hosts, just the impressions I got when taking a look.

Paperspace

I was initially impressed by the clarity of their pricing page, with the different cards listed with a price for each. It wasn't immediately clear to me how much multi-GPU instances cost from the table itself, but then I spotted that at the bottom it says "Multiply the specs and price of each type above by the number of GPUs." So, simple enough -- double the price for an instance with GPU X to get the price for an instance with two, and you'll also get double the RAM, double the CPUs, and so on.

So what about the pricing? Well, initially it looked pretty cheap -- $2.24/hour for an H100. But then I saw an asterisk next to the price and looked at the small print:

$2.24/hour pricing is for a 3-year commitment. On-demand pricing for H100 is $5.95/hour under our special promo price.

That left a somewhat bad taste in my mouth. It's 100% reasonable to give deep discounts if someone is committing to rent a server long-term -- we're big users of AWS Reserved Instances at PythonAnywhere, and give discounts for annual subscriptions -- but putting the long-term price up as the headline price and hiding the true on-demand price in 8-point small print is, to my mind, a bit underhand.

Maybe it's standard practice in this industry, though? Let's see the others.

Lambda Labs

At the top of their front page, there's a "Cloud" menu that has "on demand" and "reserved" as options, so you go straight to a page for the kind of rental you want, which is nice (and hopefully avoids the problem I had with Paperspace). On the "On demand" page, there's a link to "On-demand GPU cloud pricing". They list the different machine types they have in a table, and there's a price for each. I saw an asterisk there, and thought "uh-oh", but it just goes to the caveat that the prices exclude applicable sales tax, which is entirely reasonable. After all, they can't know how much tax they're going to have to charge until they know where you are.

They quote $2.49 for a single H100 machine, which is an on-demand price that's not much more than the price at Paperspace for a three-year commitment! And they have a good range of other machine sizes too. One useful baseline for me is that they have a 1x Nvidia A10 machine, which has 24GiB of VRAM, for $0.75/hour. I have spun up similar machines on AWS a number of times for various experiments, and they cost about $1/hour there. So AWS's reputation for being pricey is not undeserved.

Vast.ai

Looking more closely at their site, I remembered where I'd heard of them before. They're not a cloud provider -- instead, they allow people with GPUs to rent them out, with Vast.ai acting as a kind of intermediary matching buyers to sellers. It's a really clever idea, and I'll keep an eye on them, but they're not what I'm looking for right now.

Vultr

They're more of a general PaaS than a GPU-specific cloud provider. They have a huge array of instance types; unfortunately on their pricing page they just have prices for a "36-month, 100% prepaid reserved instance contract". They're a little more up-front about it than Paperspace, but they don't seem to have on-demand "pay for what you use" pricing.

Decision time

It's pretty obvious which looks best, at least at the superficial level I've investigated so far. Let's use Lambda Labs!

Signing up and spinning up an instance

Signing up was the normal kind of thing, but getting payments set up was a little tricky; I use debit cards for pretty much everything, and their card entry form happily let me enter the details of one of those. However, when I did that and then confirmed the charge in my online banking app, Lambda Labs gave me a generic "something went wrong" error. The error linked to a page where the first item said that they only accept credit cards. That's kind of odd -- most companies accept debit cards. Indeed, until our monthly spend hit the limit for our card, that's how we paid for AWS at PythonAnywhere.

Still, I'm sure they know their business and have a good reason for it. My guess is that it's something to do with card fraud and/or chargebacks, or with people trying to challenge large bills when they left big instances running. I've never seen chargeback-related differences between credit and debit cards when on the other side of the transaction, but perhaps they have.

Anyway, now that I was logged in and set up, I decided to start something super-cheap just to test the water. There was a limited set actually available to me -- of the 14 instance types on the popup, only 4 were ones I could start, and the others were all greyed out with "Request instance" as an option. Maybe that's due to availability, or maybe it's some kind of limitation on new accounts. However, the options I had were:

PCIe is, of course, the normal standard for connecting a card to a motherboard; a quick Google suggests that SXM4 is an Nvidia-specific thing giving faster connectivity -- from a cursory glance, it looks like the GPU is basically in a socket on the motherboard just like the CPU. Cool!

Anyway -- those options looked decent enough for what I was trying to do, so I decided to spin up the cheapest option, the 1x A10, and give it a whirl.

After selecting the instance type and the region (Virginia, as it's a bit closer to me than California), I was asked to select a filesystem to attach or to create one. I later discovered that this was essentially some disk that was mounted into my home directory. It sounds kind of useful -- you could spin up a cheap instance, download models and training sets and stuff, and then shut that instance down and spin up an expensive one for a training run, so that you don't waste time downloading stuff and leaving lots of GPUs idle.
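For example -- and this is just a sketch of the idea, assuming the filesystem really does show up as a directory inside the home directory; the mount path and the model name below are placeholders, not anything from Lambda Labs' docs -- you could pre-download a base model from the Hugging Face Hub onto it from a cheap instance:

# Sketch: pre-download a base model onto the persistent filesystem from a
# cheap instance, so a pricier multi-GPU instance doesn't sit idle while
# gigabytes of weights come down.  Path and model name are placeholders.
from pathlib import Path
from huggingface_hub import snapshot_download

cache_dir = Path.home() / "my-first-filesystem" / "hf-cache"
snapshot_download(repo_id="Qwen/Qwen1.5-0.5B", cache_dir=cache_dir)

# Later, on the big instance with the same filesystem attached, passing the
# same cache_dir to from_pretrained() means the download step is a no-op.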

That said, they do have this strange caveat:

Data stored in filesystems in specific regions may be inaccessible after you terminate due to limited instance availability.

If I decide to stick with them, I think I'll need to look into that further.

Update: I later discovered that it's more complicated than that. They can apparently store packages installed with pip, apt and conda. Not sure how they do that, though, given that they really are mounted into the home directory. The weird thing was that the one that I created for this blog post was somehow over 300GiB in size; at US$0.20/GiB per month, that cost me US$10 in the week or so before I spotted it in the billing system and deleted it. Something to keep a close eye on! However, the next time through I saw a "Don't attach a filesystem" option; perhaps that's best for now.

Anyway, I created one called my-first-filesystem and on to the next step. They wanted an SSH public key, which is a sensible quick and easy way to get login credentials sorted, and then they dump a bunch of licenses to accept in front of you (including something telling you that if you use it for mining crypto they'll shut you down, which is reasonable given the amount of fraud in that space), and then we're done! "Launching instance". A minute or so later, it appeared on the list of instances with status "booting". After a few minutes, an IP address appeared; at about 6 minutes after starting the instance I tried SSHing in there (with the username ubuntu) and was in!

nvtop was already installed and showed that the hardware was correct. bashtop wasn't, but sudo worked so I could install it. By the time that was done, the instance was showing up as "Running".

Now, they say that you can launch Jupyter notebooks on your instance; it wasn't obvious to me how to do that, but after a quick Google I discovered that the way to do it is the "Cloud IDE" column at the far right of the instances table, which has a "Launch" link. Clicking that spun up a JupyterLab instance, which is great!

I figured that the best way to verify that this instance worked and did what I wanted would be to try running one of the notebooks that I already had.

Running my own notebook in the cloud

A quick git clone of the repo was enough to get the notebooks in. However, I needed a virtualenv with the packages from my requirements.txt installed. It was quick and easy to get virtualenvwrapper installed and sort that out.

sudo apt install virtualenvwrapper
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
mkvirtualenv fine-tune
pip install -r requirements.txt

There was a decent network connection; packages were coming down at about 200MiB/s. So everything was installed in a few minutes.

Unfortunately the new virtualenv didn't appear in the list of available kernels in JupyterLab -- but then I remembered the extra step that's needed on PythonAnywhere to register a virtualenv as a kernel:

ipython kernel install --name "venv-name" --user

I ran that, and then refreshed the page showing JupyterLab, and it was available!

I switched the notebook over to using it, and ran it... and it worked! Slightly disappointingly, it was actually slower than my local machine -- I was running the second notebook from the last blog post, and while I'd got about 90 tokens/second locally for the initial inference on the untuned base model, the cloud instance managed 63. And the training run took 2h21m rather than 1h36m.

That makes sense, though, thinking about it -- although the A10 and the RTX 3090 are the same generation of Nvidia chip ("Ampere"), the A10 (assuming this comparison page can be trusted) has lower memory bandwidth and fewer shader units.
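For anyone wondering how I'm getting the tokens/second numbers, it's nothing more sophisticated than timing a generate call and dividing -- roughly this kind of thing (a sketch rather than the exact cell from my notebook; the model name and prompt are placeholders):

# Rough-and-ready tokens/second measurement: time a single generate() call
# and divide the number of new tokens by the elapsed time.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-0.5B"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).to("cuda")

inputs = tokenizer("Tell me about fine-tuning LLMs.", return_tensors="pt").to("cuda")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/second")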

Conclusion

So, that was a good first step! I took a notebook that I had run locally to fine-tune a 0.5B model on a dataset to get it to respond to instructions, and ran it on a roughly equivalent machine on a cloud provider. All-in, it took just under three hours of instance runtime, so it cost $2.16 (plus, I imagine, local VAT at 23% because I'm in Portugal).

I think that the next step should be to run the same notebook on a much more powerful multi-GPU machine, bumping up the batch size so that I use more memory than fits into one of those GPUs. That will get me some exposure to multi-GPU training. Do I need to do anything special for that? Or does CUDA abstract that all away?

Let's find out next time.