LLM Quantisation Weirdness

Posted on 27 February 2024 in AI

I bought myself an Nvidia RTX 3090 for Christmas to play around with local AI models. Serious work needs larger, more powerful cards, and it's easy (and not that expensive) to rent such cards by the minute from the likes of Paperspace. But the way I see it, I'm not going to be doing any serious work -- and what I really want to do is be able to run little experiments quickly and easily without worrying about spinning up a machine, getting stuff onto it, and so on.

One experiment I tried the other day was an attempt to build a mental model of how model size and quantisation affect the quality of responses from LLMs. Quantisation means taking a model that stores each of its parameters in, say, 16 bits and running it with those parameters rounded to a lower-precision representation of eight bits, four bits, or even fewer -- people have found that it often has a surprisingly small effect on output quality, and I wanted to play with that. Nothing serious or in-depth -- just trying stuff out with different model sizes and quantisations, and running a few prompts through them to see how the outputs differed.
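To make the idea concrete, here is a minimal sketch of one simple scheme, "absmax" round-to-nearest quantisation, in plain Python. It is an illustration only: the function names and sample weights are made up for this example, and real 4-bit and 8-bit loaders use more elaborate block-wise schemes than this.

```python
def quantise_absmax(weights, bits):
    """Map floats to signed integer codes using a single shared scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                     # all-zero input: any scale works
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    codes, scale = quantise_absmax(weights, bits)
    restored = dequantise(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical sample weights, purely for illustration.
sample = [0.12, -0.53, 0.98, -0.05, 0.31]
for bits in (8, 4):
    print(f"{bits}-bit max rounding error: {max_error(sample, bits):.4f}")
```

With fewer bits there are fewer levels to round to, so the worst-case rounding error grows; the surprise is how little that seems to matter to output quality in practice.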

I was comparing three sizes of the Code Llama Instruct model, with different quantisations:

codellama/CodeLlama-7b-Instruct-hf, which has 7b parameters, in "full-fat", 8-bit and 4-bit
codellama/CodeLlama-13b-Instruct-hf, which has 13b parameters, in 8-bit and 4-bit
codellama/CodeLlama-34b-Instruct-hf, which has 34b parameters, in 4-bit
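For anyone wanting to try something similar, the sketch below shows one way these six configurations could be loaded. It assumes the Hugging Face `transformers` library with `bitsandbytes` and `accelerate` installed, and it assumes "full-fat" means 16-bit floats; the helper names `quantisation_kwargs` and `load_model` are hypothetical, and this is not necessarily how the notebook in the repo does it.

```python
# (model id, bits per parameter); 16 stands in for "full-fat" (an assumption).
CONFIGS = [
    ("codellama/CodeLlama-7b-Instruct-hf", 16),
    ("codellama/CodeLlama-7b-Instruct-hf", 8),
    ("codellama/CodeLlama-7b-Instruct-hf", 4),
    ("codellama/CodeLlama-13b-Instruct-hf", 8),
    ("codellama/CodeLlama-13b-Instruct-hf", 4),
    ("codellama/CodeLlama-34b-Instruct-hf", 4),
]


def quantisation_kwargs(bits):
    """Return the BitsAndBytesConfig keyword arguments for a bit width."""
    if bits == 4:
        return {"load_in_4bit": True}
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 16:
        return {}  # no quantisation config; load in half precision instead
    raise ValueError(f"unsupported bit width: {bits}")


def load_model(model_id, bits):
    """Load a tokenizer and model at the requested precision (needs a GPU)."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    load_args = {"device_map": "auto"}
    kwargs = quantisation_kwargs(bits)
    if kwargs:
        load_args["quantization_config"] = BitsAndBytesConfig(**kwargs)
    else:
        load_args["torch_dtype"] = torch.float16

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_args)
    return tokenizer, model
```

Calling `load_model(*CONFIGS[2])` would, under those assumptions, download and load the 7b model in 4-bit.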

Code Llama is a model from Meta. The Instruct variant is designed to receive questions about programming (with specific formatting) and to reply with code; the "-hf" suffix on the names just means the weights are in Hugging Face Transformers format. I chose those particular quantisations because the 13b model wouldn't fit in the 3090's 24GiB of VRAM without quantisation to 8-bit or lower, and the 34b model would only fit if it was 4-bit quantised.
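A rough back-of-the-envelope check of those limits, counting only the weights and using the nominal parameter counts. This ignores activations, the KV cache and any quantisation overhead, so the real requirements are somewhat higher; the function name is made up for this sketch, and treating "full-fat" as 16-bit is an assumption.

```python
GIB = 1024 ** 3
VRAM_GIB = 24  # RTX 3090


def weight_gib(params_billion, bits):
    """Approximate memory for the weights alone, in GiB."""
    return params_billion * 1e9 * bits / 8 / GIB


for params in (7, 13, 34):
    for bits in (16, 8, 4):
        size = weight_gib(params, bits)
        verdict = "fits" if size < VRAM_GIB else "too big"
        print(f"{params}b at {bits}-bit: {size:5.1f} GiB ({verdict})")
```

On these numbers the 13b model at 16-bit comes in just over 24GiB, and the 34b model is too big at 8-bit but comfortably under the limit at 4-bit.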

The quality of the response to my test question was not too bad with any of these, apart from codellama/CodeLlama-34b-Instruct-hf in 4-bit, which was often (but not always) heavily glitched with missing tokens -- that is, it was worse than codellama/CodeLlama-7b-Instruct-hf in 4-bit. That surprised me!

I was expecting quantisation to worsen the results, but not to make a larger model worse than a smaller one at the same level of quantisation. I've put a repo up on GitHub to see if anyone can repro these results, and to find out if anyone has any idea why it's happening.

Here's the 7b, 4-bit result from the notebook:

[Image: 7b, 4-bit result]

...and the 34b, 4-bit result -- it generates a mixture of Java and Python, and the initial sentence is cut off:

[Image: 34b, 4-bit result]

In other runs I've seen it output glitched JavaScript or just have huge chunks of the output missing.

If there's any interest, I might try to build on this and try it across a larger set of prompts to see if it's a general issue, or just somehow specific to the one I used.

Thoughts welcome!