Do reasoning LLMs need their own Philosophical Language?
A few days ago, I saw a cluster of tweets about OpenAI's o1 randomly switching to Chinese while reasoning -- here's a good example. I think I've seen it switch languages a few times as well. Thinking about it, Chinese -- or any other language written in a non-Latin alphabet -- would be particularly noticeable, because those notes describing what it's thinking about flash by pretty quickly, and you're only really likely to notice something weird if it's immediately visibly different to what you expect. So perhaps it's spending a lot of its time switching from language to language depending on what it's thinking about, and then it translates back to the language of the conversation for the final output.
Why would it do that? Presumably certain topics are covered better in its training set in specific languages -- it will have more on Chinese history in Chinese, Russian history in Russian, and so on. But equally possibly, some languages are easier for it to reason about certain topics in. Tiezhen Wang, a bilingual AI developer, tweeted that he preferred doing maths in Chinese "because each digit is just one syllable, which makes calculations crisp and efficient". Perhaps there's something similar there for LLMs.
That got me thinking about the 17th-century idea of a Philosophical Language. If you've read Neal Stephenson's Baroque Cycle books, you'll maybe remember it from there -- that's certainly where I heard about it. The idea was that natural human languages were not very good for reasoning about things, and the solution would be to create an ideal, consciously-designed language that was more rational. Then philosophers (or scientists as we'd say these days) could work in it and get better results.
There are echoes of that in E' (E-Prime), another one I picked up on from fiction (this time from The Illuminatus! Trilogy). It's English without the verb "to be", the idea being that most uses of that verb are unnecessarily foggy and would be better replaced. "Mary is a doctor" implies that her job is the important thing about her, whereas "Mary practices medicine" is specific that it's just one aspect of her. What I like about it is that it -- in theory -- gets you a more "Philosophical" language with a really small tweak rather than a complete redesign.
What I'm wondering is, are human languages really the right way for LLMs to be reasoning if we want accurate results quickly? We all know how easy it is to be bamboozled by words, either our own or other people's. Is there some way we could construct a language that would be better?
The baroque philosophers ultimately failed, and modern scientists tend to switch to mathematics when they need to be precise ("physics is a system for translating the Universe into maths so that you can reason about it" -- discuss).
But perhaps by watching which languages o1 is choosing for different kinds of reasoning we could identify pre-existing (grammatical/morphological/etc) structures that just seem to work better for different kinds of tasks, and then use that as a framework to build something on top of. That feels like something that could be done much more easily now than it could in the pre-LLM world.
Or maybe a reasoning language is something that could be learned as part of a training process; perhaps each LLM could develop its own, after pre-training with human languages to get it to understand the underlying concept of "language". Then it might better mirror how LLMs work -- its structures might map more directly to the way transformers process information. It might have ways of representing things that you literally could not describe in human languages.
Think of it as a machine code for LLMs, perhaps. Is it a dumb idea? As always, comments are open :-)
Writing an LLM from scratch, part 5 -- more on self-attention
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it. In retrospect, it was kind of adorable that I thought I could get it all done over my Christmas break, given that I managed just the first two-and-a-half chapters! However, now that the start-of-year stuff is out of the way at work, hopefully I can continue. And at least the two-week break since my last post in this series has given things some time to stew.
In the last post I was reading about attention mechanisms and how they work, and was a little thrown by the move from attention to self-attention; in this blog post I hope to get that all fully sorted so that I can move on to the rest of chapter 3, and then the rest of the book. Raschka himself said on X that this chapter "might be the most technical one (like building the engine of a car) but it gets easier from here!" That's reassuring, and hopefully it means that my blog posts will speed up too once I'm done with it.
But first: on to attention and what it means in the LLM sense.
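To make the distinction concrete for myself before diving in, here's a minimal sketch of simplified self-attention -- my own toy illustration with made-up numbers, not the book's code: score every token's embedding against every other, softmax the scores into weights, and take weighted sums.

```python
import torch

# Toy input: four tokens, each represented by a made-up 3-dimensional embedding.
inputs = torch.tensor([
    [0.43, 0.15, 0.89],
    [0.55, 0.87, 0.66],
    [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33],
])

# Attention scores: the dot product of every embedding with every other one.
scores = inputs @ inputs.T                 # shape (4, 4)

# Normalise each row into attention weights that sum to 1.
weights = torch.softmax(scores, dim=-1)

# Each context vector is a weighted sum over all the input embeddings.
context = weights @ inputs                 # shape (4, 3)
print(context)
```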
Happy New Year!
A very happy New Year to all for 2025!
Just a quick note to say that I'm not starting one of my periodic blogging holidays. The New Year and then getting back to work afterwards have kept me quite busy, so I've not had much time to continue with my readthrough of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I was hoping to get some done this weekend, but personal commitments got in the way.
I have had some downtime, but just enough to read some fiction -- I'm rereading Connie Willis's excellent "Oxford Time Travel" series, out of order -- I started with the wonderfully whimsical "To Say Nothing of the Dog", and am now most of the way through the significantly less cheerful but deeply moving "Doomsday Book".
I'm also having a lengthy pair of on-and-off discussions with Claude and ChatGPT o1 about the history of attention mechanisms, so hopefully when I get back to the LLM book I'll be better prepared :-)
An AI chatroom (a few steps further)
Still playing hooky from "Build a Large Language Model (from Scratch)" -- I was on our support rota today and felt a little drained afterwards, so decided to finish off my AI chatroom. The codebase is now in a state where I'm reasonably happy with it -- it's not production-grade code by any stretch of the imagination, but the structure is acceptable, and it has the basic functionality I wanted:
- A configurable set of AIs
- Compatibility with the OpenAI API (for OpenAI itself, Grok and DeepSeek) and with Anthropic's (for Claude) -- see the sketch after this list.
- Persistent history so that you can start a chat and have it survive a restart of the bot.
- Pretty reasonable behaviour of the AIs, with each of them building on what the others say.
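Under the hood there's not much magic: each AI is just a thin wrapper around its provider's client. Here's a rough sketch of the two call paths, assuming the official openai and anthropic Python packages -- the function names and structure are my own illustration, not the repo's actual code:

```python
from anthropic import Anthropic
from openai import OpenAI

# Clients pick up OPENAI_API_KEY / ANTHROPIC_API_KEY from the environment;
# OpenAI-compatible providers like Grok or DeepSeek just need their own
# base_url and key passed to OpenAI(...) instead.
openai_client = OpenAI()
anthropic_client = Anthropic()

def ask_openai_compatible(client, model, history):
    # history is a list of {"role": ..., "content": ...} dicts
    response = client.chat.completions.create(model=model, messages=history)
    return response.choices[0].message.content

def ask_claude(model, system_prompt, history):
    # Anthropic takes the system prompt separately from the message list,
    # and requires max_tokens to be set explicitly.
    response = anthropic_client.messages.create(
        model=model,
        system=system_prompt,
        messages=history,
        max_tokens=1024,
    )
    return response.content[0].text
```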
An AI chatroom (beginnings)
So, I know that I decided I would follow a "no side quests" rule while reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", but rules are made to be broken.
I've started building a simple Telegram bot that can be used to chat with multiple AI models at the same time, the goal being to allow them to have limited interaction with each other. I'm not sure if it's going to work well, and it's very much a work-in-progress -- but here's the repo.
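The Telegram side is pretty thin; the skeleton looks roughly like the sketch below, using python-telegram-bot. The handler name and the TELEGRAM_BOT_TOKEN environment variable are just assumptions for illustration, not what the repo actually does:

```python
import os

from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters

async def on_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # In the real bot this is where each configured AI would get a chance to
    # reply; here we just echo the message back to show the plumbing.
    await update.message.reply_text(f"You said: {update.message.text}")

def main() -> None:
    app = ApplicationBuilder().token(os.environ["TELEGRAM_BOT_TOKEN"]).build()
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, on_message))
    app.run_polling()

if __name__ == "__main__":
    main()
```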
More info below the fold.
Writing an LLM from scratch, part 4
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.
Here's a link to the previous post in this series.
Today I read through chapter 3, which introduces and explains attention mechanisms -- the core architecture that allows LLMs to "understand" the meaning of text in terms of the relationships between words. This feels like the core of the book; at least, for me, it's the part of the underlying workings of LLMs that I understand the least. I knew it was something to do with the LLM learning which other words to pay attention to when looking at a particular one, but that's pretty much it.
And it's a tough chapter. I finished with what I felt was a good understanding at a high level of how the calculations that make up self-attention in an LLM work -- but not of how self-attention itself works. That is, I understood how to write one, in terms of the steps to follow mathematically, but not why that specific code would be what I would write or why we would perform those mathematical operations.
I think this was because I tried to devour it all in a day, so I'm going to go through it much more slowly, writing up notes on each section each day.
Today, I think, I can at least cover the historical explanation of how attention mechanisms came to be in the first place, because that seems reasonably easy to understand.
Writing an LLM from scratch, part 3
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.
Here's a link to the previous post in this series.
Today I was working through the second half of Chapter 2, "Working with text data", which I'd started just before Christmas. Only two days off, so it was reasonably fresh in my mind :-)
Writing an LLM from scratch, part 2
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and planning to post every day (or at least, every day I read some of it -- Christmas day I suspect I'll not be posting) with notes on what I found interesting.
Here's a link to the previous post in this series.
I had been planning to do a chapter a day, but that is looking optimistic for such a dense book! So today, I've read the first half or so of Chapter 2, "Working with text data". This gives an overview of the pre-processing that happens to text before it hits the LLM, goes on to describe a simple tokenization system (complete with source code), and then briefly covers the byte pair encoding method that we'll actually be using for the LLM.
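As a taster, here's roughly what byte pair encoding looks like in practice via the tiktoken library that the book uses -- a quick sketch of my own rather than the book's code, with an illustrative input string:

```python
import tiktoken

# The GPT-2 byte pair encoding.
tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? In the sunlit terraces of someunknownPlace."
ids = tokenizer.encode(text)
print(ids)

# BPE is fully reversible, so we get the original text back, even for words
# (like "someunknownPlace") that were never in the training data.
print(tokenizer.decode(ids))
```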
Writing an LLM from scratch, part 1
Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting.
Today, it was what is most likely the easiest bit: the introductory chapter 1, "Understanding large language models".
Messing around with fine-tuning LLMs, part 10 -- finally training the model!
For many months now, I've intermittently been working on building code to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I've been taking my time and letting myself be diverted by anything that looked interesting along the way, because the goal was to learn as much as possible about how this stuff works rather than to achieve a simple goal.
But all good things must come to an end. In this post I'll document the final steps of fine-tuning the model and getting it posted on Hugging Face.
Just to summarise what's happened so far, I have:
- Fine-tuned a 0.5B model on my own machine.
- Done the same, but in the cloud using Lambda Labs.
- Run some multi-GPU training, but using the GPUs to run larger batches for the 0.5B model -- which in turn means training faster -- rather than to train a larger model.
- Successfully fine-tuned the 8B model across multiple GPUs using ZeRO and DeepSpeed, but with the optimizer offloaded to CPU -- a sketch of what that kind of config looks like follows this list.
- Done some initial experiments into memory usage for a 0.5B model locally to get some ideas as to why I had to offload the optimizer.
- Measured memory usage more systematically for the 0.5B model, also locally, to find out how it behaves with different sequence lengths.
- Made similar measurements at different sequence lengths for the 8B model.
- Measured the effect of batch sizes on memory usage, with a sidetrack into looking at Liger Kernel, a new and easy-to-use replacement for the default CUDA kernels used for training that promises (and delivers) better memory usage and performance.
- Investigated how gradient checkpointing works, in the hope that it might allow me to trade off GPU processing for memory usage and get a larger batch size (meaning that each training iteration was slower, but the overall train took less time). Sadly, those hopes were dashed.
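On that CPU offload point: it's driven entirely by the DeepSpeed config. A minimal sketch of the relevant fragment, assuming ZeRO stage 3, bf16 and "auto" batch settings (the precise values I used may differ), looks something like this when passed to Hugging Face's Trainer via TrainingArguments(deepspeed=ds_config):

```python
# A minimal DeepSpeed config for ZeRO with the optimizer offloaded to CPU.
# Stage, dtype and the "auto" values are assumptions for illustration.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```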
There are still a bunch of open questions left from those earlier posts, and I'm planning to do a summary post listing what those are (and, indeed, what I've learned). But I figured it really was time to wrap up the final step. Check it out:
$ python chat_with_model.py gpjt/Meta-Llama-3-8B-openassistant-guanaco-llama2-format
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|============================================================================================================| 4/4 [00:03<00:00, 1.31it/s]
You: Please tell me about Leonardo da Vinci.
Bot: Sure, Leonardo da Vinci was an Italian polymath and artist who lived in the
late 15th and early 16th centuries. He is considered one of the greatest minds of
the Renaissance era, and is remembered as one of the most influential artists of
all time. Leonardo was a master of many fields, including painting, sculpting,
architecture, engineering, and science. He is best known for his paintings, such
as the Mona Lisa and The Last Supper, and for his inventions, such as the flying
machine and the helicopter. Leonardo was also a skilled mathematician, and made
important contributions to fields such as cryptography and astronomy. He died at
the age of 67, without ever fully completing any of his many projects.
251 tokens in 6.11s: 41.10 tokens/s
You: Interesting! Could you expand on what the Mona Lisa is?
Bot: The Mona Lisa is a painting by Leonardo da Vinci that depicts a portrait
of a woman from the Renaissance era. It is one of the most famous paintings in
the world, and is considered one of the most recognizable pieces of art ever
created. The painting has been the subject of numerous studies and
interpretations, and is often used as an example of the aesthetic and
intellectual values of the Renaissance period.
79 tokens in 2.01s: 39.40 tokens/s
You: Where is it? I would really like to see it.
Bot: The Mona Lisa is located in the Louvre Museum in Paris, France. It is one
of the most visited museums in the world, and is a must-see destination for
anyone interested in art and culture. The Louvre Museum is open to the public,
and admission is free on the first Sunday of every month. Tickets can also be
purchased online or at the museum entrance. There are also many other museums and
attractions in Paris, such as the Eiffel Tower and the Notre Dame Cathedral, that
are worth visiting while you are in the city.
112 tokens in 2.81s: 39.90 tokens/s
You: Sounds cool! What food should I try while I'm there?
Bot: While you are in Paris, you should definitely try some of the delicious
French food! Some must-see dining spots are:
- Le Relais de l'EntrecĂ´te: a classic French bistro where you can enjoy a delicious steak with a side of French fries.
- L'Ami Jean: a hip and trendy restaurant that serves up delicious French food with a modern twist.
- Le Bistrot Paul Bert: a charming bistro that serves up classic French dishes with a focus on fresh, local ingredients.
- L'Assiette: a dining room that serves up delicious French food in a beautiful dining room.
- L'Ami Louis: a classic French restaurant that has been serving up delicious food for over 100 years.
I hope you enjoy your trip to Paris, and that you have a delicious dining experience
while you are there!
174 tokens in 4.38s: 39.70 tokens/s
And so on. The script is here, and you can run it against any model that accepts the Llama 2 prompt format -- see the command I used to start it above. There's an optional --debug flag so that you can see everything sent to and from the model.
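For reference, the Llama 2 chat format wraps each user turn in [INST] ... [/INST] tags, with an optional <<SYS>> block for the system prompt. Here's a rough sketch of how such a prompt gets assembled -- my own illustration, not the code the script actually uses:

```python
def build_llama2_prompt(system_prompt, turns):
    # turns is a list of (user_message, model_reply) pairs; the reply is None
    # for the final turn, which the model is being asked to complete.
    prompt = ""
    for i, (user, reply) in enumerate(turns):
        prompt += "<s>[INST] "
        if i == 0 and system_prompt:
            prompt += f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        prompt += f"{user} [/INST]"
        if reply is not None:
            prompt += f" {reply} </s>"
    return prompt

print(build_llama2_prompt(
    "You are a helpful assistant.",
    [("Please tell me about Leonardo da Vinci.", None)],
))
```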
So, about this one: it can't count the "r"s in strawberry, but it's a working assistant bot! Success :-)
Let's dig into how it was fine-tuned.