Conversation
So they were not trained from scratch? Just a bunch of tokens run through the old Qwen 8B in a post-quantization scheme?
No, the models were trained in 1 bit, just like the BitNet models from Microsoft (except that MS BitNet uses ternary weights, so 1.58 bpw, while these are truly 1-bit). You don't get that kind of PPL with 1-bit post-training quantization. My guess is that this company is aiming to get funding so they can train a large model (100B+).
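As a quick back-of-the-envelope check on those figures (my own arithmetic, not taken from either project's docs): a ternary weight carries log2(3) bits of information, while a true binary weight carries exactly one.

```python
import math

# Ternary weights (BitNet b1.58 style): each weight is one of 3 values,
# so its information content is log2(3) bits.
ternary_bpw = math.log2(3)  # ~1.585, usually quoted as "1.58 bpw"

# True binary weights: one of 2 values, exactly 1 bit each.
binary_bpw = math.log2(2)   # 1.0

print(f"ternary: {ternary_bpw:.3f} bpw, binary: {binary_bpw:.0f} bpw")
```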
That's what I initially thought too, but then I saw people saying it was Qwen3-8B in the paper. The benchmaxxed scores have it performing like a 2B model. I could understand if they were doing a fresh train with just that architecture. If it is really a larger training run to get the model down to 1-bit weights, it's misrepresented.
Well, then I don't know. I didn't find details on how the models were trained (but I also didn't put real effort into searching). But for sure you cannot get to such PPL with 1.125 bpw post-training quantization. The architecture of the models is identical to the corresponding Qwen3 dense models.
Yeah, it seems there was some prior training / tuning. See an old blog post of theirs where they mentioned: "The 1-bit Bonsai 8B model is an 8-billion parameter Large Language Model where each parameter has 1-bit precision. It has been trained using Google v4 TPUs." Further evidence is that reasoning doesn't seem to work right with it yet. They don't go into much detail on how it was trained or compressed during conversion; just that it's some proprietary tech.
The options are: a model trained from scratch in 1-bit, like BitNet, or a model quantized with a high number of tokens to make it 1-bit. The latter has certainly been attempted before and leads to the mentioned performance, i.e., being like a 1-4B model.
I spotted the Bonsai models (actually, there was a post on HN and that's how I "spotted" them). These are true 1-bit models (1 bit per weight plus a 16-bit scale per 128 weights, so effectively 1.125 bpw).
Don't know to what extent they can be useful, but I'm sure at least some are curious to try them.
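The 1.125 bpw figure follows directly from the stated layout: each weight stores one bit plus its amortized share of a 16-bit scale shared by 128 weights. A quick check of the arithmetic:

```python
WEIGHT_BITS = 1    # one bit per weight
SCALE_BITS = 16    # one 16-bit scale per block
BLOCK_SIZE = 128   # number of weights sharing a scale

effective_bpw = WEIGHT_BITS + SCALE_BITS / BLOCK_SIZE
print(effective_bpw)  # 1.125
```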
So, here we go. This PR adds CPU-only support (`AVX2` and generic implementation). In case you are curious about performance, here is what I get on a Ryzen-3995WX for the 4B model:
The bit packing they have chosen is not optimal, but I did not want to do repacking and all that, so this is what it is for now. Still, performance is very decent.
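For intuition, here is a minimal sketch of decoding one such block, assuming a straightforward layout (the bit order, the bit-to-sign mapping, and the helper name are my assumptions; as noted above, the actual packing the PR handles differs):

```python
import numpy as np

BLOCK = 128  # weights per block, sharing one scale

def dequant_block(packed: np.ndarray, scale: float) -> np.ndarray:
    """Unpack 16 bytes (128 bits) into 128 weights of +/- scale.

    Illustrative only: assumes bit=1 -> +scale, bit=0 -> -scale,
    MSB-first bit order. The real on-disk format may differ.
    """
    assert packed.size * 8 == BLOCK
    bits = np.unpackbits(packed)           # 128 values in {0, 1}, MSB first
    signs = bits.astype(np.int8) * 2 - 1   # map {0, 1} -> {-1, +1}
    return signs * scale

# Example: 16 bytes of alternating bits, scale 0.05
packed = np.frombuffer(bytes([0b10101010] * 16), dtype=np.uint8)
w = dequant_block(packed, 0.05)
print(w[:4])  # alternating +0.05 / -0.05
```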
Here is the result of a `perplexity` run with `wiki.test.raw`: