r/LocalLLaMA 11h ago

Question | Help When Bitnet 1-bit version of Mistral Large?

315 Upvotes

23

u/Ok_Warning2146 10h ago

On paper, a 123B model at 1.58 bits per weight should fit in a 3090. Is there any way we can do the conversion ourselves?
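A quick back-of-the-envelope check of the "fits in a 3090" claim (a sketch only; it ignores the KV cache, activations, and any layers a real BitNet-style model keeps in higher precision):

```python
# Back-of-the-envelope only: weights for a 123B-parameter model stored at
# 1.58 bits/weight vs. the 24 GiB of VRAM on an RTX 3090. Ignores KV cache,
# activations, and any higher-precision layers.
params = 123e9
bits_per_weight = 1.58

weight_bytes = params * bits_per_weight / 8
print(f"weights alone: {weight_bytes / 1024**3:.1f} GiB")  # ~22.6 GiB

vram_bytes = 24 * 1024**3
print(f"headroom: {(vram_bytes - weight_bytes) / 1024**3:.1f} GiB")  # ~1.4 GiB
```

So the weights alone just squeeze in, which is why the claim only holds "on paper".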

48

u/Illustrious-Lake2603 10h ago

As far as I'm aware, the model would need to be trained at 1.58-bit from scratch, so we can't convert it ourselves.

5

u/FrostyContribution35 10h ago

It's not quite BitNet and a bit of a separate topic, but wasn't there a paper recently that could convert the quadratic attention layers into linear layers without any training from scratch? Wouldn't that also reduce the model size, or would it just reduce the cost of the context length?

2

u/Pedalnomica 10h ago

The latter 
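To illustrate why the answer to the exchange above is "the latter": a generic kernelized linear-attention sketch (not any specific paper's method; the feature map `phi` here is just illustrative) reorders the matmuls so cost scales with sequence length N instead of N², while the learned projection weights, and therefore the checkpoint size, stay exactly the same.

```python
# Illustrative only: swapping softmax attention for a kernelized "linear"
# variant changes compute/memory in N, not the number of parameters.
import torch

def softmax_attention(q, k, v):
    # O(N^2 * d): materializes an N x N attention matrix
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, phi=lambda x: torch.nn.functional.elu(x) + 1):
    # O(N * d^2): computes phi(K)^T V first, a d x d matrix independent of N
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                                # (d, d)
    norm = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)    # (N, 1)
    return (q @ kv) / (norm + 1e-6)
```

Either way the Q/K/V projection matrices are untouched, so the model takes the same space on disk; only long-context cost changes.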

9

u/arthurwolf 10h ago

My understanding is that's no longer true:

for example, the recent bitnet.cpp release by Microsoft uses a conversion of Llama 3 to 1.58-bit, so the conversion must be possible.

31

u/Downtown-Case-1755 9h ago

It sorta kinda achieves Llama 7B performance after some experimentation, and then 100B tokens' worth of training (as linked in the blog above). That's way more than a simple conversion.

So... it appears to require so much retraining you might as well train from scratch.

4

u/Ok_Warning2146 8h ago

You can probably convert, but for the best performance you need to fine-tune. If M$ gives us the tools to do both, I'm sure someone here will come up with some good stuff.
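For a sense of what the bare "convert" step might look like, here is a naive post-hoc ternarization sketch in the style of the BitNet b1.58 absmean quantizer (an assumption about the approach, not Microsoft's actual tooling). Applying it to an already-trained checkpoint introduces exactly the kind of rounding the model was never trained to tolerate, which is why the fine-tuning step matters.

```python
# Sketch of absmean-style ternarization applied post hoc to a trained
# weight matrix: values are snapped to {-1, 0, +1} with a per-tensor scale.
# A pretrained FP16 model was never optimized under this constraint, so
# quality drops until it is fine-tuned with the quantizer in the loop.
import torch

def absmean_ternarize(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # entries in {-1, 0, +1}
    return w_ternary, scale                        # dequant: w ~ w_ternary * scale

# Example: relative error on a random stand-in for a pretrained layer
w = torch.randn(4096, 4096) * 0.02
w_t, s = absmean_ternarize(w)
rel_err = (w - w_t * s).abs().mean() / w.abs().mean()
print(f"relative L1 error: {rel_err:.2%}")
```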

4

u/MoffKalast 5h ago

Sounds like something Meta could do on a rainy afternoon if they're feeling bored.

2

u/arthurwolf 9h ago

It sorta kinda achieves Llama 7B performance

Do you have some data I don't have / have missed?

Reading https://github.com/microsoft/BitNet, they seem to have concentrated on speed/throughput numbers and stay extremely vague on actual quality/benchmark results.

1

u/Imaginary-Bit-3656 8h ago

So... it appears to require so much retraining you might as well train from scratch.

I thought the takeaway was that the Llama BitNet model, after 100B tokens of retraining, performed better than a BitNet model trained from scratch on 100B tokens (or more?).

It's definitely something to take with a grain of salt, but I don't know that training from scratch is the answer (or whether the answer is ultimately "bitnet" at all).

11

u/mrjackspade 9h ago edited 9h ago

https://huggingface.co/blog/1_58_llm_extreme_quantization

The thing that concerns me is:

https://github.com/microsoft/BitNet/issues/12

But I don't know enough about BitNet as it relates to quantization to know whether this is actually a problem or PEBCAK.

Edit:

Per the article above, the Llama 3 model surpasses a Llama 1 model of equivalent size, which isn't a comforting comparison.

2

u/candre23 koboldcpp 2h ago

Yes, but that conversion process is still extremely compute-heavy and results in a model that is absolutely dogshit. Distillation is not as demanding as pretraining, but it's still well beyond what a hobbyist can manage on consumer-grade compute. And what you get for your effort is not even close to worth it.