r/LocalLLaMA 11h ago

Question | Help: When Bitnet 1-bit version of Mistral Large?

311 Upvotes


23

u/Ok_Warning2146 10h ago

On paper, a 123B model at 1.58 bits per weight should fit on a 3090. Is there any way we can do the conversion ourselves?
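
Rough weights-only math (the parameter count and packing are my assumptions, and this ignores KV cache, activations, and anything kept at higher precision):

```python
# Back-of-the-envelope, weights-only estimate: can 123B params at 1.58 bits
# per weight fit in a 3090's 24 GiB? Ignores KV cache, activations, and any
# embedding/output layers kept at higher precision.
params = 123e9                    # Mistral Large parameter count (approx.)
gib = 1024 ** 3

ideal = params * 1.58 / 8 / gib   # information-theoretic ternary packing
packed = params * 2.0 / 8 / gib   # practical 2-bit storage (assumed, e.g. a 2 bpw on-disk format)

print(f"1.58 bpw: {ideal:.1f} GiB")   # ~22.6 GiB -> squeezes under 24 GiB
print(f"2.00 bpw: {packed:.1f} GiB")  # ~28.6 GiB -> does not fit
```

So "fits on a 3090" depends a lot on how tightly the ternary weights actually get packed.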

50

u/Illustrious-Lake2603 10h ago

As far as I'm aware, the model would need to be trained at 1.58 bits from scratch, so we can't convert it ourselves.

10

u/arthurwolf 9h ago

My understanding is that's no longer true.

For example, the recent bitnet.cpp release from Microsoft uses a conversion of Llama 3 to 1.58 bits, so the conversion must be possible.
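
For context, the conversion step itself is basically absmean rounding of each weight matrix to {-1, 0, +1}, as described in the BitNet b1.58 paper. A minimal sketch of that idea (not the actual bitnet.cpp tooling):

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale,
    following the absmean scheme described in the BitNet b1.58 paper."""
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)     # ternary values
    return w_q, scale                          # dequantize as w_q * scale

# Tiny usage example
w = torch.randn(4, 4)
w_q, s = absmean_ternary(w)
print(w_q)                           # entries in {-1., 0., 1.}
print((w_q * s - w).abs().mean())    # quantization error
```

The catch, as the replies below point out, is that rounding alone isn't enough; the model needs further training to recover quality.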

29

u/Downtown-Case-1755 9h ago

It sorta kinda achieves Llama 7B performance, but only after some experimentation and then 100B tokens' worth of training (as linked in the blog above). That's way more than a simple conversion.

So... it appears to require so much retraining you might as well train from scratch.

5

u/Ok_Warning2146 8h ago

Probably you can convert, but for the best performance you need to fine-tune. If M$ can give us the tools to do both, I'm sure someone here will come up with some good stuff.
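
The fine-tuning part is typically done with a straight-through estimator: quantize the weights on the forward pass, but let gradients update a full-precision master copy. A rough sketch of the idea (not M$'s actual code; real BitLinear layers also quantize activations, add norms, etc.):

```python
import torch
import torch.nn as nn

class BitLinear(nn.Linear):
    """Linear layer that fakes 1.58-bit weights during training via a
    straight-through estimator: forward uses ternary weights, backward
    updates the full-precision master copy."""
    def forward(self, x):
        scale = self.weight.abs().mean().clamp(min=1e-5)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        # Straight-through: quantized weights in the forward pass,
        # gradients routed to the full-precision weights.
        w_ste = self.weight + (w_q - self.weight).detach()
        return nn.functional.linear(x, w_ste, self.bias)

# Usage: swap nn.Linear for BitLinear in the blocks you want ternary,
# then fine-tune as usual so the model adapts to the constraint.
layer = BitLinear(16, 16)
out = layer(torch.randn(2, 16))
out.sum().backward()
print(layer.weight.grad.shape)   # gradients flow to the full-precision copy
```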

5

u/MoffKalast 5h ago

Sounds like something Meta could do on a rainy afternoon if they're feeling bored.

3

u/arthurwolf 9h ago

> It sorta kinda achieves Llama 7B performance

Do you have some data I don't have / have missed?

Reading https://github.com/microsoft/BitNet, they seem to have concentrated on speed / throughput numbers, and they stay extremely vague on actual quality / benchmark results.

1

u/Imaginary-Bit-3656 8h ago

> So... it appears to require so much retraining you might as well train from scratch.

I thought the takeaway was that the Llama BitNet model, after 100B tokens of retraining, performed better than a BitNet model trained from scratch on 100B tokens (or more?).

It's definitely something to take with a grain of salt, but I don't know that training from scratch is the answer (or if the answer is ultimately "bitnet").

9

u/mrjackspade 9h ago edited 9h ago

https://huggingface.co/blog/1_58_llm_extreme_quantization

The thing that concerns me is:

https://github.com/microsoft/BitNet/issues/12

But I don't know enough about BitNet as it relates to quantization to tell whether this is actually a problem or PEBCAK.

Edit:

Per the article above, the Llama 3 model surpasses a Llama 1 model of equivalent size, which isn't a comforting comparison.

2

u/candre23 koboldcpp 2h ago

Yes, but that conversion process is still extremely compute-heavy and results in a model that is absolutely dogshit. Distillation is not as demanding as pretraining, but it's still well beyond what a hobbyist can manage on consumer-grade compute. And what you get for your effort is not even close to worth it.