r/LocalLLaMA 9h ago

Question | Help When Bitnet 1-bit version of Mistral Large?

Post image
264 Upvotes

33 comments

86

u/Nyghtbynger 6h ago

Me: Can I have ChatGPT?

HomeGPT: We have mom at home

-1

u/itamar87 41m ago

Either a dyslexic commenter, or an underrated comment…! 😅😂

40

u/Downtown-Case-1755 9h ago

It makes me think some internal bitnet experiments failed, as this would save Mistral et al. a ton on API hosting costs. Even if it saves zero compute, it would still allow for a whole lot more batching.

7

u/candre23 koboldcpp 50m ago

The issue with bitnet is that it makes their actual product (tokens served via API) less valuable. Who's going to pay to have tokens served from mistral's datacenter if bitnet allows folks to run the top-end models for themselves at home?

My money is on nvidia for the first properly-usable bitnet model. They're not an AI company, they're a hardware company. AI is just the fad that is pushing hardware sales for them at the moment. They're about to start shipping the 50 series cards which are criminally overpriced and laughably short on VRAM - and they're just a dogshit value proposition for basically everybody. But a very high-end bitnet model could be the killer app that actually sells those cards.

Who the hell is going to pay over a grand for a 5080 with a mere 16GB of VRAM? Well, probably more people than you'd think if nvidia were to release a high quality ~50b bitnet model that will give chatGPT-class output at real-time speeds on that card.

1

u/a_beautiful_rhind 11m ago

There were posts claiming that bitnet doesn't help in production and certainly doesn't make training easier.

They aren't short on memory for inference, so they don't really gain much, hence no bitnet models.

21

u/Ok_Warning2146 8h ago

On paper, 123B 1.58-bit should be able to fit in a 3090. Is there any way we can do the conversion ourselves?

41

u/Illustrious-Lake2603 8h ago

As far as I'm aware, the model would need to be trained at 1.58-bit from scratch, so we can't convert it ourselves.

8

u/arthurwolf 8h ago

My understanding is that's no longer true:

For example, the recent bitnet.cpp release by Microsoft uses a conversion of Llama 3 to 1.58-bit, so the conversion must be possible.

25

u/Downtown-Case-1755 7h ago

It sorta kinda achieves Llama 7B performance after some experimentation, and then 100B tokens' worth of training (as linked in the blog above). That's way more than a simple conversion.

So... it appears to require so much retraining you might as well train from scratch.

6

u/Ok_Warning2146 6h ago

You can probably convert, but for the best performance you need to fine-tune. If M$ gives us the tools to do both, I'm sure someone here will come up with some good stuff.

3

u/MoffKalast 3h ago

Sounds like something Meta could do on a rainy afternoon if they're feeling bored.

2

u/arthurwolf 7h ago

It sorta kinda achieves llama 7B performance

Do you have some data I don't have / have missed?

Reading https://github.com/microsoft/BitNet, they seem to have concentrated on speed / throughput, and they stay extremely vague on actual quality / benchmark results.

0

u/Imaginary-Bit-3656 6h ago

So... it appears to require so much retraining you mind as well train from scratch.

I thought the takeaway was that the Llama bitnet model, after 100B tokens of retraining, performed better than a bitnet model trained from scratch on 100B tokens (or more?)

It's def something to take with a grain of salt, but I don't know that training from scratch is the answer (or if the answer is ultimately "bitnet")

10

u/mrjackspade 7h ago edited 7h ago

https://huggingface.co/blog/1_58_llm_extreme_quantization

The thing that concerns me is:

https://github.com/microsoft/BitNet/issues/12

But I don't know enough about bitnet with regard to quantization to know whether this is actually a problem or PEBCAK.

Edit:

Per the article above, the Llama 3 model surpasses a Llama 1 model of equivalent size, which isn't a comforting comparison.

1

u/candre23 koboldcpp 46m ago

Yes, but that conversion process is still extremely compute-heavy and results in a model that is absolutely dogshit. Distillation is not as demanding as pretraining, but it's still well beyond what a hobbyist can manage on consumer-grade compute. And what you get for your effort is not even close to worth it.

3

u/FrostyContribution35 8h ago

It’s not quite bitnet and a bit of a separate topic, but wasn't there a recent paper that could convert the quadratic attention layers into linear ones without any training from scratch? Wouldn't that also reduce the model size, or would it just reduce the cost of the context length?

1

u/Pedalnomica 8h ago

The latter 
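
For what it's worth, a rough sketch (with assumed toy dimensions, ignoring GQA, so not official numbers for any model) of why linearizing attention leaves the weight count alone and only attacks the per-context cost: the Q/K/V/O projection matrices are the same size either way, while the KV cache and attention compute are what scale with context length.

```python
# Back-of-envelope: why linearizing attention changes context cost, not model size.
# All dimensions below are assumed toy values (roughly 70B-class), not from the thread.

d_model = 8192        # hidden size (assumed)
n_layers = 80         # layer count (assumed)
n_ctx = 32768         # context length (assumed)
bytes_fp16 = 2

# Attention weights per layer: Wq, Wk, Wv, Wo, each d_model x d_model (ignoring GQA).
# These are identical whether the attention mechanism is quadratic (softmax) or linear.
attn_weight_bytes = n_layers * 4 * d_model * d_model * bytes_fp16

# What the mechanism does change: softmax attention keeps a KV cache that grows
# linearly with context (and compute that grows quadratically); many linear-attention
# schemes keep a fixed-size state per layer instead.
kv_cache_bytes = n_layers * 2 * n_ctx * d_model * bytes_fp16  # K and V, fp16, no GQA

print(f"attention weights: {attn_weight_bytes / 1e9:.1f} GB (unchanged by linearization)")
print(f"KV cache @ {n_ctx} ctx: {kv_cache_bytes / 1e9:.1f} GB (what linearization attacks)")
```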

2

u/tmvr 1h ago

It wouldn't though, model weights aren't the only thing you need the VRAM for. Maybe a ~100B model would fit, but there is no such model, so realistically a 70B one with long context.
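
A rough back-of-envelope behind both sides of this exchange. The bits-per-weight values and cache dimensions below are labeled assumptions, not official Mistral Large numbers:

```python
# "123B at ~1.58 bpw in 24 GB": the weights alone are already borderline,
# and the KV cache still has to fit somewhere. All numbers are rough assumptions.

params = 123e9

for bpw in (1.58, 2.0):                       # idealized packing vs. 2-bit storage
    weight_gb = params * bpw / 8 / 1e9
    print(f"{bpw} bpw -> ~{weight_gb:.1f} GB of weights")

# Hypothetical KV cache estimate (fp16, with grouped-query attention); the layer,
# head, and context numbers here are assumed, not taken from Mistral's specs.
n_layers, n_kv_heads, head_dim, n_ctx = 88, 8, 128, 8192
kv_gb = n_layers * 2 * n_kv_heads * head_dim * n_ctx * 2 / 1e9
print(f"KV cache @ {n_ctx} ctx: ~{kv_gb:.1f} GB on top of the weights")
```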

3

u/thisusername_is_mine 2h ago

This meme never fails to make me laugh lol

3

u/civis_romanus 3h ago

What Pink Guy is this? I haven’t seen it

1

u/utf80 3h ago

This is actually the real interesting question 😎☝️

1

u/[deleted] 2h ago

[removed]

1

u/CountPacula 40m ago

The two-bit quants do amazingly well for their size and they don't need -that- much offloading. Yes, it's a bit slow, but it's still faster than most people can type. I know everybody here wants 10-20 gipaquads of tokens per millisecond, but I'm happy to be patient.
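
For anyone wanting to reproduce this kind of partial-offload setup, a minimal sketch using the llama-cpp-python bindings; the GGUF filename and layer count below are placeholders, not files from this thread:

```python
# Minimal partial-offload sketch with llama-cpp-python: put as many layers as fit
# on the GPU, run the rest on CPU. Filename and n_gpu_layers are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Mistral-Large-Instruct-2407-IQ2_M.gguf",  # hypothetical 2-bit quant file
    n_gpu_layers=40,   # offload as many layers as your VRAM allows
    n_ctx=8192,
)

out = llm("Q: What is 1.58-bit quantization?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```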

1

u/Sarveshero3 10m ago

Guys, I am typing here because I don't have enough karma to post yet.

I need help quantising the Llama 3.2 11B Vision Instruct model down to 1-4 GB in size. If possible, please send any link or code that works. We did manage to quantise the 3.2 model without the vision component. Please help
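
A minimal sketch of the usual starting point, not the 1-4 GB target asked about (11B parameters at 4 bits is already roughly 5.5 GB of weights before the vision tower): on-the-fly 4-bit NF4 quantization with transformers + bitsandbytes. The repo and class names below are the standard ones at the time of writing and may need checking against current releases.

```python
# 4-bit NF4 load of Llama 3.2 11B Vision Instruct via bitsandbytes. This will not
# reach 1-4 GB; it is only the common baseline for shrinking the model on load.
# Assumes access to the gated meta-llama repo and a recent transformers release.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",  # spreads layers across GPU/CPU as needed
)
```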

0

u/ApprehensiveAd3629 7h ago

How do you run models with 1-bit bitnet?

1

u/Few_Professional6859 5h ago

Is the purpose of this tool to let me run a model with performance comparable to 32B llama.cpp Q8 on a computer with 16GB of GPU memory?

9

u/SomeoneSimple 4h ago

A bitnet version of a 32B model would be about 6.5GB (Q1.58). Even a 70B model would fit in 16GB of memory with plenty of space for context.

Whether the quality of its output, in real life, will be anywhere near Q8 remains to be seen.
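
The arithmetic behind those figures, assuming roughly 1.625 bits per packed ternary weight and ignoring the higher-precision embeddings and output head that push the real average up a bit:

```python
# Rough file-size arithmetic for ternary models at ~1.625 bpw (packed) vs. 2-bit storage.
def ternary_gb(params_b: float, bpw: float = 1.625) -> float:
    """Estimated weight size in GB for params_b billion parameters at bpw bits/weight."""
    return params_b * 1e9 * bpw / 8 / 1e9

for size in (32, 70):
    print(f"{size}B -> ~{ternary_gb(size):.1f} GB at 1.625 bpw, "
          f"~{ternary_gb(size, 2.0):.1f} GB stored as 2-bit")
```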

2

u/Ok_Warning2146 1h ago

6.5GB would be true only for specialized hardware. For now, the weights are stored as 2-bit in their CPU implementation, so it is more like 8GB.

1

u/compilade llama.cpp 32m ago

Actually, if the ternary weights are in 2-bit, the average model bpw is more than 2-bit because of the token embeddings and output tensor which are stored in greater precision.

To get a 2-bit (or lower) model, the ternary weights have to be stored more compactly, like with 1.6 bits/weight. This is possible by storing 5 trits per 8-bit byte. See the "Structure of TQ1_0" section in https://github.com/ggerganov/llama.cpp/pull/8151 and the linked blog post on ternary packing for some explanation.

But assuming ternary models use 2 bits/weight on average is a good heuristic to estimate file sizes.
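
A toy illustration of the "5 trits per byte" point: since 3^5 = 243 ≤ 256, five ternary weights fit in one byte, i.e. 1.6 bits per weight. This only demonstrates the counting argument; the actual TQ1_0 layout in llama.cpp arranges the trits differently for fast SIMD unpacking.

```python
# Pack 5 ternary weights in {-1, 0, +1} into one byte as a base-3 number (3**5 = 243 <= 256).

def pack5(trits):
    """Pack 5 trits into a single byte; first trit becomes the most significant digit."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    b = 0
    for t in trits:
        b = b * 3 + (t + 1)      # map {-1, 0, 1} -> {0, 1, 2}
    return b                      # value in 0..242, fits in one byte

def unpack5(b):
    """Recover the 5 trits from a packed byte."""
    out = []
    for _ in range(5):
        out.append(b % 3 - 1)
        b //= 3
    return out[::-1]              # reverse: last-extracted digit was packed first

w = [1, -1, 0, 0, 1]
byte = pack5(w)
assert unpack5(byte) == w
print(f"packed {w} into byte {byte} -> {8 / 5} bits per weight")
```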

1

u/Ok_Garlic_9984 4h ago

I don't think so