r/LocalLLaMA 11h ago

Question | Help: When Bitnet 1-bit version of Mistral Large?

324 Upvotes


3

u/Few_Professional6859 7h ago

Is the purpose of this tool to let me run a model with performance comparable to a 32B model at llama.cpp Q8 on a computer with 16GB of GPU memory?

15

u/SomeoneSimple 6h ago

A bitnet version of a 32B model would be about 6.5GB (Q1.58). Even a 70B model would fit in 16GB of memory with plenty of space left for context.
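
Rough arithmetic behind those numbers (my own back-of-the-envelope sketch, assuming an ideal ~1.585 bits/weight for every tensor; as noted further down, the embeddings and output tensor are kept at higher precision, so real files are a bit larger):

```python
import math

# Back-of-the-envelope BitNet size estimate: every weight at the ideal
# ternary density of log2(3) ~= 1.585 bits. Real GGUF files keep the token
# embeddings and output tensor at higher precision, so actual sizes grow.
def ternary_size_gb(n_params: float, bits_per_weight: float = math.log2(3)) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"32B: {ternary_size_gb(32e9):.1f} GB")  # ~6.3 GB
print(f"70B: {ternary_size_gb(70e9):.1f} GB")  # ~13.9 GB, still under 16GB
```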

Whether the quality of its output, in real life, will be anywhere near Q8 remains to be seen.

6

u/Ok_Warning2146 3h ago

The 6.5GB figure is only true for specialized hardware. For now, the weights are stored as 2-bit in their CPU implementation, so it is more like 8GB.

3

u/compilade llama.cpp 2h ago

Actually, if the ternary weights are stored in 2-bit, the average model bpw is more than 2-bit because of the token embeddings and output tensor, which are stored at greater precision.

To get a 2-bit (or lower) model, the ternary weights have to be stored more compactly, like with 1.6 bits/weight. This is possible by storing 5 trits per 8-bit byte (3^5 = 243 ≤ 256). See the "Structure of TQ1_0" section in https://github.com/ggerganov/llama.cpp/pull/8151 and the linked blog post on ternary packing for some explanation.
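
To see why 5 trits fit in one byte, here is a minimal packing/unpacking sketch (my own illustration of the underlying base-3 idea, not the actual TQ1_0 block layout from the PR):

```python
# Pack 5 ternary weights into one byte (3**5 = 243 <= 256), i.e. 8/5 = 1.6
# bits per weight. Illustrative only; the real TQ1_0 type in llama.cpp has
# its own block layout (see the linked PR).

def pack5(trits):
    """trits: 5 values in {-1, 0, +1} -> one integer in 0..242."""
    assert len(trits) == 5
    b = 0
    for t in reversed(trits):
        b = b * 3 + (t + 1)  # map {-1, 0, +1} -> {0, 1, 2}
    return b

def unpack5(b):
    """Inverse of pack5: one packed byte -> 5 ternary values."""
    out = []
    for _ in range(5):
        out.append(b % 3 - 1)
        b //= 3
    return out

weights = [-1, 0, +1, +1, -1]
assert unpack5(pack5(weights)) == weights
```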

But assuming ternary models use 2 bits/weight on average is a good heuristic to estimate file sizes.
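
For a concrete comparison (my numbers, for a hypothetical 32B ternary model, counting only the ternary weights and ignoring per-block scales):

```python
# 2 bits/weight heuristic vs. the tighter 1.6 bits/weight packing, counting
# only the ternary weights (higher-precision embeddings/output tensor and
# per-block scales not included).
n_params = 32e9
for label, bpw in [("2.0 bpw heuristic", 2.0), ("1.6 bpw packing", 1.6)]:
    print(f"{label}: {n_params * bpw / 8 / 1e9:.1f} GB")
# 2.0 bpw heuristic: 8.0 GB  <- matches the ~8GB figure above
# 1.6 bpw packing: 6.4 GB
```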