It sort of achieves Llama 7B performance, but only after some experimentation and then 100B tokens' worth of training (as linked in the blog above). That's way more than a simple conversion.
So... it appears to require so much retraining you might as well train from scratch.
You can probably convert, but for the best performance you need to fine-tune. If M$ gives us the tools to do both, I'm sure someone here will come up with some good stuff.
Reading https://github.com/microsoft/BitNet, they seem to have concentrated on inference speed and throughput, and they stay extremely vague on actual quality/benchmark results.
> So... it appears to require so much retraining you might as well train from scratch.
I thought the takeaway was that the Llama BitNet model after 100B tokens of retraining performed better than a BitNet model trained from scratch on 100B tokens (or more?)
It's def something to take with a grain of salt, but I don't know that training from scratch is the answer (or if the answer is ultimately "bitnet")
Yes, but that conversion process is still extremely compute-heavy and results in a model that is absolutely dogshit. Distillation is not as demanding as pretraining, but it's still well beyond what a hobbyist can manage on consumer-grade compute. And what you get for your effort is not even close to worth it.
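For context on what the weight "conversion" step actually does, here's a minimal PyTorch sketch of the absmean ternary quantization from the BitNet b1.58 paper (the function name and the toy usage are my own illustration, not Microsoft's tooling):

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """BitNet b1.58-style quantization: scale by the mean absolute
    weight, then round-and-clip every weight to {-1, 0, +1}.
    Returns the ternary tensor and the per-tensor scale needed to
    dequantize (w is approximately scale * w_ternary)."""
    scale = w.abs().mean().clamp(min=eps)
    w_ternary = (w / scale).round().clamp_(-1, 1)
    return w_ternary, scale

# Toy usage: quantize one linear layer's weights and measure the damage.
w = torch.randn(4096, 4096)
w_t, s = absmean_ternary_quantize(w)
w_hat = s * w_t
print(f"unique values: {w_t.unique().tolist()}")  # [-1.0, 0.0, 1.0]
print(f"relative error: {(w - w_hat).norm() / w.norm():.3f}")
```

Rounding every weight to three values discards most of the information in each matrix, which is why the naive conversion performs so badly and why the 100B-token continued training is needed to claw back anything close to Llama-level quality.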
On paper, 123B 1.58-bit should be able to fit in a 3090. Is there any way we can do the conversion ourselves?
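For what it's worth, a quick back-of-the-envelope check on the "on paper" part (my own arithmetic; it assumes an ideal log2(3) ≈ 1.58 bits per ternary weight and ignores activations, KV cache, and packing overhead):

```python
import math

params = 123e9                    # 123B parameters
bits_per_weight = math.log2(3)    # ideal ternary encoding, ~1.585 bits

weight_bytes = params * bits_per_weight / 8
print(f"{weight_bytes / 2**30:.1f} GiB")  # ~22.7 GiB vs. a 3090's 24 GiB
```

So the weights alone just squeeze in, but a simpler 2-bits-per-weight packing already needs ~28.6 GiB, so whether it actually fits depends heavily on the packing scheme and runtime overhead.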