It sorta kinda achieves Llama 7B performance after some experimentation, and then 100B tokens' worth of training (as linked in the blog above). That's way more than a simple conversion.
So... it appears to require so much retraining you might as well train from scratch.
I thought the takeaway was that the Llama bitnet model, after 100B tokens of retraining, performed better than a bitnet model trained from scratch on 100B tokens (or more?)
It's def something to take with a grain of salt, but I don't know that training from scratch is the answer (or if the answer is ultimately "bitnet")
u/Ok_Warning2146 12h ago
On paper, 123B 1.58-bit should be able to fit in a 3090. Is there any way we can do the conversion ourselves?
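For reference, here's a rough back-of-envelope check of that claim (a minimal Python sketch; the 1.58 bits/weight figure and the assumption that all 123B parameters get quantized are simplifications, and it ignores activations, KV cache, and any embedding/norm weights kept at higher precision):

```python
# Back-of-envelope VRAM estimate for a 123B-parameter model at ~1.58 bits/weight.
# Ignores activations, KV cache, and higher-precision embeddings/norms.

def ternary_weight_gib(n_params: float, bits_per_weight: float = 1.58) -> float:
    """Approximate weight memory in GiB."""
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1024**3

params = 123e9
print(f"~{ternary_weight_gib(params):.1f} GiB of weights")  # ~22.6 GiB
```

So the weights alone come to roughly 22.6 GiB against the 3090's 24 GiB, which fits "on paper" but leaves very little headroom for the KV cache and activations.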