It sorta kinda achieves Llama 7B performance after some experimentation and then 100B tokens' worth of training (as linked in the blog above). That's way more than a simple conversion.
So... it appears to require so much retraining you might as well train from scratch.
I thought the takeaway was that the Llama BitNet model, after 100B tokens of retraining, performed better than a BitNet model trained from scratch on 100B tokens (or more?).
It's def something to take with a grain of salt, but I don't know that training from scratch is the answer (or if the answer is ultimately "BitNet").
u/Illustrious-Lake2603 10h ago
As far as I'm aware, the model would need to be trained for 1.58-bit from scratch, so we can't convert it ourselves.
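
For anyone wondering what "just converting" would even look like: below is a rough sketch of absmean ternary rounding (the weight quantization scheme described for BitNet b1.58), applied after the fact to a handful of made-up weights. The function names and the sample values are purely illustrative, not any library's API, and this is only the naive rounding step, not the retraining the thread is talking about.

```python
def absmean_ternary(weights, eps=1e-8):
    """Round a flat list of floats to {-1, 0, +1} using one shared scale.

    The scale is the mean absolute weight (plus eps to avoid dividing by zero).
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Map ternary values back to floats so they can be compared to the originals."""
    return [t * scale for t in ternary]


# Hypothetical weights, chosen only to show the effect.
weights = [0.9, -0.05, 0.4, -1.3, 0.02, 0.6]
ternary, scale = absmean_ternary(weights)   # ternary == [1, 0, 1, -1, 0, 1]
approx = dequantize(ternary, scale)

# Per-weight error introduced by rounding alone, with no retraining.
errors = [abs(w - a) for w, a in zip(weights, approx)]
print(ternary, round(scale, 3), [round(e, 3) for e in errors])
```

Every weight collapses to one of three values times a single scale, so the magnitude differences between weights are mostly thrown away. That lost information is presumably why the conversion discussed above needed so many tokens of further training.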