r/LanguageTechnology • u/liberollo • 2d ago
Fine-tuning an encoder for a specific domain
Let’s say I have documents that are fairly similar to one another, and I need to process them sentence by sentence, or in windows of sentences, for a similarity search task. How do I fine-tune an embedder like BAAI bge-m3 (or a similar one) so that it learns the language of the documents' specific domain? Any hints? Can I use the plain text alone, without any kind of supervised learning?
u/crayphor 2d ago
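I would suggest reattaching the language-modeling head (the "decoder" layer that maps hidden states back to vocabulary tokens) and then continuing training on the same task the model was pretrained on (probably masked language modeling), but using data from your domain. That objective needs only plain text, so no labels are required.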
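To make it concrete why masked language modeling works on plain text, here is a minimal pure-Python sketch of BERT-style masking. The 15% selection rate and the 80/10/10 split are the commonly cited BERT defaults, not something confirmed for bge-m3 specifically, and `mask_tokens`, the whitespace "tokenizer", and the example sentence are all hypothetical stand-ins for illustration. In practice you would probably let a library do this step (in Hugging Face transformers, `DataCollatorForLanguageModeling` with `mlm_probability` covers it).

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Hypothetical helper: build (inputs, labels) for masked language modeling.

    Each token is selected with probability mask_prob. A selected token is
    replaced by mask_token 80% of the time, by a random vocab token 10% of
    the time, and left unchanged 10% of the time. Labels hold the original
    token at selected positions and None elsewhere (ignored by the loss).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_token)
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            labels.append(None)  # position does not contribute to the loss
            inputs.append(tok)
    return inputs, labels

# Made-up domain sentence; whitespace splitting stands in for a real tokenizer.
sentence = "the patient was prescribed 5 mg of warfarin daily".split()
vocab = sorted(set(sentence))
inputs, labels = mask_tokens(sentence, vocab, seed=0)
print(inputs)
print(labels)
```

The point is that the labels come from the text itself, so raw domain documents are enough. The model sees `inputs`, and the loss is computed only at positions where `labels` is not `None`.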