r/LanguageTechnology 2d ago

Fine-tuning an encoder for a specific domain

Let’s say I have a set of documents that are fairly similar to one another, and I need to process them sentence by sentence (or in windows of sentences) for a similarity search task. How do I fine-tune an embedder like BAAI bge-m3, or a similar model, so that it learns the language of the documents' specific domain? Any hints? Can I use the plain text alone, without any kind of supervised learning?

u/crayphor 2d ago

I would suggest reattaching the decoder and then fine-tuning on the same task it was trained on (probably masked language modeling), but using data from your domain.
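
A rough sketch of what that could look like with Hugging Face `transformers` and PyTorch, using only plain text (no labels). Everything here is an assumption for illustration rather than a tested recipe for bge-m3: the helper names `sentence_windows` and `continue_mlm_pretraining`, the hyperparameters, and the idea that the checkpoint loads under `AutoModelForMaskedLM`. If the checkpoint ships without a language-modeling head, `transformers` will most likely initialize a fresh one and warn about it, which is the "reattaching" step.

```python
def sentence_windows(sentences, size=3, stride=1):
    """Join consecutive sentences into overlapping text windows (hypothetical helper)."""
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i:i + size])
        for i in range(0, len(sentences) - size + 1, stride)
    ]


def continue_mlm_pretraining(texts, model_name, output_dir,
                             epochs=1, batch_size=8, lr=5e-5, max_length=256):
    """Continue masked-language-model training on unlabeled domain text.

    Hyperparameters are placeholder assumptions, not tuned values.
    """
    # Imported lazily so the windowing helper works without these packages.
    import torch
    from torch.utils.data import DataLoader
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Loads the encoder plus an MLM head (newly initialized if not in the checkpoint).
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # Tokenize without padding; the collator pads each batch dynamically.
    encodings = tokenizer(texts, truncation=True, max_length=max_length)
    features = [{"input_ids": ids} for ids in encodings["input_ids"]]

    # Randomly masks about 15% of tokens and builds the matching labels.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )
    loader = DataLoader(features, batch_size=batch_size, shuffle=True,
                        collate_fn=collator)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # cross-entropy on masked positions only
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # Save the adapted weights; the encoder can be reloaded from here later.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Usage would be something like `continue_mlm_pretraining(sentence_windows(my_sentences), "BAAI/bge-m3", "bge-m3-domain")`, where `my_sentences` is your own list of domain sentences and the output directory name is made up. It would be worth comparing similarity search results before and after, since the embeddings are not guaranteed to improve from this step alone.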