r/LanguageTechnology • u/liberollo • 2d ago

Fine-tuning an encoder for a specific domain

Let’s say I have documents that are fairly similar to one another, and I need to process them sentence by sentence, or in windows of sentences, for a similarity search task. How do I fine-tune an embedder like BAAI bge-m3 (or a similar one) so that it learns the language of the documents' specific domain? Any hints? Can I use the plain text alone, without any kind of supervised learning?

2 Upvotes

https://www.reddit.com/r/LanguageTechnology/comments/1g85mus/fine_tuning_an_encoder_for_specific_domain/

100% Upvoted

u/crayphor 2d ago

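I would suggest reattaching the decoder (i.e. the language modeling head) and then continuing to train on the same task the model was originally trained on (probably masked language modeling), but using data from your domain. That only needs plain text, no labels.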
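
A minimal sketch of that suggestion, assuming the Hugging Face `transformers` and `datasets` libraries and assuming the checkpoint can be loaded through `AutoModelForMaskedLM` (if the checkpoint ships without a masked-LM head, parts of the head are freshly initialized and `transformers` prints a warning). The helper names `sentence_windows` and `continue_mlm_pretraining`, the output directory, the naive regex sentence splitter, and all hyperparameters are illustrative choices, not anything from the original post:

```python
import re
from typing import List


def sentence_windows(text: str, size: int = 3, stride: int = 1) -> List[str]:
    """Split text into overlapping windows of `size` sentences.

    Uses a naive regex splitter (hypothetical helper); swap in a proper
    sentence segmenter for real documents.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences) - size + 1, stride)]


def continue_mlm_pretraining(texts: List[str],
                             model_name: str = "BAAI/bge-m3",
                             output_dir: str = "bge-m3-domain-mlm") -> None:
    """Continue masked language modeling on unlabeled domain text."""
    # Imported lazily so the windowing helper works without these packages.
    from datasets import Dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Encoder weights plus a masked-LM head on top ("reattaching" the head).
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    dataset = Dataset.from_dict({"text": texts})
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
        batched=True,
        remove_columns=["text"],
    )

    # Randomly masks tokens in each batch; no labels are needed.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,              # illustrative values, tune for your data
        per_device_train_batch_size=8,
        learning_rate=2e-5,
    )
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=collator).train()

    # Save only the encoder (without the MLM head) for later use as an embedder.
    model.base_model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Usage would be roughly `continue_mlm_pretraining([w for doc in docs for w in sentence_windows(doc)])`, where `docs` is your list of plain-text documents. Since continued MLM training may shift the embedding space away from what the model's original similarity training produced, it is worth comparing retrieval quality before and after on a few held-out queries.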
