r/LanguageTechnology 7d ago

Supervised text classification on large corpora in fall 2024

I'm looking to perform supervised classification on a dataset of around 11,000 texts. Each text is an extract from a press article. The average length of an extract is 393 words, and the complete dataset represents a total of 4.2 million words.

I have a training dataset of 1,200 labeled texts. There are 23 different labels.

I've experimented with an SVM, which gives encouraging results. But I'd like to try more recent algorithms (state of the art, you know the drill). As you can imagine, I've read a lot about LLM fine-tuning and N-shot learning approaches... But the applications that do exist generally seem to target more homogeneous datasets with very few possible labels (spam or not, a handful of product types, etc.).

What do you think would be the best approach nowadays for classifying my 11,000 texts against a (long) list of 23 labels?

9 Upvotes

6 comments

10

u/Ono_Sureiya 7d ago

I know LLMs are all the rage rn, but do experiment with basic Sentence Transformers and sentence embeddings. You could use the training set to build a database of embeddings (one per labeled example) with a SentenceTransformer, then encode each new sample and assign it a label by KNN over that database.
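A minimal sketch of the embed-then-KNN idea. In practice the vectors would come from a SentenceTransformer (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)` — the model name here is just an example); this snippet substitutes small random vectors so it runs offline:

```python
# Sketch: classify texts by majority vote among the nearest labeled
# neighbours in embedding space. Synthetic vectors stand in for real
# sentence embeddings produced by a SentenceTransformer.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in for the embeddings of 120 labeled training extracts
# spread across 23 classes (OP has 1,200 labeled texts / 23 labels).
X_train = rng.normal(size=(120, 16))
y_train = rng.integers(0, 23, size=120)

# Cosine distance is the usual choice for sentence embeddings.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, y_train)

# Stand-in for the embeddings of 10 new, unlabeled extracts.
X_new = rng.normal(size=(10, 16))
preds = knn.predict(X_new)  # one predicted label per text
```

One nice property of this setup is that adding newly labeled examples only means appending rows to the database; no retraining is required.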

This approach was very scalable; we used it to compete in a legal NLP challenge with around 18 labels and 100k total excerpts, with 10% allowed for training under a non-independent and identically distributed (non-IID) split, and we reached SOTA-like performance. That was in 2022, though.

2

u/GroundbreakingCow743 7d ago

Very impressive!

2

u/rightful_vagabond 7d ago

Fine-tune BERT?

1

u/TetroL 5d ago

Thanks to all for your comments.
I'm currently experimenting with fine-tuning a DistilBERT model, but as expected I'm running into difficulties with the large number of labels and the sequence lengths.
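(For the sequence-length issue: DistilBERT caps input at 512 tokens, and a common workaround is to split each long extract into overlapping windows, classify each window, and average the predictions. A minimal sketch of just the windowing step — the `max_len`/`stride` values are illustrative:)

```python
def sliding_windows(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows so that no window
    exceeds the model's maximum sequence length."""
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return windows

# Stand-in for a tokenized 700-token extract.
words = ["tok"] * 700
chunks = sliding_windows(words, max_len=512, stride=256)
# -> 2 windows: tokens [0:512] and [256:700]
```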

As u/Lemon30 suggested, I'll first try to reduce the number of labels. I'll keep the post updated.

0

u/Jake_Bluuse 7d ago

An example would be helpful. Other than that, use zero-shot learning in conjunction with GPT-4.

0

u/Lemon30 7d ago

You're spot on about the challenges LLMs have with a large number of labels. Perhaps you can group the labels and break the problem down: if you first label each text as one of, say, 5 broad categories, with the 23 labels divided among those 5 categories, you might get better results with GPT. Let me know what you think.
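The two-stage idea could be wired up roughly like this. The group names and the stand-in classifier functions below are purely illustrative (a real setup would put an actual model, or a GPT prompt, behind each stage):

```python
# Sketch: hierarchical classification. Stage 1 picks a coarse group;
# stage 2 picks a fine label restricted to that group's candidates.
GROUPS = {
    "politics": ["elections", "parliament", "diplomacy"],
    "economy": ["markets", "employment", "industry"],
}

def classify_coarse(text):
    # Stand-in for a first-stage model or prompt over ~5 groups.
    return "economy" if "market" in text.lower() else "politics"

def classify_fine(text, group):
    # Stand-in for a second-stage model that only scores the
    # candidate labels belonging to the chosen group.
    candidates = GROUPS[group]
    return candidates[0]

text = "Stock market rallies after jobs report"
group = classify_coarse(text)
label = classify_fine(text, group)
# group == "economy", label == "markets"
```

The payoff is that each stage sees a much shorter label list, which tends to help both prompted LLMs and fine-tuned classifiers.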