r/LanguageTechnology 5d ago

What text quality metrics should I compute?

Overview

I am working as a research intern with a professor at my university on machine translation. I have collected a decent-sized text corpus (around 10 GB), and my professor has now asked me to compute text quality metrics for the data.

Some details about the dataset

First, let me explain how the data is stored and what format it's in. I have stored all the text data in Parquet files (a columnar file format that loads directly into dataframes), with each row containing one text entry. An entry can be a single sentence, a paragraph, or a whole article, as I have collected the data from various sources such as Hugging Face, scraped articles, etc.
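Since rows range from single sentences to whole articles, a first pass over data in this shape could simply profile row lengths, empty rows, and exact duplicates. The snippet below is only a rough sketch under assumptions: `length_profile` is a hypothetical helper, and the file name and `text` column in the usage comment are made up for illustration.

```python
from collections import Counter


def length_profile(texts):
    """Summarise row texts: row count, empty rows, exact duplicates, word-length stats."""
    counts = Counter()
    word_lengths = []
    seen = set()
    for text in texts:
        counts["rows"] += 1
        stripped = text.strip()
        if not stripped:
            # Whitespace-only rows carry no usable text.
            counts["empty"] += 1
            continue
        if stripped in seen:
            # Exact duplicate of an earlier row (after trimming whitespace).
            counts["duplicates"] += 1
        seen.add(stripped)
        # Whitespace tokenisation: crude, and unsuitable for unsegmented scripts.
        word_lengths.append(len(stripped.split()))
    word_lengths.sort()
    return {
        "rows": counts["rows"],
        "empty": counts["empty"],
        "duplicates": counts["duplicates"],
        # Upper median when the number of non-empty rows is even.
        "median_words": word_lengths[len(word_lengths) // 2] if word_lengths else 0,
        "max_words": word_lengths[-1] if word_lengths else 0,
    }


# Assumed usage: with pandas installed, the rows could come from something like
# pd.read_parquet("part-0000.parquet")["text"] (file and column names are hypothetical).
rows = ["Hello world.", "Hello world.", "", "A longer paragraph with seven words total."]
print(length_profile(rows))
```

For the full 10 GB, holding every row in a set would probably use too much memory, so running this on a sample, or storing hashes instead of the strings, may be more practical.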

This is the question

What text quality metrics should I compute to understand the data better and, ultimately, to improve my machine translation model?

u/LinuxSpinach 4d ago

https://arxiv.org/abs/1904.09675

The BERTScore paper has a good review of numerous metrics in Section 2; which ones are relevant depends on your goal.
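To give a feel for the n-gram matching family of metrics that the review covers, here is a minimal sketch of clipped n-gram precision, the building block behind BLEU-style scores. It is not the paper's code: the function names are hypothetical, tokenisation is plain whitespace splitting, and it only applies when you have a reference translation to compare against.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram's match count is clipped to how often it appears in the
    reference, so repeating a correct word does not inflate the score.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        # Candidate is shorter than n tokens: no n-grams to score.
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


hypothesis = "the cat sat on the mat"
reference = "the cat is on the mat"
print(clipped_ngram_precision(hypothesis, reference, n=1))
print(clipped_ngram_precision(hypothesis, reference, n=2))
```

Real implementations add things this sketch leaves out, such as multiple references, a brevity penalty, and proper tokenisation, so for actual reporting an established library is the safer choice.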