How to mathematically quantify the quality of a corpus?

Question


hafiz031

September 9, 2021, 03:46

I am working on a text classification project. I have around 60,000 text samples across 40 intents. By calculating the frequency of each intent, I found that the classes are imbalanced, but that was only a subjective judgment based on looking at the counts. Beyond this, are there any mathematical approaches I can use to generate an overall report on the quality of my training data? I am mainly interested in quantifying:

Whether there is any class imbalance.
Whether the data is ambiguous, i.e., similar samples overlapping multiple intents.
How many examples are mislabeled, and so on.
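To make the class-imbalance item concrete, here is a rough sketch of what I have in mind, using only the Python standard library. The function name imbalance_report and the toy labels are made up for illustration, and the two measures (normalized Shannon entropy of the label distribution, and the ratio of the largest to the smallest class) are just one possible choice, not an established standard for this.

```python
from collections import Counter
import math


def imbalance_report(labels):
    """Summarize how evenly the labels are spread across classes.

    normalized_entropy is 1.0 for perfectly balanced classes and moves
    toward 0.0 as a few classes dominate. imbalance_ratio is the size of
    the largest class divided by the size of the smallest one.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)

    # Shannon entropy of the empirical label distribution.
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)

    # Divide by the maximum possible entropy, log(num_classes).
    normalized = entropy / math.log(num_classes) if num_classes > 1 else 1.0

    ratio = max(counts.values()) / min(counts.values())
    return {
        "num_classes": num_classes,
        "normalized_entropy": normalized,
        "imbalance_ratio": ratio,
    }


# Hypothetical toy labels, not my real data.
labels = ["greet"] * 6 + ["bye"] * 3 + ["refund"] * 1
report = imbalance_report(labels)
print(report)
```

A single number like this would at least let me replace "it looks imbalanced" with a value I can track while the dataset changes.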

Feel free to add more measurements, because I am not sure which tests are possible here. Also, please suggest libraries and APIs for computing them.
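For the ambiguity and mislabeling items, the simplest check I can think of is to look for texts that are identical after light normalization but carry different intents. The sketch below assumes the data is a list of (text, label) pairs; find_label_conflicts and the sample data are hypothetical names and values. It only catches exact duplicates after lowercasing and whitespace collapsing, so it gives a lower bound on the problem rather than a full measure of overlap.

```python
from collections import defaultdict


def find_label_conflicts(samples):
    """Return normalized texts that appear with more than one label.

    Normalization here is only lowercasing and whitespace collapsing,
    so near-duplicates with different wording are not detected.
    """
    labels_by_text = defaultdict(set)
    for text, label in samples:
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical toy samples, not my real data.
samples = [
    ("Cancel my order", "cancel_order"),
    ("cancel  my order", "refund"),
    ("Where is my package?", "track_order"),
]
conflicts = find_label_conflicts(samples)
print(conflicts)
```

Each conflicting text is either a mislabeled example or a genuinely ambiguous utterance, so the count of such texts is something I could put in the report.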

Topic data-quality data visualization data-cleaning

Category Data Science
