How to mathematically quantify the quality of a corpus?
I am working on a text classification project. I have around 60,000 text samples covering 40 intents. By calculating the frequency of each intent, I found that there is some class imbalance, but that is only a subjective judgment I made from looking at the counts. Beyond this, are there any mathematical approaches I can use to generate an overall report on the quality of my training data? I am mainly interested in quantifying:
- Whether there is any class imbalance.
- Whether the data is ambiguous, i.e., similar samples overlapping multiple intents.
- How many examples are mislabeled.
Feel free to add more measurements, because I am not sure which tests are possible here. Please also suggest libraries and APIs that can compute them.
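To make the class imbalance point concrete, here is a minimal sketch of the kind of number I have in mind, using only the Python standard library. It reports the normalized Shannon entropy of the label distribution (1.0 means perfectly balanced, values near 0 mean one intent dominates) and the ratio of the largest class to the smallest. The function name `imbalance_report` and the toy labels are hypothetical and only for illustration; they are not my real data.

```python
import math
from collections import Counter

def imbalance_report(labels):
    """Summarise class balance for a list of intent labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    k = len(counts)
    probs = [c / n for c in counts.values()]
    # Shannon entropy of the empirical label distribution
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by log(k) so the score lies in [0, 1]; 1.0 = perfectly balanced
    normalized_entropy = entropy / math.log(k) if k > 1 else 1.0
    # Size of the largest class relative to the smallest one
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {
        "n_classes": k,
        "normalized_entropy": normalized_entropy,
        "imbalance_ratio": imbalance_ratio,
    }

# Hypothetical toy labels standing in for the real intents
labels = ["greet"] * 50 + ["refund"] * 30 + ["cancel"] * 20
report = imbalance_report(labels)
print(report)
```

A single score like this is easier to track than a frequency table, but it says nothing about ambiguity or mislabeling, which is why I am asking about those separately.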
Topic data-quality data visualization data-cleaning
Category Data Science