How to mathematically quantify the quality of a corpus?
I am working on a text classification project. I have around 60,000 text samples covering 40 intents. By calculating the frequency of each intent, I found that there is a class imbalance. But that is just a subjective judgment I made about my data. Apart from this, are there any mathematical approaches with which I can generate an overall report on the quality of my training data? I am mainly interested in finding out (mathematically):
- Whether there is any class imbalance.
- Whether the data is ambiguous, i.e., similar samples overlapping multiple intents.
- How many examples are mislabeled, etc.
 
Feel free to add more measurements, because I am not sure which tests can be run for this. Also, please suggest libraries and APIs to compute them.
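To make the request concrete, here is a minimal sketch of the kind of report I have in mind, covering only the first two items. It assumes the data is available as two parallel lists, `texts` and `labels`. The function names `imbalance_report` and `conflicting_duplicates` are hypothetical, the toy data is made up, and exact matching after lowercasing and whitespace cleanup is only a crude stand-in for real text similarity.

```python
import math
from collections import Counter, defaultdict


def imbalance_report(labels):
    """Summarise how unevenly samples are spread across classes."""
    counts = Counter(labels)
    n = sum(counts.values())
    k = len(counts)
    probs = [c / n for c in counts.values()]
    # Shannon entropy of the empirical label distribution.
    entropy = -sum(p * math.log(p) for p in probs)
    # Normalised entropy: 1.0 means perfectly balanced classes,
    # values near 0 mean a single class dominates.
    balance = entropy / math.log(k) if k > 1 else 1.0
    return {
        "num_classes": k,
        "largest_to_smallest_ratio": max(counts.values()) / min(counts.values()),
        "normalized_entropy": balance,
    }


def conflicting_duplicates(texts, labels):
    """Return normalised texts that appear under more than one label."""
    labels_by_text = defaultdict(set)
    for text, label in zip(texts, labels):
        # Assumption: lowercasing and collapsing whitespace is enough
        # to treat two samples as "the same" text.
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    texts = [
        "Book a flight",
        "book a  flight",
        "Flight to Paris please",
        "I need a plane ticket",
        "Cancel my order",
    ]
    labels = ["travel", "booking", "travel", "travel", "orders"]
    print(imbalance_report(labels))
    print(conflicting_duplicates(texts, labels))
```

Texts returned by `conflicting_duplicates` are either ambiguous or mislabeled, so they are a starting point for the third item as well, but this only catches exact repeats and not paraphrases.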
Topic data-quality data visualization data-cleaning
Category Data Science