How to mathematically quantify the quality of a corpus?

I am working on a text classification project. I have around 60,000 text samples spread across 40 intents. By counting the frequency of each intent I found that there is a class imbalance, but that is only a subjective judgement I made about my data. Apart from this, are there any mathematical approaches by which I can generate an overall report on the quality of my training data? I am mainly interested in quantifying:

  • Whether there is any class imbalance (see the sketch after this list for the kind of metric I have in mind).
  • Whether the data is ambiguous, i.e., similar samples overlapping multiple intents.
  • How many examples are potentially mislabeled.
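To make the first point concrete, this is the kind of summary I can already compute myself. It is only a minimal sketch, assuming `labels` is the list of intent labels (one string per sample); it reports the majority/minority imbalance ratio and the normalized entropy of the label distribution:

```python
from collections import Counter
import math

def imbalance_report(labels):
    """Summarize how balanced the intent distribution is."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)

    # Imbalance ratio: majority-class count / minority-class count (1.0 = balanced).
    imbalance_ratio = max(counts.values()) / min(counts.values())

    # Normalized Shannon entropy of the label distribution
    # (1.0 = perfectly balanced, values near 0.0 = dominated by one intent).
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    normalized_entropy = entropy / math.log(k) if k > 1 else 1.0

    return {"intents": k, "samples": n,
            "imbalance_ratio": imbalance_ratio,
            "normalized_entropy": normalized_entropy}

# Toy example: 3 intents with very different frequencies.
print(imbalance_report(["greet"] * 500 + ["refund"] * 20 + ["cancel"] * 80))
```

Is the imbalance ratio or the normalized entropy an accepted way to report imbalance, or is there a more standard statistic?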

Feel free to suggest other measurements, since I am not sure which tests can be run on such data. Please also suggest libraries and APIs for computing them.
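For the second and third points (ambiguity and possible mislabeling), the closest I have come up with is a nearest-neighbour check: embed the samples and measure how often a sample's neighbours carry a different intent. This is only a sketch under my own assumptions (TF-IDF features, cosine k-NN, and hypothetical `texts`/`labels` lists), and I would like to know whether there are more principled metrics or dedicated libraries for this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def neighbor_disagreement(texts, labels, k=10):
    """For each sample, return the fraction of its k nearest neighbours
    (by cosine similarity of TF-IDF vectors) that have a different intent."""
    labels = np.asarray(labels)
    X = TfidfVectorizer(min_df=2).fit_transform(texts)

    # k + 1 neighbours because the closest neighbour of a sample is itself.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
    _, idx = nn.kneighbors(X)

    # High disagreement flags ambiguous regions or potentially mislabeled samples.
    return (labels[idx[:, 1:]] != labels[:, None]).mean(axis=1)

# Usage (texts and labels are my 60k samples and their intents):
# scores = neighbor_disagreement(texts, labels)
# suspicious = np.argsort(scores)[::-1][:50]   # samples to review manually
```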

Topic data-quality data-visualization data-cleaning

Category Data Science
