Approach to text mining and token preparation: irrelevant words, low accuracy

For a fairly large project I am doing text mining on some documents. My steps are quite common (sketched in code below the list):

  1. All to lower case
  2. Tokenization
  3. Stop-word removal (using a stop list)
  4. Lemmatization
  5. Stemming
  6. Other steps, such as removing symbols
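
A minimal sketch of steps 1-6, assuming Python with NLTK (the post names no language or library, so both are assumptions):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model, stop-word list and WordNet data.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()
STEMMER = PorterStemmer()

def preprocess(text):
    text = text.lower()                                  # 1. all to lower case
    text = re.sub(r'[^a-z\s]', ' ', text)                # 6. strip symbols/digits (done up front here)
    tokens = word_tokenize(text)                         # 2. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop-word removal
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]   # 4. lemmatization
    return [STEMMER.stem(t) for t in tokens]             # 5. stemming
```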

Then I prepare a bag of words, build a document-term matrix (DTM) and classify into 3 classes with SVM and Naive Bayes, as sketched below.
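
The modeling step could then look like this, again assuming scikit-learn; the corpus and labels are toy placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [                     # toy corpus: preprocessed tokens joined back into strings
    "invoice payment due", "payment received thanks",
    "meeting scheduled monday", "meeting moved tuesday",
    "contract signed today", "contract terms updated",
]
labels = [0, 0, 1, 1, 2, 2]  # the three classes

X = CountVectorizer().fit_transform(docs)  # the document-term matrix

for clf in (MultinomialNB(), LinearSVC()):
    scores = cross_val_score(clf, X, labels, cv=2)
    print(type(clf).__name__, scores.mean())
```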

But the accuracy I get is not very high (50-60%). I think that may be because, after all the steps, the array of words still contains many irrelevant terms, such as first and last names from the documents. What is the usual approach in such a case? What can be done during preprocessing to make the classifiers work better and reach higher accuracy?

I was thinking about preparing a dictionary consisting of all the words relevant to my domain, but that could be too hard, and some important words would surely be missed.
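
For what it's worth, that dictionary idea would amount to a whitelist filter like the following sketch, where DOMAIN_VOCAB is a hypothetical, hand-built set:

```python
DOMAIN_VOCAB = {"contract", "invoice", "payment"}  # illustrative entries only

def keep_relevant(tokens):
    # Keep only tokens that appear in the curated domain dictionary.
    return [t for t in tokens if t in DOMAIN_VOCAB]

print(keep_relevant(["john", "signed", "the", "contract"]))  # -> ['contract']
```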

Any advice on what could be done here?

Topic: classifier, naive-bayes-classifier, text-mining, classification



The question is very broad, so the short answer is: it depends on the specifics of your data and the problem you're trying to solve. The first step would be to analyze what's happening:

  • Is it actually possible to solve the problem from the information in the data? Would a human expert do better than 50-60% accuracy given only that information? If so, which clues would they use, and are those clues directly available as features to the learning algorithm?
  • The ratio between the amount of data and the number of features: if there are not enough instances, the algorithm doesn't have enough to generalize from. If there are too many features, it is likely to overfit (i.e. treat things which happen by chance in the data as significant patterns). And of course, if there are not enough features, the algorithm lacks the indications necessary to make accurate predictions. (A quick check is sketched after this list.)
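
The second point is easy to check. A quick sketch, assuming a scikit-learn document-term matrix like the one in the question (toy corpus shown):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["toy document one", "toy document two", "a third toy text"]  # your corpus here
X = CountVectorizer().fit_transform(docs)

n_docs, n_features = X.shape
print(f"{n_docs} instances x {n_features} features")
if n_features > n_docs:
    print("more features than instances: high risk of overfitting")
```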

Common things to try:

  • remove features (here, words) which appear very rarely: they cannot help and can cause overfitting.
  • if the problem seems to be that single words are not informative enough, try bigrams or even trigrams, but be careful to avoid overfitting: the number of features is likely to increase a lot.
  • use term weighting (e.g. TF-IDF) to help the learning algorithm focus on the relevant features (these three points are combined in the sketch after this list).
  • depending on the task, more sophisticated methods could be relevant: topic modeling, word-sense disambiguation, syntactic features...
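
The first three points can be combined in a single vectorizer; a sketch assuming scikit-learn (the toy corpus and parameter values are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the log",
]  # your preprocessed documents here

vectorizer = TfidfVectorizer(
    min_df=2,            # drop terms in fewer than 2 documents (raise this on real data)
    ngram_range=(1, 2),  # unigrams + bigrams; watch the feature count grow
    sublinear_tf=True,   # dampen raw term frequencies
)
X = vectorizer.fit_transform(docs)
print(len(vectorizer.vocabulary_), "features after pruning")
```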
