Approach to text mining and token preparation: irrelevant words, low accuracy
For a fairly large project, I am doing text mining on some documents. My steps are quite common:
- All to lower case
- Tokenization
- Stop list and stop words
- Lemmatization
- Stemming
- Some other steps, such as removing symbols.
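A minimal sketch of the steps above, assuming Python and only the standard library. The stop list `STOP_WORDS` and the suffix-stripping function `crude_stem` are illustrative stand-ins, not the actual stop list, lemmatizer, or stemmer used in the project.

```python
import re

# Hypothetical, tiny stop list; a real project would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "on"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenize, drop symbols and stop words, then stem."""
    text = text.lower()
    # Keeping only runs of letters tokenizes and removes symbols and digits.
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


print(preprocess("The classifiers are trained on 3 classes!"))
# ['classifier', 'train', 'class']
```

Note that nothing in such a pipeline removes proper names: a token like a surname passes every step unchanged.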
Then I prepare a bag of words, build a DTF, and classify the documents into 3 classes with SVM and Naive Bayes.
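To make the bag-of-words step concrete, here is a small sketch, assuming "DTF" refers to a table of per-document term frequencies. The function name `document_term_matrix` and the sample tokens are hypothetical; the SVM and Naive Bayes classifiers would then be trained on rows like these.

```python
from collections import Counter


def document_term_matrix(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical, already preprocessed documents.
docs = [["train", "class", "class"], ["john", "smith", "train"]]
vocab, dtm = document_term_matrix(docs)
print(vocab)  # ['class', 'john', 'smith', 'train']
print(dtm)    # [[2, 0, 0, 1], [0, 1, 1, 1]]
```

In this toy example the names "john" and "smith" each get their own column, which is how irrelevant tokens end up as features.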
But the accuracy I get is not very high (50-60%). I think this may be because, after all these steps, the array of words still contains many irrelevant words, such as first and last names from the documents. What is the usual approach in such a case? What can be done during preprocessing to help the classifiers reach higher accuracy?
I was thinking about preparing a dictionary consisting of all the words relevant to my area, but that could be too hard, and some important words would surely be missed.
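The dictionary idea would amount to something like the sketch below. The set `DOMAIN_TERMS` and the sample tokens are made up for illustration; the example also shows the drawback, since a relevant word that is missing from the dictionary is silently dropped.

```python
# Hypothetical hand-built dictionary of domain-relevant terms.
DOMAIN_TERMS = {"invoice", "payment", "contract"}


def keep_domain_terms(tokens, domain_terms):
    """Keep only tokens that appear in the domain dictionary."""
    return [t for t in tokens if t in domain_terms]


tokens = ["john", "smith", "invoice", "overdue"]
print(keep_domain_terms(tokens, DOMAIN_TERMS))
# ['invoice']  (the names are gone, but so is "overdue")
```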
Any advice on what could be done here?
Topic classifier naive-bayes-classifier text-mining classification
Category Data Science