Classify documents using a set of known vocabularies

I have a bunch of documents and I want to identify which ones talk about soccer. I would like an unsupervised approach, since I do not want to label the documents manually.

One approach I am considering is to search online for the most popular words in soccer articles and build a vocabulary list from them (for example: score, shoot, World Cup, etc.). I would then use that list to classify the documents (for example, if a particular document contains 30% of the words in the list, then that document talks about soccer).
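The thresholding idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary `SOCCER_VOCAB`, the helper names `vocab_coverage` and `is_about_soccer`, and the example sentences are all hypothetical, and matching is exact (no stemming), so "goals" would not count as a hit for "goal".

```python
import re

# Hypothetical vocabulary; in practice it would be collected from soccer articles.
SOCCER_VOCAB = {
    "score", "shoot", "goal", "world cup", "penalty",
    "striker", "referee", "offside", "midfielder", "goalkeeper",
}


def vocab_coverage(text, vocab):
    """Return the fraction of vocabulary terms that occur in the text."""
    # Lowercase and keep only alphabetic tokens, joined by single spaces.
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    padded = f" {normalized} "
    # Padding with spaces makes multi-word terms such as "world cup" match
    # on whole-word boundaries only ("goal" does not match "goalkeeper").
    hits = sum(1 for term in vocab if f" {term} " in padded)
    return hits / len(vocab)


def is_about_soccer(text, vocab=SOCCER_VOCAB, threshold=0.3):
    """Flag a document when it covers at least `threshold` of the vocabulary."""
    return vocab_coverage(text, vocab) >= threshold


soccer_doc = ("The striker took the penalty and the goalkeeper could not "
              "stop the goal before the World Cup final.")
finance_doc = "The central bank raised interest rates to fight inflation."

print(vocab_coverage(soccer_doc, SOCCER_VOCAB), is_about_soccer(soccer_doc))
print(vocab_coverage(finance_doc, SOCCER_VOCAB), is_about_soccer(finance_doc))
```

The 30% threshold is taken from the question and is arbitrary; a short document may fall below it even when it is clearly about soccer, so the cut-off would need tuning on real documents.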

I am wondering whether this is a valid method or whether there are better existing methods. I would really appreciate any help.

Topic text-mining topic-model nlp

Category Data Science


First of all, you need a training set, which means manually annotating which documents are about soccer and which are not. Then you need to preprocess the corpus (remove numbers and stop words, apply stemming) and build a vocabulary. After that, choose an appropriate feature representation: each term is a feature, and you have to decide how to represent it, that is, what weight to assign to it. One option is the tf-idf representation. With the features in place, you can train a classifier.

*The only way to avoid manually labeling the texts is to find a corpus in the same language that is already labeled.
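The supervised pipeline described in this answer can be sketched with the Python standard library alone. Everything here is an assumption made for illustration: the four labeled sentences are toy data, the stop-word list is deliberately tiny, stemming is skipped, the idf is a smoothed variant, and the classifier is a simple nearest-centroid rule on cosine similarity, chosen only because the answer does not name a specific classifier.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens (drops numbers), remove stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def fit_idf(docs_tokens):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf(tokens, idf):
    """Unit-length tf-idf vector; terms outside the vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return normalize({t: c * idf[t] for t, c in tf.items()})


def train(labeled_docs):
    """Build the idf table and one unit-length centroid per label."""
    tokens = [tokenize(text) for text, _ in labeled_docs]
    idf = fit_idf(tokens)
    sums = {}
    for toks, (_, label) in zip(tokens, labeled_docs):
        acc = sums.setdefault(label, Counter())
        for term, weight in tfidf(toks, idf).items():
            acc[term] += weight
    return idf, {label: normalize(acc) for label, acc in sums.items()}


def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(tokenize(text), idf)
    return max(centroids,
               key=lambda label: sum(w * centroids[label].get(t, 0.0)
                                     for t, w in vec.items()))


# Toy, hand-labeled training set (hypothetical data).
train_docs = [
    ("The striker scored a late goal in the World Cup final", "soccer"),
    ("The goalkeeper saved a penalty and the referee ended the match", "soccer"),
    ("The central bank raised interest rates again", "other"),
    ("The new phone has a faster processor and a better camera", "other"),
]
idf, centroids = train(train_docs)
print(predict("The referee gave a penalty after the striker was fouled", idf, centroids))
print(predict("Interest rates and the bank", idf, centroids))
```

With a realistic corpus, the same steps would more likely be done with an established library and a stronger classifier; the point of the sketch is only to show how preprocessing, tf-idf weighting, and training fit together.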
