Classify documents using a set of known vocabularies
I have a collection of documents, and I want to identify which ones are about soccer. I do not want to label any documents manually, so the approach has to be unsupervised.
One approach I am considering is to search online for the most common words in soccer articles and build a vocabulary list from them (for example: score, shoot, World Cup, etc.). I would then use that list to classify the documents, for instance by labeling a document as being about soccer if it contains at least 30% of the words in the list.
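To make the idea concrete, here is a minimal sketch of what I have in mind. The vocabulary entries beyond score, shoot, and World Cup, the function names, the sample documents, and the 0.3 threshold are all placeholders I made up for illustration, not a tested setup.

```python
import re

# Hypothetical vocabulary; a real list would be collected from soccer articles.
SOCCER_VOCAB = {
    "score", "shoot", "world cup", "goal", "penalty",
    "referee", "striker", "offside", "goalkeeper", "league",
}

def vocab_coverage(text, vocab):
    """Return the fraction of vocabulary terms that occur in the text."""
    # Lowercase and keep only alphanumeric tokens, joined by single spaces,
    # so that multi-word terms such as "world cup" can be matched as phrases.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "
    # Padding with spaces forces whole-word matches ("goal" != "goalkeeper").
    hits = sum(1 for term in vocab if f" {term} " in padded)
    return hits / len(vocab)

def is_about_soccer(text, vocab=SOCCER_VOCAB, threshold=0.3):
    """Label a document as soccer if it covers at least `threshold` of the vocabulary."""
    return vocab_coverage(text, vocab) >= threshold

# Made-up example documents.
docs = [
    "The striker took a penalty and the goalkeeper could not stop the goal.",
    "The central bank raised interest rates again this quarter.",
]
labels = [is_about_soccer(d) for d in docs]
```

This only checks exact word forms, so "scored" or "shooting" would not count as matches for "score" or "shoot" unless the text is stemmed or lemmatized first.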
Is this a valid method, or are there better existing methods? I would really appreciate any help.
Topic text-mining topic-model nlp
Category Data Science