Topic alignment / topic modelling

What is the most efficient method for detecting whether an article is mostly about a specific topic when there is not much data available for training? My task is to determine to what extent a document is about, e.g., the weather, holidays, or several other specific topics.

I was looking towards LDA and TF-IDF, but from what I understand these approaches are unsupervised and work well for clustering/grouping a large number of documents based on vocabulary frequency. They offer little control over which topics the algorithm focuses on, and in my case I also do not have a lot of data to train a model on. So I was thinking about generating lists of tokens characteristic of specific topics and then measuring the cosine similarity (using word2vec) between the vocabulary used in the document and each list of target tokens.
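A minimal sketch of that seed-list idea, assuming gensim with a pretrained embedding (the embedding name, topic names, and seed tokens below are illustrative assumptions, not part of any fixed recipe):

```python
# Sketch: score a document against hand-picked seed tokens per topic.
# The embedding name and seed lists are illustrative assumptions.
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # any pretrained KeyedVectors will do

seed_tokens = {
    "weather":  ["rain", "sunny", "forecast", "temperature", "storm"],
    "holidays": ["vacation", "beach", "travel", "hotel", "flight"],
}

def topic_scores(doc_tokens):
    """Mean cosine similarity between in-vocabulary document tokens
    and each topic's seed list."""
    doc_vocab = [t for t in set(doc_tokens) if t in kv]
    scores = {}
    for topic, seeds in seed_tokens.items():
        sims = [kv.similarity(d, s) for d in doc_vocab for s in seeds if s in kv]
        scores[topic] = float(np.mean(sims)) if sims else 0.0
    return scores

print(topic_scores("heavy rain and a storm are expected tomorrow".split()))
```

Note that the score here is a plain mean over all token pairs; the questions below pick up on whether that aggregation is sensible.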

My questions are:

1. Is this the right way forward, or are there better ways of achieving it?
2. How should the final score be calculated: is an average of the token similarities okay? I am afraid that if I create, say, 100 target tokens per topic, the similarities will somehow cancel out and yield similar scores for every topic (see the sketch after this list).
3. What I like about LDA is that it reports probability levels across multiple topics. Is there an algorithm similar to LDA where I could seed the topics rather than merely stipulate the number of clusters?
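To make the worry in question 2 concrete: with 100 seeds per topic, most document-seed pairs contribute near-zero similarities, so a plain mean tends to compress every topic's score toward the same middling value. One possible variant (the function name and top_k parameter are hypothetical, not an established API) keeps only each document token's strongest seed matches before averaging:

```python
# Illustrative variant for question 2: keep each document token's top_k
# strongest seed similarities so a few good matches are not washed out
# by dozens of irrelevant seed tokens. Names here are hypothetical.
import numpy as np

def topk_topic_score(doc_vecs, seed_vecs, top_k=5):
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    s = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    sims = d @ s.T                           # cosine similarities, doc x seed
    top = np.sort(sims, axis=1)[:, -top_k:]  # strongest top_k per doc token
    return float(top.mean())

# Toy demo with random vectors standing in for word2vec embeddings.
rng = np.random.default_rng(0)
doc = rng.normal(size=(8, 100))     # 8 document tokens
seeds = rng.normal(size=(100, 100)) # 100 seed tokens for one topic
print(topk_topic_score(doc, seeds))
```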

Thanks for reading this.

Tags: doc2vec, tfidf, word2vec, lda
