Using KL divergence to improve a BOW model

For a university project, I chose to do sentiment analysis on a Google Play store reviews dataset. I obtained decent results classifying the data using the bag of words (BOW) model and an ADALINE classifier.

I would like to improve my model by incorporating bigrams relevant to each topic (Negative or Positive) into my feature set. I found this paper, which uses KL divergence to measure the relevance of unigrams/bigrams to a topic.

The only problem is that I am having trouble understanding what C refers to in equation (2.2). Does it refer to the unique words associated with topic C, the set of documents on a topic, or the words in a document?

Topic: bag-of-words, ngrams, classification



Since the authors are academic researchers, they framed the problem in the most general way possible. The $C$ term could be any random variable being modeled; in this specific case, $C$ refers to the individual tokens (unigrams or bigrams).
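For reference, the generic definition of the KL divergence between two discrete distributions $P$ and $Q$ is shown below; equation (2.2) in the paper presumably specializes it to the distributions involving the tokens and topics in question:

$$\mathrm{KL}(P \parallel Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}$$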

I have found empirical improvement by including bigrams that rank highly as collocations, i.e., frequently co-occurring n-grams. By including common phrases, a model can better capture how language is used in that specific context. Finding collocations is relatively straightforward: count the occurrences of all n-grams, rank them, and keep only those above a frequency threshold (see the sketch below).
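Here is a minimal Python sketch of that frequency-based approach; the `reviews` list and the `min_count` threshold are illustrative placeholders, not from your dataset:

```python
from collections import Counter
from itertools import islice

def bigram_counts(tokens):
    # Count every adjacent pair of tokens (bigram) in a token list.
    return Counter(zip(tokens, islice(tokens, 1, None)))

# Hypothetical example: `reviews` is a list of pre-tokenized reviews.
reviews = [
    ["battery", "life", "is", "great"],
    ["battery", "life", "is", "terrible"],
    ["great", "app", "battery", "life", "is", "great"],
]

counts = Counter()
for tokens in reviews:
    counts.update(bigram_counts(tokens))

# Keep only bigrams that appear at least `min_count` times,
# ranked from most to least frequent.
min_count = 2
collocations = [(bg, c) for bg, c in counts.most_common() if c >= min_count]
print(collocations)
# e.g. [(('battery', 'life'), 3), (('life', 'is'), 3), (('is', 'great'), 2)]
```

The surviving bigrams can then be appended to the unigram vocabulary as extra BOW features.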

Those authors are looking for unique information, which is far more complex to model and often not necessary for model lift.
