How to find the probability that a word belongs to a dataset of text

I have two text datasets: one from patients with a certain medical condition and one from random patients. I want to figure out which words are more likely to show up in the dataset with that medical condition. My original thought was to use a chi-squared test, but it seems I can't run the test for each word, since the tokens are the categories and a single word is just one value of the categorical variable. For example, if the word is "dog", what is the probability that it shows up in our dataset with the disease? Similarly for the word "drug", what would be the probability?

Would I use something like TF-IDF? I already have the frequencies for all of the tokens.

Topic: corpus, text, feature-selection

Category: Data Science


A chi-squared test makes sense, but it will only tell you whether the difference in frequency is significant or not; by itself it's not very informative about the size of the difference between the classes.
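For what it's worth, you *can* run a chi-squared test per word: build a 2x2 contingency table (class x word present/absent) over documents. A minimal sketch with made-up counts, using `scipy.stats.chi2_contingency`:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts (assumed numbers, not from the question):
n_condition, n_control = 500, 500             # total documents per class
word_in_condition, word_in_control = 120, 40  # documents containing the word

# 2x2 contingency table: rows = class, columns = word present / word absent
table = [
    [word_in_condition, n_condition - word_in_condition],
    [word_in_control,  n_control - word_in_control],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
```

With one test per word you would also want a multiple-testing correction (e.g. Bonferroni) across the vocabulary.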

The simple answer to your question is to calculate the conditional distributions for every word $w$ and class $c$. Using the notation $\#x$ for the frequency of $x$ (i.e. number of documents containing $x$):

$$p(w|c)=\frac{\#(w,c)}{\#c}$$

This represents how frequent $w$ is compared to other words within class $c$ (i.e. ignoring the other class).

$$p(c|w)=\frac{\#(w,c)}{\#w}$$

This represents how frequent class $c$ is compared to the other class when considering only word $w$ (i.e. ignoring other words).

The latter is the one that matters when comparing the two classes, but it is a fair comparison only if the two classes contain the same number of documents (otherwise, normalize the frequencies by class size first).
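The two conditional distributions above can be sketched as follows, assuming you already have per-class document frequencies (the counts and class sizes here are made up for illustration):

```python
from collections import Counter

# Hypothetical document frequencies per class (number of documents containing each word)
df_condition = Counter({"drug": 120, "dog": 15, "pain": 200})
df_control   = Counter({"drug": 40,  "dog": 18, "pain": 60})
n_condition, n_control = 500, 500  # total documents per class (equal here, so p(c|w) is a fair comparison)

def p_word_given_class(word, df, n_docs):
    """p(w|c): fraction of documents in class c that contain w."""
    return df[word] / n_docs

def p_class_given_word(word):
    """p(condition|w): among documents containing w, the fraction in the condition class."""
    total = df_condition[word] + df_control[word]
    return df_condition[word] / total if total else 0.0

for w in ["drug", "dog", "pain"]:
    print(w, p_word_given_class(w, df_condition, n_condition), round(p_class_given_word(w), 3))
```

Words with p(condition|w) far from 0.5 (given balanced classes) are the ones associated with one class or the other.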

A more advanced (but maybe less intuitive) method is to use a measure like Pointwise Mutual Information (PMI) between a word and a class:

$$\text{PMI}(w,c)=\log\frac{p(w,c)}{p(w)\,p(c)}$$

The PMI value is high if the word and the class are strongly associated, i.e. if they co-occur more often than they would if they were independent.
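A minimal PMI sketch over the same kind of hypothetical document counts (all numbers are assumptions for illustration), estimating the probabilities from document frequencies:

```python
import math

# Hypothetical document frequencies per class
df_condition = {"drug": 120, "dog": 15}
df_control   = {"drug": 40,  "dog": 18}
n_condition, n_control = 500, 500
n_total = n_condition + n_control

def pmi(word, df_in_class, n_class):
    """PMI(w, c) = log2( p(w, c) / (p(w) * p(c)) ), probabilities estimated over documents."""
    p_wc = df_in_class[word] / n_total                          # joint: docs in class c containing w
    p_w = (df_condition[word] + df_control[word]) / n_total     # marginal over both classes
    p_c = n_class / n_total
    return math.log2(p_wc / (p_w * p_c))

# "drug" occurs mostly in the condition class -> positive PMI with that class;
# "dog" is roughly balanced -> PMI near zero (here slightly negative)
print(round(pmi("drug", df_condition, n_condition), 3))
print(round(pmi("dog", df_condition, n_condition), 3))
```

Ranking the vocabulary by PMI with the condition class gives you a candidate list of characteristic words; rare words can get inflated PMI, so it's common to require a minimum frequency.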
