How to find the probability that a word belongs to a text dataset
I have two text datasets: one from patients who have a certain medical condition and another from random patients. I want to figure out which words are more likely to show up in the dataset with that medical condition. My original thought was to use a chi-squared test, but it seems I can't run the test for each word, since the tokens are the categories and a single word is just one value of the categorical variable. For example, if the word is "dog", what is the probability that it shows up in the dataset with the disease? Similarly for the word "drug": what would that probability be?
Would I use something like TF-IDF? I already have the frequencies of all of the tokens.
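To be concrete, here is roughly the kind of frequency data I have and the per-corpus relative frequencies I can compute from it (the counts below are made-up illustrative numbers, not real patient data):

```python
from collections import Counter

# Illustrative token counts per corpus (made-up numbers, not real data).
condition_counts = Counter({"drug": 40, "dog": 2, "pain": 30})
control_counts = Counter({"drug": 5, "dog": 8, "pain": 10})

condition_total = sum(condition_counts.values())
control_total = sum(control_counts.values())

# Relative frequency of each word within each corpus.
for word in ["dog", "drug"]:
    p_condition = condition_counts[word] / condition_total
    p_control = control_counts[word] / control_total
    print(f"{word}: condition={p_condition:.3f}, control={p_control:.3f}")
```

So the raw ingredients are there; my question is what test or weighting scheme to apply on top of them.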
Topics: corpus, text, feature-selection
Category: Data Science