Hellinger Distance in Gensim

I have a set of documents, shown below, where each document has a set of words that represents its content.

Doc1: {fish, moose, wildlife, hunting, bears, polar}
Doc2: {energy, fuel, costs, oil, gas}
Doc3: {wildlife, hunt, polar, fishing}

Looking at these documents, I can deduce that Doc1 and Doc3 are very similar.

I want a distance metric for bag-of-words representations. I followed some Gensim tutorials on how to do this. As I understand it, they first train a model and then use that model to calculate the Hellinger distance. In my case, I do not have any training data. How can I achieve this with no training data?



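The Jaccard similarity coefficient is one option for comparing bag-of-words representations of documents. It is the size of the intersection of two word sets divided by the size of their union, so it needs no training data.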

Here is the associated Python code:

doc_1 = {"fish", "moose", "wildlife", "hunting", "bears", "polar"}
doc_2 = {"energy", "fuel", "costs", "oil", "gas"}
doc_3 = {"wildlife", "hunt", "polar", "fishing"}

def jaccard(a: set, b: set) -> float:
    # Shared words divided by all distinct words across both documents.
    return len(a.intersection(b)) / len(a.union(b))

assert jaccard(doc_1, doc_2) == 0.00
assert jaccard(doc_2, doc_3) == 0.00
assert jaccard(doc_1, doc_3) == 0.25

According to the Jaccard similarity coefficient, Doc1 and Doc3 are the most similar pair (0.25), while Doc2 shares no words with either of the other documents.


As you said, the tutorials train a model before calculating the Hellinger distance. I am not sure which model you are referring to, but I assume you are thinking of a latent (topic) model. In my view, training a latent model is not mandatory to produce a document vector. You can create a document vector directly from the word-document matrix and calculate the Hellinger distance on that. One simple trick I have used is to build the doc-word matrix with the gensim utilities and take the bag-of-words representation of each document from it. Please review this and let me know what you think.
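The idea above can be sketched without any trained model. This is a minimal illustration in plain Python, not the gensim tutorial's method: the helper names `to_distribution` and `hellinger` are made up for this example, and it assumes each document's word counts are normalized to sum to 1, because the Hellinger distance is defined between probability distributions. gensim also provides `gensim.matutils.hellinger`, which should give the same kind of result on normalized vectors.

```python
from collections import Counter
from math import sqrt

doc_1 = ["fish", "moose", "wildlife", "hunting", "bears", "polar"]
doc_2 = ["energy", "fuel", "costs", "oil", "gas"]
doc_3 = ["wildlife", "hunt", "polar", "fishing"]

# Shared vocabulary, so every document maps to a vector of the same length.
vocabulary = sorted(set(doc_1) | set(doc_2) | set(doc_3))

def to_distribution(tokens, vocabulary):
    """Bag-of-words counts normalized to sum to 1 (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [counts[word] / total for word in vocabulary]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)

p1, p2, p3 = (to_distribution(d, vocabulary) for d in (doc_1, doc_2, doc_3))

d12 = hellinger(p1, p2)  # no shared words, so the distance is maximal
d13 = hellinger(p1, p3)  # two shared words, so the distance is smaller
d23 = hellinger(p2, p3)  # no shared words

print(f"Doc1-Doc2: {d12:.3f}")
print(f"Doc1-Doc3: {d13:.3f}")
print(f"Doc2-Doc3: {d23:.3f}")
```

As with the Jaccard coefficient, only exact word matches count here: "hunt" and "hunting" are treated as different words, so Doc1 and Doc3 overlap only on "wildlife" and "polar".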
