Latent Semantic Indexing False Positive Detection

Question


December 14, 2015, 13:00

I am using the Gensim LsiModel. I have a set of documents and a fixed set of topics. Some of the documents are already categorized; others are not. The goal is to assign each uncategorized document its most relevant category. I am using a similarity search as described here:

http://radimrehurek.com/gensim/tut3.html

So, I am comparing each uncategorized document to the categorized corpus to find the most relevant category. I am seeing very good performance on documents that have an appropriate category. However, some documents will inevitably have no relevant category, e.g. the document is in Spanish, it is spam, or it just does not fit into any existing category. With this model every document gets categorized, and the best fit is simply the category with the highest similarity score. My question is: how can I determine when there is no relevant category? My assumption was that the similarity scores for such a document would all be low, but this is not always true. Any cutoff on the score also seems arbitrary. Are there better ways to decide that a particular document does not fit well into the existing categories?
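To make the "no relevant category" decision concrete, here is a minimal sketch of a reject rule layered on top of the similarity search. It is not part of Gensim: the function names `cosine` and `classify_with_reject` and the cutoffs `min_score` and `min_margin` are hypothetical, and the vectors stand in for documents already projected into LSI space. The rule rejects a document when its best score is low, or when the best and second-best categories are too close to call. The second check is meant to cover the case where scores are not uniformly low.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def classify_with_reject(query_vec, labeled_vecs, min_score=0.6, min_margin=0.1):
    """Return (label, score), or (None, score) when no category fits.

    labeled_vecs is a list of (vector, label) pairs for the categorized
    documents. min_score and min_margin are assumed cutoffs, not
    recommended values; tune them on held-out categorized documents.
    """
    # Best similarity per category: the closest categorized document wins.
    best = {}
    for vec, label in labeled_vecs:
        score = cosine(query_vec, vec)
        if label not in best or score > best[label]:
            best[label] = score

    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else -1.0

    # Reject if the best match is weak, or if two categories are
    # nearly tied so the winner is not clearly separated.
    if top_score < min_score or (top_score - runner_up) < min_margin:
        return None, top_score
    return top_label, top_score
```

One way to reduce the arbitrariness of the cutoffs is to calibrate them on the documents that are already categorized: hold each one out, score it against the rest, and pick thresholds that keep correctly categorized documents while rejecting known outliers such as spam.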

Topic lsi gensim

Category Data Science
