Can we use doc2vec to detect outlier documents?

J Cena

April 18, 2022 12:03

I have a set of documents and I want to identify and remove the outlier documents. I am just wondering if doc2vec can be used for this task.

Or are there any recently evolved, promising algorithms that I can use for this task?

EDIT

I am currently using a bag of words model to identify outliers.

Topic gensim word2vec outlier nlp data-mining

Category Data Science

Brian Spiering answered on July 19, 2021 15:48

One way to approach it:

Define a central tendency for the documents, i.e. a location in vector space such as the centroid of the document vectors.
Then, define a distance metric (e.g., cosine, Minkowski, or Mahalanobis).
Lastly, set a threshold on that distance; any document farther from the center than the threshold is treated as an outlier.
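
The three steps can be sketched in plain Python as follows. This is only an illustration under some assumptions: each document is already represented as a fixed-length numeric vector (from doc2vec, bag of words, or anything else), the center is the component-wise mean, the metric is cosine distance, and the threshold is the mean distance plus k standard deviations. The function names and the parameter k are made up for this example; none of these choices comes from the answer itself.

```python
import math
from statistics import mean, stdev


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as far from everything.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def find_outliers(vectors, k=2.0):
    """Return indices of vectors whose distance to the centroid exceeds
    mean(distances) + k * stdev(distances). Needs at least two vectors."""
    center = centroid(vectors)
    distances = [cosine_distance(v, center) for v in vectors]
    threshold = mean(distances) + k * stdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy 2-D "document vectors": six point roughly the same way, one does not.
doc_vectors = [
    [1.0, 0.1], [0.9, 0.0], [1.0, 0.0], [1.1, 0.1],
    [0.95, 0.05], [1.0, 0.05], [0.0, 1.0],
]
print(find_outliers(doc_vectors, k=1.5))
```

Note that the outliers themselves pull the centroid and inflate the standard deviation, so with few documents a large k may flag nothing. A more robust center (for example the component-wise median) or a percentile-based threshold may work better; that is a judgment call for your data.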
