I'm trying to find similar research abstracts, so I'm using word embeddings to convert words into 1x768 vectors, turning each abstract into an embedding of shape (#ofwords, 768). Cosine similarity between two abstracts returns a matrix of shape (#ofwords1, #ofwords2), which I then sum up to get an overall score. What I'm wondering is whether summing up all the values in a cosine similarity matrix is really a good way to determine overall similarity between two different texts? Is there a …
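For reference, a minimal sketch (with random placeholder arrays standing in for the two abstracts' word embeddings) comparing the summed word-by-word similarity, a length-normalised version of it, and a single cosine between mean-pooled abstract vectors:

# emb_a and emb_b stand in for two abstracts of shape (#ofwords, 768)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(40, 768))   # placeholder word vectors for abstract A
emb_b = rng.normal(size=(55, 768))   # placeholder word vectors for abstract B

# Option 1: sum every entry of the (#ofwords1, #ofwords2) similarity matrix.
# The raw sum grows with document length, so some normalisation (e.g. the mean)
# is needed before comparing scores across different abstract pairs.
pairwise = cosine_similarity(emb_a, emb_b)
summed_score = pairwise.sum()
normalised_score = pairwise.mean()

# Option 2: mean-pool each abstract into a single 768-d vector and take one cosine.
pooled_score = cosine_similarity(emb_a.mean(axis=0, keepdims=True),
                                 emb_b.mean(axis=0, keepdims=True))[0, 0]
print(summed_score, normalised_score, pooled_score)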
I have a study where I want to find users similar to a set of users (SEED). My data is a pivot by customer; a sample of SEED looks like this (note I drop cust_id):

cust_id | spend_food | spend_nike | spend_harrods
1       | 145        | 45         | 32
2       | 85         | 89         | 0
4       | 23         | 67         | 1900
5       | 84         | 12         | 900

So to find users similar …
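One rough sketch of this kind of lookalike scoring (the candidate rows and the scaling step are my assumptions, not part of the question): standardise the spend columns so large-spend categories don't dominate, average the SEED rows into a centroid, and rank candidates by cosine similarity to it.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

seed = pd.DataFrame({'spend_food': [145, 85, 23, 84],
                     'spend_nike': [45, 89, 67, 12],
                     'spend_harrods': [32, 0, 1900, 900]})
candidates = pd.DataFrame({'spend_food': [100, 10],      # hypothetical non-SEED users
                           'spend_nike': [50, 5],
                           'spend_harrods': [20, 2000]})

scaler = StandardScaler().fit(pd.concat([seed, candidates]))
seed_centroid = scaler.transform(seed).mean(axis=0, keepdims=True)
scores = cosine_similarity(scaler.transform(candidates), seed_centroid).ravel()
print(scores)   # higher = more similar to the average SEED spend profile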
I build a count matrix with count_matrix = count.fit_transform(off_data3['bag_of_words']), and count_matrix.shape is (476147, 482824). Then cosine_sim = cosine_similarity(count_matrix, count_matrix) fails; I think the matrix size is too big, which causes this memory error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~/venv/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1034
   1035     K = safe_sparse_dot(X_normalized, Y_normalized.T,
-> 1036                         dense_output=dense_output)
   1037
   1038     return K

~/venv/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    135     """
    136     if sparse.issparse(a) or sparse.issparse(b):
--> 137         ret = a * b
    138     if dense_output and hasattr(ret, "toarray"):
    139 …
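A 476147 x 476147 dense result cannot fit in memory, so the usual workarounds avoid materialising it. A small sketch of two options (a tiny toy corpus stands in for off_data3['bag_of_words']):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

docs = ['action drama', 'drama romance', 'action thriller', 'romance comedy']
count_matrix = CountVectorizer().fit_transform(docs)      # stays sparse

# Option 1: only the top-k most similar documents per document, via cosine kNN
# on the sparse matrix, instead of the full pairwise matrix.
nn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(count_matrix)
distances, indices = nn.kneighbors(count_matrix)          # distance = 1 - similarity

# Option 2: stream the similarity matrix in row blocks, keeping only what's needed.
block = 2
for start in range(0, count_matrix.shape[0], block):
    sims = cosine_similarity(count_matrix[start:start + block], count_matrix)
    # ... extract and store the top scores for these rows here, rather than keeping `sims` ...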
I'm a machine learning beginner and I'm trying to use cosine similarity for fuzzy matching. In the following example I want to compare 'data_dirty' with 'data_clean'. When I vectorize my data, I don't really understand the purpose of fit_transform, and WHY 'dirty_idf_matrix' uses ONLY transform with the SAME vectorizer as 'clean_idf_matrix', which saved the values with fit, if I understood well.

Col_clean = 'fruits_normalized'
Col_dirty = 'fruits'
#read table
data_dirty={f'{Col_dirty}':['I am …
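A minimal illustration of the fit/transform split (the fruit strings and n-gram settings are just examples): the vectorizer is fit once on the clean column so both matrices share one vocabulary and one set of IDF weights, and the dirty column is only transformed with it.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

clean = ['apple', 'banana', 'orange']          # stands in for data_clean[Col_clean]
dirty = ['appel', 'bannana', 'orange juice']   # stands in for data_dirty[Col_dirty]

vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
clean_idf_matrix = vectorizer.fit_transform(clean)  # learns vocabulary + IDF, then encodes
dirty_idf_matrix = vectorizer.transform(dirty)      # reuses that vocabulary, no refitting

print(cosine_similarity(dirty_idf_matrix, clean_idf_matrix))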
I would like to have a distance measure that takes into account how spread out the vectors in a dataset are, to weight the absolute distance from one point to another. The Mahalanobis distance does exactly this, but it is a generalization of Euclidean distance, which is not particularly suitable for high-dimensional spaces (see for instance here). Do you know of any measure that is suitable in high-dimensional spaces while also taking into account the correlation between datapoints? Thank you! :)
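For concreteness, a small sketch of the Mahalanobis distance the question refers to, on synthetic data; the Ledoit-Wolf shrinkage covariance is my own choice here, shown only as one common way to keep the inverse covariance stable when the dimensionality is high.

import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                        # dataset whose spread should weight the distance
VI = np.linalg.inv(LedoitWolf().fit(X).covariance_)   # regularised inverse covariance

print(mahalanobis(X[0], X[1], VI))                    # distance between two points, covariance-weighted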
If I have 3 embeddings Anchor, Positive, Negative from a Siamese model trained with Euclidean distance as the distance metric for the triplet loss, can cosine similarity be used during inference? I have noticed that if I calculate the Euclidean distance between A, P, N from the model, the results seem fairly consistent, with matching images getting a smaller distance and non-matching images getting a bigger distance in most cases. When I use cosine similarity on the same embeddings, I am unable to differentiate, as the similarity values …
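A quick numeric check of the relationship involved (anchor/positive/negative are random stand-ins): for L2-normalised vectors, squared Euclidean distance and cosine similarity are monotonically related (||a - b||^2 = 2 - 2*cos), but for unnormalised embeddings trained with a Euclidean triplet loss the two can rank pairs differently, which would be consistent with cosine scores looking uninformative.

import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 128))     # placeholder anchor, positive, negative embeddings

for name, x in [('positive', p), ('negative', n)]:
    eucl = np.linalg.norm(a - x)
    cos = np.dot(a, x) / (np.linalg.norm(a) * np.linalg.norm(x))
    eucl_norm_sq = np.linalg.norm(l2_normalize(a) - l2_normalize(x)) ** 2
    print(name, eucl, cos, eucl_norm_sq, 2 - 2 * cos)   # the last two values coincide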
I am trying to understand Siamese networks. In these, a vector is calculated for an object (say an image) and a distance metric (say Manhattan) is applied to the two vectors produced by the neural network(s). The idea is applied mostly to images in the tutorials available on the internet. If I compare it with Gensim semantic similarity, there we also have vectors for two objects (words or sentences) and then compute a cosine similarity to calculate the difference. (remember example …
I have clustered vectors by cosine distance using the nltk clusterer. If I understand correctly, the Y axis for the elbow method with Euclidean distance would be the sum of every (squared) distance between the centroid of a cluster and the vectors that belong to that cluster. My question is: would it be the same for clusters built with cosine distance? EDIT: OK, so I tried the sum of squares with cosine distance and it seems to return the same values... here's my code: EDIT2: My …
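In case it helps, a rough sketch (random placeholder vectors; nltk's KMeansClusterer as in the question, with arbitrary repeats) of an elbow curve where the y-axis is the sum of squared cosine distances from each vector to its cluster mean:

import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance

vectors = np.random.default_rng(0).normal(size=(200, 50))   # placeholder data

def cosine_inertia(vectors, k):
    clusterer = KMeansClusterer(k, distance=cosine_distance, repeats=5,
                                avoid_empty_clusters=True)
    labels = clusterer.cluster(vectors, assign_clusters=True)
    means = clusterer.means()
    # sum of squared cosine distances to the assigned cluster mean
    return sum(cosine_distance(v, means[lab]) ** 2 for v, lab in zip(vectors, labels))

for k in range(2, 8):
    print(k, cosine_inertia(vectors, k))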
I am working on recommendation systems, where I need to match the similarity of 2 users. Now, I know that I can use a Tfidf vectorizer to calculate the cosine similarity score between them. But now suppose I have some features with different priorities. So, each feature has a different priority, and the one with the higher priority will be checked first. So, when I get cosine similarity based on that feature, …
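One way to fold per-feature priorities into a single score is a weighted cosine similarity; a minimal sketch (the feature values and weights below are made up for illustration), where multiplying each feature by the square root of its weight before taking the cosine is equivalent to using a weighted inner product:

import numpy as np

weights = np.array([3.0, 2.0, 1.0])          # priority per feature (hypothetical)
user_a = np.array([0.9, 0.1, 0.5])
user_b = np.array([0.8, 0.3, 0.1])

def weighted_cosine(u, v, w):
    u_w, v_w = u * np.sqrt(w), v * np.sqrt(w)
    return np.dot(u_w, v_w) / (np.linalg.norm(u_w) * np.linalg.norm(v_w))

print(weighted_cosine(user_a, user_b, weights))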
To measure the similarity between two documents, one can use, e.g., TF-IDF/cosine similarity. Suppose that after calculating the similarity scores of Doc A against a list of documents (Doc B, Doc C, ...), we got:

Document Pair     | Similarity Score
Doc A vs. Doc B   | 0.45
Doc A vs. Doc C   | 0.30
Doc A vs. ...     | ...

Of course, Doc B seems to be the closest one to Doc A in terms of similarity. But what if users, as humans, think Doc …
I want to tag a list of texts using predefined keywords, e.g. keyword1, keyword2, keyword3. I can easily achieve this using one-to-one mapping (if a keyword exists in the text, it is tagged as important). But that way I cannot tell which texts are more important than others (assuming texts that contain more than one keyword are more important). To achieve this I've decided to train a word2vec model, extract the vectors, and then calculate the cosine similarity between the keyword vector and the text vector …
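A toy sketch of that pipeline (the corpus and keywords below are made up): train word2vec on the texts, average word vectors into one vector per text and one for the keyword set, and rank texts by their cosine similarity to the keyword vector.

import numpy as np
from gensim.models import Word2Vec

texts = [['server', 'crashed', 'during', 'virtualization', 'upgrade'],
         ['lunch', 'menu', 'for', 'friday'],
         ['virtualization', 'and', 'cloud', 'infrastructure', 'migration']]
keywords = ['virtualization', 'cloud', 'infrastructure']

model = Word2Vec(sentences=texts, vector_size=50, min_count=1, epochs=50, seed=1)

def mean_vec(tokens):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

kw_vec = mean_vec(keywords)
for tokens in texts:
    t_vec = mean_vec(tokens)
    denom = np.linalg.norm(kw_vec) * np.linalg.norm(t_vec) + 1e-9
    print(' '.join(tokens), round(float(np.dot(kw_vec, t_vec) / denom), 3))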
I'm new to this, so please let me know if my logic of comparing cosine similarity and k-means is incorrect. I got a set of 4 clusters from k-means and now I'm interested in Cluster No. 1. For this cluster, I take the average of all values for each column and keep it aside. Now I have a test sample, for which I run the k-means prediction and get the output 1, meaning it belongs to Cluster No. 1 …
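A minimal sketch of that check on synthetic data (the feature matrix and fitted model stand in for the questioner's own): take the centroid of Cluster No. 1 and measure the test sample's cosine similarity to it. Note that k-means itself assigns points by Euclidean distance, so the cosine score is a related but not identical notion of closeness.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.default_rng(0).normal(size=(200, 6))        # placeholder feature matrix
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

centroid_1 = kmeans.cluster_centers_[1].reshape(1, -1)     # "average of all values" of cluster 1
test_sample = X[0].reshape(1, -1)

print(kmeans.predict(test_sample)[0])                      # cluster assignment (Euclidean)
print(cosine_similarity(test_sample, centroid_1)[0, 0])    # cosine similarity to that centroid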
I just started using word2vec and have no idea how to create vectors (using word2vec) for two lists, each containing a set of words and phrases, and then how to calculate the cosine similarity between these 2 lists. For example: list1 = ['blogs', 'vmware', 'server', 'virtual', 'oracle update', 'virtualization', 'application', 'infrastructure', 'management'] list2 = ['microsoft visual studio', 'desktop virtualization', 'microsoft exchange server', 'cloud computing', 'windows server 2008'] Any help would be appreciated.
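A sketch using pre-trained vectors (two short lists are not enough to train word2vec from scratch; the GloVe model name below is just one example of a gensim KeyedVectors object): each entry, including multi-word phrases, is tokenised and its word vectors averaged, giving a (len(list1), len(list2)) matrix of cosine similarities.

import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

model = api.load('glove-wiki-gigaword-100')     # example pre-trained word vectors

def phrase_vector(phrase):
    vecs = [model[w] for w in phrase.lower().split() if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

list1 = ['blogs', 'vmware', 'server', 'virtual', 'oracle update', 'virtualization',
         'application', 'infrastructure', 'management']
list2 = ['microsoft visual studio', 'desktop virtualization', 'microsoft exchange server',
         'cloud computing', 'windows server 2008']

m1 = np.vstack([phrase_vector(p) for p in list1])
m2 = np.vstack([phrase_vector(p) for p in list2])
sim = cosine_similarity(m1, m2)                  # sim[i, j]: list1[i] vs list2[j]
print(sim.round(2))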
Please comment on the following NLP interview coding task that I have prepared for candidates for the Data Science NLP position I am hiring for. The goal is to check the candidate's understanding of the fundamental role of vector representations of text in NLP, as well as the candidate's coding skills and their ability to optimize computations with the vectorization that Numpy provides. In particular I need your opinion on: Is the task clear? Is the task adequate for coding a rough solution …
Is the following statement true? https://stats.stackexchange.com/q/256778 "The value of cosine similarity between two terms is not in itself an indicator of whether they are similar or not." If yes, then how is the use of clustering algorithms like DBSCAN for word embeddings justified? From what I know, the DBSCAN algorithm only looks at a point's immediate neighbours for inclusion in a cluster, but that seems like the wrong way, since maybe we need to check every word against every other word and take the top-ranked words.
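For context, a short sketch of how DBSCAN is typically applied to word embeddings with a cosine metric (random placeholder vectors, arbitrary parameters): eps acts as a cosine-distance radius (1 - similarity), so the absolute similarity values do matter, through the choice of eps.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 300))          # placeholder word vectors

labels = DBSCAN(eps=0.4, min_samples=3, metric='cosine').fit_predict(embeddings)
print(set(labels))                                 # -1 marks noise points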
Cosine similarity vs. the Levenshtein distance: I wanted to know what the difference between them is and in which situations each works best. As per my understanding: cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. The Levenshtein distance is a string metric for …
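A side-by-side toy example (the two sentences are made up): Levenshtein counts character edits between the raw strings, while cosine similarity is taken here over simple bag-of-words counts, so the two measures react to different kinds of difference.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def levenshtein(a, b):
    # classic single-row dynamic-programming edit distance
    dp = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return int(dp[-1])

s1, s2 = 'the cat sat on the mat', 'the cat sat on a mat'
counts = CountVectorizer().fit_transform([s1, s2])
print(levenshtein(s1, s2))                          # number of character-level edits
print(cosine_similarity(counts)[0, 1])              # word-level (bag-of-words) similarity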
I've got a set of approx. 50,000 2-dimensional Euclidean vectors which are connected with 20 groups, i.e. each group has approx. 2,500 2-dimensional Euclidean vectors. My data includes the endpoint coordinates, i.e. $x_0, y_0, x_1, y_1$. Now I would like to cluster the vectors within these groups, probably using k-means/k-medoids clustering (or another clustering algorithm with a pre-defined number of clusters). What is also important: my main focus is on a vector's direction; length is the minor concern (but at best, still …
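One possible sketch for direction-first clustering (random placeholder segments; the length weight and cluster count are arbitrary choices): turn each segment into a unit direction vector so length drops out of the main signal, optionally append a down-weighted length feature, and run k-means on that representation.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
segments = rng.uniform(0, 100, size=(2500, 4))           # columns: x0, y0, x1, y1

d = segments[:, 2:4] - segments[:, 0:2]                   # direction vectors
lengths = np.linalg.norm(d, axis=1, keepdims=True)
unit_dirs = d / np.clip(lengths, 1e-9, None)              # direction only

length_weight = 0.1                                       # keep length as a minor feature
features = np.hstack([unit_dirs, length_weight * lengths / lengths.max()])

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))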
The problem is fake news detection, which is a text classification problem. The constraint is that we cannot use traditional machine learning or deep learning approaches. If we could use machine learning, we could easily solve this with Naive Bayes or Logistic Regression, etc., but we cannot. I want your suggestion: can this be done using cosine similarity, i.e. take the text, apply feature embedding techniques …
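A rough sketch of what that suggestion could look like (the reference snippets and the labelling rule are my assumptions, not part of the question): represent texts with TF-IDF, compare a new article against small reference sets of known fake and known real articles by cosine similarity, and pick the closer side. This is a nearest-reference heuristic rather than a trained classifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_fake = ['shocking miracle cure doctors hate', 'celebrity secretly an alien']   # hypothetical
known_real = ['parliament passes the annual budget', 'central bank holds interest rates']
new_article = 'new miracle cure discovered, doctors stunned'

vectorizer = TfidfVectorizer().fit(known_fake + known_real)
fake_score = cosine_similarity(vectorizer.transform([new_article]),
                               vectorizer.transform(known_fake)).max()
real_score = cosine_similarity(vectorizer.transform([new_article]),
                               vectorizer.transform(known_real)).max()
print('fake' if fake_score > real_score else 'real', fake_score, real_score)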
I am wondering how to implement k-means++ with cosine distance, according to the quote below (Wikipedia), which says that the distance needs to be squared. But squaring loses the sign (direction) of the distance, which in my understanding really matters: cos_dist(x, y) = -1 => (-1)^2 = 1. "Choose one center uniformly at random among the data points. For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen. Choose one new …"
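A sketch of k-means++ seeding with cosine distance (random placeholder data): if cosine distance is taken as 1 - cosine similarity, it lies in [0, 2] and never goes negative, so squaring D(x) only sharpens the preference for far-away points rather than flipping any signs.

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def kmeanspp_init(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]                          # first center uniformly at random
    for _ in range(k - 1):
        d = cosine_distances(X, np.vstack(centers)).min(axis=1)  # D(x): distance to nearest chosen center
        probs = d ** 2 / (d ** 2).sum()                          # sample proportionally to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.vstack(centers)

X = np.random.default_rng(1).normal(size=(300, 20))
print(kmeanspp_init(X, 4).shape)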
I have a list of documents and I am looking for a) duplicates; b) documents that are very similar. To do so, I proceed as follows: embed the documents using paraphrase-xlm-r-multilingual-v1, then calculate the cosine similarity between the vector embeddings (code below). All the cosine similarity values I get are between 0 and 1. Why is that? Shouldn't I also get negative cosine similarity values? The sentence embeddings have both positive and negative elements.

num_docs = np.array(sentence_embedding).shape[0]
cos_sim = np.zeros([num_docs, num_docs])
…
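A hedged sketch replacing the manual loop the snippet starts (a random placeholder stands in for sentence_embedding): sklearn computes the full pairwise matrix in one call, and with arbitrary vectors negative cosine values do appear, so values confined to [0, 1] say something about these particular embeddings of real documents rather than about the cosine formula itself.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sentence_embedding = np.random.default_rng(0).normal(size=(10, 768))  # placeholder embeddings
num_docs = np.array(sentence_embedding).shape[0]

cos_sim = cosine_similarity(sentence_embedding)        # shape (num_docs, num_docs), diagonal = 1
print(cos_sim.shape, cos_sim.min())                    # with random vectors, the minimum is typically < 0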