How can I train a model to modify a vector by rewarding the model based on the modified vector's nearest neighbors?

I am experimenting with a document retrieval system in which I have documents represented as vectors. When queries come in, they are converted into vectors by the same method used for the documents. The query vector's k nearest neighbors are retrieved as the results. Each query has a known answer string. In order to improve performance, I am now looking to create a model that modifies the query vector. What I was looking to do was use a model …
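
A rough sketch of one way to set this up, assuming PyTorch and a REINFORCE-style policy gradient; every name here (QueryModifier, reward) is illustrative rather than from the question, and the reward function is a stub:

import torch
import torch.nn as nn

doc_vecs = torch.randn(1000, 64)  # stand-in for the document vector index

class QueryModifier(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, q):
        return q + self.net(q)  # residual modification of the query vector

model = QueryModifier(64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def reward(neighbor_ids):
    # Stub: in practice, score the retrieved documents against the known answer string.
    return torch.rand(())

q = torch.randn(64)
for step in range(100):
    mu = model(q)
    dist = torch.distributions.Normal(mu, 0.1)
    q_mod = dist.sample()                       # sampled so the policy has a log-prob
    topk = (doc_vecs @ q_mod).topk(5).indices   # k nearest neighbors by dot product
    loss = -reward(topk) * dist.log_prob(q_mod).sum()  # REINFORCE: reward-weighted log-prob
    opt.zero_grad(); loss.backward(); opt.step()
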
Category: Data Science

Getting 'ValueError: setting an array element with a sequence.' when attempting to fit mixed-type data

I have already seen this, this and this question, but none of the suggestions seemed to fix my problem (so I have reverted them). I have the following code:

nlp = spacy.load('en_core_web_sm')
parser = English()

class CleanTextTransformer(TransformerMixin):
    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

def cleanText(text):
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text

def tokenizeText(sample):
    tokens = parser(sample)
    lemmas = …
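
For reference, this particular ValueError is what NumPy raises when asked to pack ragged or mixed-type data into a rectangular numeric array; a minimal reproduction, unrelated to the specific pipeline above:

import numpy as np

# Rows of unequal length cannot form a rectangular float array:
rows = [[1.0, 2.0], [3.0]]
np.array(rows, dtype=float)  # ValueError: setting an array element with a sequence.
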
Category: Data Science

How to represent a document in test data with the Document-Term Matrix created from the training set?

I built a classifier of documents using the vector representation of each document in the training set (i.e., a row in the Document-Term Matrix). Now I need to test the model on the test data. But how can I represent a new document with the Document-Term Matrix, given that some of its terms might not appear in the training data?
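
A minimal sketch of the usual pattern, assuming a scikit-learn CountVectorizer-based DTM: fit the vectorizer on the training corpus only, then transform test documents against that fixed vocabulary, so unseen terms are simply dropped:

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed"]  # "meowed" is not in the training vocabulary

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)   # learns the vocabulary from the training set
X_test = vec.transform(test_docs)         # same columns; unknown terms are ignored

print(X_train.shape, X_test.shape)        # both have the same number of columns
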
Category: Data Science

How can we use the cosine similarity formula on a document feature vector without a direction?

In mathematics, a vector has both magnitude and direction. In data science, to identify document similarity we convert each document into a feature vector and then apply the cosine angle formula to the source and target documents' feature vectors. However, the cosine formula is applicable only to vectors, and a vector should have both magnitude and direction. For a document that is represented as a vector, where is the direction?
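
For illustration, the "direction" is simply the vector's orientation in the term space, i.e., which terms dominate it; a minimal NumPy sketch of the cosine computation:

import numpy as np

# Term counts over the vocabulary ["cat", "dog", "fish"]
a = np.array([2.0, 1.0, 0.0])  # document A
b = np.array([4.0, 2.0, 0.0])  # document B: same direction, larger magnitude

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # 1.0: identical direction, so maximal similarity despite different lengths
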
Category: Data Science

Combine multiple vector fields for approximate nearest neighbor search

I have multiple vector fields in one collection. My use case is to find similar sentences in similar contexts. The sentences and contexts are encoded as float vectors, so I have one vector for the sentence and another vector for the context (the surrounding text). I would like to take both vectors into consideration to find similar sentences. Unfortunately, most approximate nearest neighbor (ANN) search libraries only support searching over a single field. I have tried to use PostgreSQL with the cube extension …
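
One common workaround is to merge the two fields into a single vector before indexing; a minimal sketch assuming plain NumPy (no particular ANN library), with square-root weights so the combined inner product decomposes cleanly:

import numpy as np

def combine(sentence_vec, context_vec, w=0.5):
    # L2-normalize each field, then weight with sqrt so the inner product of two
    # combined vectors equals w * (sentence similarity) + (1 - w) * (context similarity).
    s = sentence_vec / np.linalg.norm(sentence_vec)
    c = context_vec / np.linalg.norm(context_vec)
    return np.concatenate([np.sqrt(w) * s, np.sqrt(1.0 - w) * c])

doc = combine(np.random.randn(384), np.random.randn(384))
query = combine(np.random.randn(384), np.random.randn(384))
print(doc @ query)  # a single score, usable by any one-field ANN index
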
Category: Data Science

Approximate maximum dot product between a vector and a set of vectors using only a single vector representation for the latter

If we have a vector $q$ and a set of vectors $D = \{d_1, d_2, ..., d_l\}$, is there a way to create functions $QF$ and $DF$ such that $QF(q)^T DF(D) \approx \max_i(q^T d_i)$? Use case: I want to build an information retrieval system in which documents are represented by an arbitrary but small ($<100$) number of vectors and the query is represented by a single vector. Ideally, I would like to sort the documents based on $\max_i(q^T d_i)$, but storing all …
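
As a point of reference, the simplest choice for $DF$, mean-pooling the document vectors, yields the average rather than the maximum dot product (and hence a lower bound on it); a small NumPy sketch of the gap:

import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
D = rng.standard_normal((16, 64))   # l = 16 document vectors

true_max = (D @ q).max()
mean_pooled = D.mean(axis=0) @ q    # equals the mean of the individual dot products

print(true_max, mean_pooled)        # mean-pooling systematically underestimates the max
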
Category: Data Science

Dummy vectors and performance measurement for vector-search face recognition

I have thousands of people's faces (from the celebrity dataset LFW), each person represented by a 512 × 1 vector. I stored them in a vector DB to build a face-search system using embedded features (MTCNN for face detection and ArcFace as the embedding model). Someone suggested that I add many vectors as "dummy faces" to the database under an unknown class (with the number of dummy vectors larger than the number of personal classes). It's still unclear to me why I need to add …
Category: Data Science

How to calculate similarity between users based on movie ratings

Hi, I am working on a movie recommendation system and I have to find the alikeness between the main user and other users. For example, the main user watched 3 specific movies and rated them 8, 5, 7. A user who happened to watch the same movies rated them 8, 2, 3; another user of the same kind rated those movies 7, 6, 6; and some other user only watched the first two movies and rated them 8, 5. Now the question …
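
A minimal sketch of one standard choice, cosine similarity restricted to commonly rated movies (Pearson correlation on the same restriction is the other usual option):

import numpy as np

main = {"A": 8, "B": 5, "C": 7}
other = {"A": 8, "B": 2, "C": 3}

common = sorted(set(main) & set(other))   # compare only co-rated movies
u = np.array([main[m] for m in common], dtype=float)
v = np.array([other[m] for m in common], dtype=float)

sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(sim)
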
Category: Data Science

How can the same cluster category be separated?

I have these 200 vectors, which were clustered using K-means based on the keyword-weight similarity given by TF-IDF (Term Frequency - Inverse Document Frequency). The vectors were clustered with respect to four cities: Amsterdam, Rotterdam, The Hague, and Utrecht. I chose k = 6 centroids, which means I have cluster 0 to cluster 5. For each cluster, I also calculated the average numerical keyword weight so that I …
Category: Data Science

Can I sum up feature vectors of a user's collection?

I want to find items that are similar to items users already have in their collection. Every item has attributes, so I created feature vectors where every element of the vector represents an attribute and is either $0$ or $1$ (depending on whether the item has that attribute). For the user's collection I summed up all the vectors, creating one vector which I then used to calculate similarities with other items. Is this a correct approach, or should I make this "user vector", …
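
A minimal sketch of the setup described, using cosine similarity so the summed "user vector" is compared by direction rather than by its raw magnitude:

import numpy as np

# Binary attribute vectors for three items in the user's collection
collection = np.array([[1, 0, 1, 0],
                       [1, 1, 0, 0],
                       [0, 0, 1, 1]])
user_vec = collection.sum(axis=0)   # [2, 1, 2, 1]: attribute frequencies

candidate = np.array([1, 0, 1, 1])  # an item not yet in the collection
cos = user_vec @ candidate / (np.linalg.norm(user_vec) * np.linalg.norm(candidate))
print(cos)                          # cosine ignores the magnitude the summing introduced
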
Category: Data Science

Word2Vec: Identifying many-to-one relationships between words

Standard introductory examples in Word2Vec, like king - queen = man - woman and tokyo - japan = london - uk, involve one-to-one relationships between words: Tokyo is the exclusive capital of Japan. More generally, we might want to test for many-to-one relationships: e.g. we might want to ask if Kyoto is a city in Japan. I presume we are still interested in vectors of the form kyoto - japan, houston - us, etc., but these vectors are no longer …
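
For reference, the standard one-to-one analogy query in gensim looks like the sketch below; "vectors.bin" is a hypothetical path to any pretrained word2vec-format file, and the final lines are only a crude probe of the many-to-one case, not a settled method:

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path; any pretrained word2vec-format vectors work here.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# One-to-one analogy: tokyo - japan + uk should land near london.
print(wv.most_similar(positive=["tokyo", "uk"], negative=["japan"], topn=5))

# A crude many-to-one probe: is kyoto - japan aligned with the city-of offset?
offset = wv["tokyo"] - wv["japan"]
probe = wv["kyoto"] - wv["japan"]
print(offset @ probe / (np.linalg.norm(offset) * np.linalg.norm(probe)))
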
Category: Data Science

Non-commutative distance formula

I am trying to find a distance formula or a method that gives a non-commutative distance between two points in a feature space. Suppose there are two movies represented in an $\mathbb{R}^n$ feature space. I want the distance/similarity between these movies, computed from the feature vectors, to differ depending on which movie is the reference point, i.e., Dist(Mov1, Mov2) != Dist(Mov2, Mov1). I know this is slightly vague, but I …
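
One classic asymmetric measure is the Kullback-Leibler divergence between (non-negative, normalized) feature vectors; a minimal sketch:

import numpy as np

def kl(p, q, eps=1e-12):
    # Normalize non-negative feature vectors into distributions, then compute
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i). Asymmetric by construction.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

mov1 = np.array([3.0, 1.0, 1.0])
mov2 = np.array([1.0, 2.0, 2.0])
print(kl(mov1, mov2), kl(mov2, mov1))  # the two orderings give different values
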
Category: Data Science

Is it acceptable to append information to word embeddings?

Let's say I have my 300-dimensional word embedding trained with Word2Vec, and it contains 10,000 word vectors. I have additional data on the 10,000 words in the form of a vector (10,000 × 1) containing values between 0 and 1. Can I simply append the vector to the word embedding so that I have a 301-dimensional embedding? I am looking to calculate similarities between word vectors using cosine similarity.
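
Mechanically this is just stacking a column, though the appended dimension's scale then controls how much it influences cosine similarity; a minimal NumPy sketch (the scaling factor alpha is an illustrative knob, not from the question):

import numpy as np

emb = np.random.randn(10_000, 300)        # word2vec vectors
extra = np.random.rand(10_000, 1)         # additional per-word values in [0, 1]

alpha = 1.0                               # weight of the appended feature
emb_301 = np.hstack([emb, alpha * extra]) # 10,000 x 301

# Cosine similarity on the augmented vectors:
a, b = emb_301[0], emb_301[1]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
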
Category: Data Science

How can I model the autocorrelation of the target variable when we can't observe any actual target values in the test phase?

I'm trying to model the relationship between a value declared by a subject and a stimulus. For example, modeling the relationship between a subject's happiness and the strength of a stimulus, so that we can predict the subject's happiness from stimuli. (The happiness ratings are on a five-point scale; the stimuli are continuous values.) Emotions like happiness are obviously autocorrelated, and I think modeling these autocorrelations might help the model make better predictions. However, we can only observe happiness (the actual value) in the training phase …
Category: Data Science

Stacking/Concatenating/Combining two vector space models

I have two vector-space models with different dimensionalities. The number of vectors in one model is the same as the number of vectors in the other, i.e., if I have a vector representation for a car in one model, I have a vector representation for a car in the other model, but the number of dimensions can differ. I want to combine these models (and then cluster using the combined model). I cannot average (BoW) or add these models together, as …
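
A minimal sketch of one common approach, L2-normalizing each model's vectors so neither dimensionality dominates and then concatenating before clustering (scikit-learn assumed):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

A = np.random.randn(500, 100)   # model 1: 500 items, 100 dims
B = np.random.randn(500, 300)   # model 2: the same 500 items, 300 dims

# Row-normalize each model, then stack the two side by side: 500 x 400
combined = np.hstack([normalize(A), normalize(B)])

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(combined)
print(np.bincount(labels))
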
Category: Data Science

Is it accurate to say that "K-means clusters the vectors based on keyword-weight similarity"?

Long story short, I have 200 vectors as a result of running TF-IDF (Term Frequency - Inverse Document Frequency) over thousands of keywords in hundreds of vectors. The total number of unique keywords I got is 745, meaning there are 745 dimensions/axes. Now, I was wondering how K-means clustering works on those 200 vectors. Is it accurate to say that K-means is clustering those 200 vectors by keyword-weight similarity?
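
For concreteness, the usual pipeline looks like the sketch below (scikit-learn assumed); K-means minimizes Euclidean distance between the TF-IDF weight vectors, which is one precise reading of "keyword-weight similarity":

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["city bike rental", "bike lanes in the city", "canal boat tour", "museum tour"]

X = TfidfVectorizer().fit_transform(docs)   # rows: docs, columns: keyword weights
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each vector is assigned to the centroid nearest to it in the keyword-weight space.
print(km.labels_)
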
Category: Data Science

What is the difference between the positional vector and the attention vector used in the transformer model?

What is the difference between the positional vector and the attention vector used in the transformer model? I saw a video on YouTube in which the definition of a positional vector was given as "a vector that gives context based on the position of a word in the sentence", and the definition of an attention vector was given as "for every word, we can generate an attention vector which captures the contextual relationships between words in the sentence". Capturing context information based on distance (positional vector) and attention (attention vector) …
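
As one concrete reference point, the sinusoidal positional encoding from the original transformer paper is a fixed function of position alone, computed before any attention happens; a minimal NumPy sketch:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Depends only on position, never on the words themselves; attention vectors,
# by contrast, are computed from the words via query/key/value projections.
print(positional_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
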
Category: Data Science

NN embedding layer

Several neural network libraries, such as TensorFlow and PyTorch, offer an Embedding layer. Having implemented word2vec in the past, I understand the reasoning behind wanting a lower-dimensional representation. However, it would seem the embedding layer is just a linear layer. All other things being equal, would an embedding layer not just learn the same weights as the equivalent linear layer? If so, then what are the advantages of using an embedding layer? In the case of word2vec, the lower …
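
For illustration, an embedding layer is mathematically a linear layer applied to one-hot inputs, implemented as a table lookup instead of a matrix multiply; a minimal PyTorch sketch of the equivalence:

import torch
import torch.nn.functional as F

vocab, dim = 10, 4
emb = torch.nn.Embedding(vocab, dim)
ids = torch.tensor([3, 7])

lookup = emb(ids)                                     # index into the weight table
one_hot = F.one_hot(ids, vocab).float() @ emb.weight  # equivalent matrix multiply

print(torch.allclose(lookup, one_hot))  # True: same weights, cheaper computation
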
Category: Data Science

Why are n-grams language independent?

I don't understand how n-grams are language independent. I've read that by using the character n-grams of a word, rather than the word itself, as the dimensions of a vector space model, we can skip language-dependent pre-processing such as stemming and stop-word removal. Can someone please provide the reasoning for this?
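
For illustration, character n-grams let morphological variants overlap without any language-specific stemmer; a tiny sketch:

def char_ngrams(word, n=3):
    w = f"<{word}>"  # boundary markers, as in fastText-style subwords
    return {w[i:i + n] for i in range(len(w) - n + 1)}

a, b = char_ngrams("running"), char_ngrams("runner")
print(a & b)  # shared trigrams like '<ru', 'run', 'unn' link the two variants
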
Category: Data Science
