I am experimenting with a document retrieval system in which documents are represented as vectors. When queries come in, they are converted to vectors by the same method used for the documents, and the query vector's k nearest neighbors are retrieved as the results. Each query has a known answer string. To improve performance, I am now looking to create a model that modifies the query vector. What I was looking to do was use a model …
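A minimal sketch of the setup described above, assuming NumPy document vectors and a hypothetical linear query-modification matrix `W` (a stand-in for whatever model is eventually trained against the known answers):

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vectors, k=5, W=None):
    # doc_vectors: (n_docs, dim) matrix; query_vec: (dim,) from the same embedder.
    # Optionally modify the query with a learned linear transform W (dim x dim).
    if W is not None:
        query_vec = W @ query_vec
    # Cosine similarity between the (modified) query and every document
    doc_norms = np.linalg.norm(doc_vectors, axis=1)
    sims = doc_vectors @ query_vec / (doc_norms * np.linalg.norm(query_vec) + 1e-12)
    # Indices of the k nearest neighbors, most similar first
    return np.argsort(-sims)[:k]
```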
I have already seen this, this and this question, but none of the suggestions fixed my problem (so I have reverted them). I have the following code (imports added for completeness):

```python
import spacy
from spacy.lang.en import English
from sklearn.base import TransformerMixin

nlp = spacy.load('en_core_web_sm')
parser = English()

class CleanTextTransformer(TransformerMixin):
    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

def cleanText(text):
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text

def tokenizeText(sample):
    tokens = parser(sample)
    lemmas = …
```
I built a classifier of documents using the vector representation of each document in the training set (i.e. a row in the Document-Term Matrix). Now I need to test the model on the test data. But how can I represent a new document with the Document-Term Matrix, since some terms might not be included in the training data?
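A minimal sketch of the usual handling, assuming scikit-learn: the vectorizer's vocabulary is fixed on the training set, and `transform` simply drops any term that was never seen during training, so new documents always land in the same column space:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed"]          # "meowed" never appears in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # unseen terms are ignored

print(X_test.shape[1] == X_train.shape[1])      # True: same columns as training
```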
In mathematics, a vector has both magnitude and direction. In data science, to identify document similarity we convert each document into a feature vector and then apply the cosine formula between the source and target documents' feature vectors. However, the cosine formula is applicable only to vectors, and a vector should have both magnitude and direction. For a document that is represented as a vector, where is the direction?
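To make the geometry concrete, a small sketch with made-up term counts: each axis is one vocabulary term, so the "direction" of a document vector is the mix of terms it points toward, and cosine similarity compares only that direction, not the length:

```python
import numpy as np

# Axes: counts of the terms ["cat", "dog", "fish"] (toy example)
doc_a = np.array([2.0, 1.0, 0.0])
doc_b = np.array([4.0, 2.0, 0.0])   # same term mix, twice the magnitude

cos = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cos)  # 1.0: identical direction, so magnitude plays no role
```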
I have multiple vector fields in one collection. My use case is to find similar sentences in similar contexts. The sentences and contexts are encoded as float vectors, so I have one vector for the sentence and another vector for the context (the surrounding text). I would like to take both vectors into consideration to find similar sentences. Unfortunately, most approximate nearest neighbor (ANN) search libraries only support searching over a single field. I have tried to use PostgreSQL with the cube extension …
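One common workaround (my suggestion, not from the question itself): concatenate the two normalized vectors with weights, so a single-field ANN index can score both at once. With L2-normalized parts, the inner product of two combined vectors equals the weighted sum of the sentence and context cosine similarities:

```python
import numpy as np

def combine(sentence_vec, context_vec, w_sentence=0.7, w_context=0.3):
    # Normalize each part so neither field dominates by raw magnitude,
    # then scale by sqrt of the weight and concatenate.
    s = sentence_vec / np.linalg.norm(sentence_vec)
    c = context_vec / np.linalg.norm(context_vec)
    return np.concatenate([np.sqrt(w_sentence) * s, np.sqrt(w_context) * c])

# Inner product of two combined vectors =
#   w_sentence * cos(sentence_1, sentence_2) + w_context * cos(context_1, context_2)
```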
I have vectors of the same length where each entry can have the value 0, 1 or null. V = {[0,1,1,1,null,0], [null,1,0,null,0,1], ...} How can I perform dimensionality reduction of these vectors into a lower-dimensional space (in this case 2D)?
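One possible approach (an assumption, not from the question): compute pairwise dissimilarities that ignore the null entries, then embed the resulting distance matrix with MDS:

```python
import numpy as np
from sklearn.manifold import MDS

# null encoded as np.nan
V = np.array([[0, 1, 1, 1, np.nan, 0],
              [np.nan, 1, 0, np.nan, 0, 1],
              [0, 0, 1, 1, 1, 0]])

n = len(V)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        mask = ~np.isnan(V[i]) & ~np.isnan(V[j])   # entries observed in both vectors
        # Mean disagreement over jointly observed entries (Hamming-style);
        # fall back to maximal dissimilarity when there is no overlap.
        D[i, j] = np.abs(V[i, mask] - V[j, mask]).mean() if mask.any() else 1.0

coords = MDS(n_components=2, dissimilarity='precomputed').fit_transform(D)
print(coords.shape)  # (3, 2): one 2D point per original vector
```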
If we have a vector $q$ and a set of vectors $D = \{d_1, d_2, ..., d_l\}$, is there a way to create functions $QF$ and $DF$ such that $QF(q)^T DF(D) \approx \max_i(q^T d_i)$? Use case: I want to build an information retrieval system in which documents are represented by an arbitrary but small ($<100$) number of vectors and the query is represented by a single vector. Ideally, I would like to sort the documents based on $\max_i(q^T d_i)$, but storing all …
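Two grounding observations (standard identities, not a full answer): choosing $DF$ as mean pooling with $QF(q) = q$ gives a lower bound on the target, and log-sum-exp smooths the max with an explicit error bound, though it does not factor into a single inner product:

$$q^T \left(\frac{1}{l}\sum_{i=1}^{l} d_i\right) = \frac{1}{l}\sum_{i=1}^{l} q^T d_i \;\le\; \max_i q^T d_i, \qquad \max_i q^T d_i \;\le\; \frac{1}{\beta}\log\sum_{i=1}^{l} e^{\beta q^T d_i} \;\le\; \max_i q^T d_i + \frac{\log l}{\beta}.$$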
I have a few thousand faces (from the celebrity dataset LFW), each person represented by a 512 × 1 vector. I stored them in a vector DB to build a face-search system using embedded features (MTCNN for face detection and ArcFace as the embedding model). Someone suggested that I add many vectors as "dummy faces" to the database with an unknown class (the number of these vectors being larger than the number of person classes). It's still unclear to me why I need to add …
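For context, a common open-set lookup pattern (my assumption about what the suggestion is aiming at, not a confirmed explanation): only return a named identity when the best match clears a similarity threshold, otherwise fall back to "unknown":

```python
import numpy as np

def identify(query_emb, db_embs, db_labels, threshold=0.5):
    # Cosine similarity against every stored embedding
    # (rows of db_embs are assumed L2-normalized)
    sims = db_embs @ query_emb / np.linalg.norm(query_emb)
    best = int(np.argmax(sims))
    # Below the threshold, refuse to name anyone rather than force a match
    return db_labels[best] if sims[best] >= threshold else "unknown"
```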
Hi, I am working on a movie recommendation system and I have to measure the alikeness between the main user and other users. For example, the main user watched 3 specific movies and rated them 8, 5, 7. A user who happened to watch the same movies rated them 8, 2, 3; another user of the same kind rated those movies 7, 6, 6; and some other user watched only the first two movies and rated them 8, 5. Now the question …
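A minimal sketch of one standard choice (an assumption, since the question is cut off): Pearson correlation computed only over the movies both users have rated, which naturally handles the user who watched just two of them:

```python
import numpy as np

def pearson_sim(ratings_a, ratings_b):
    # ratings are dicts: movie -> rating; compare only co-rated movies
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # not enough overlap to correlate
    a = np.array([ratings_a[m] for m in common], dtype=float)
    b = np.array([ratings_b[m] for m in common], dtype=float)
    if a.std() == 0 or b.std() == 0:
        return 0.0  # constant ratings: correlation is undefined
    return float(np.corrcoef(a, b)[0, 1])

main = {"m1": 8, "m2": 5, "m3": 7}
print(pearson_sim(main, {"m1": 8, "m2": 2, "m3": 3}))
print(pearson_sim(main, {"m1": 7, "m2": 6, "m3": 6}))
print(pearson_sim(main, {"m1": 8, "m2": 5}))  # only two co-rated movies
```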
I have these 200 vectors, which were clustered using K-means based on the keyword-weight similarity given by TF-IDF (Term Frequency - Inverse Document Frequency). The vectors were clustered with respect to the vectors in four cities: Amsterdam, Rotterdam, The Hague and Utrecht. I chose k = 6 centroids, which means I have cluster 0 to cluster 5. For each cluster, I also calculated the average numerical keyword weight so that I …
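A minimal sketch of that pipeline, assuming scikit-learn and using random data as a stand-in for the real 200-row TF-IDF matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 745))           # stand-in for the real TF-IDF matrix

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)

# Average keyword weight per cluster: mean TF-IDF value over its member vectors
for c in range(6):
    members = X[km.labels_ == c]
    print(c, len(members), members.mean(axis=0).mean())
```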
I want to find items that are similar to items users already have in their collection. Every item has attributes, so I created feature vectors where every element of the vector represents an attribute and is either $0$ or $1$ (depending on whether an item has that attribute). For the user collection I summed all the vectors, creating one vector which I then used to calculate similarities with other items. Is this a correct approach, or should I make this "user vector", …
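A small sketch of the approach and one variant worth testing (the binarized version is my assumption, not a verdict): with cosine similarity the overall scale of the summed vector does not matter, but summing versus capping at 1 changes how strongly attributes that recur across the collection dominate the match:

```python
import numpy as np

items = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [1, 0, 0, 1]])          # binary attribute vectors in the collection

user_sum = items.sum(axis=0)              # [3, 1, 1, 1]: attribute counts
user_any = (user_sum > 0).astype(float)   # [1, 1, 1, 1]: "has the attribute at all"

candidate = np.array([1, 0, 1, 1])

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(user_sum, candidate), cos(user_any, candidate))  # different rankings possible
```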
Standard introductory examples in Word2Vec, like king - queen = man - woman and tokyo - japan = london - uk, involve one-to-one relationships between words: Tokyo is the exclusive capital of Japan. More generally, we might want to test for many-to-one relationships: e.g., we might want to ask whether Kyoto is a city in Japan. I presume we are still interested in vectors of the form kyoto - japan, houston - us, etc., but these vectors are no longer …
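A sketch of one way to probe this, assuming gensim `KeyedVectors` (the file path, the vocabulary tokens, and the averaging of known city-country offsets are all illustrative choices, not an established test):

```python
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical path

def offset(a, b):
    # a, b assumed present in the model's vocabulary
    return kv[a] - kv[b]

# Prototype "city in country" direction: average over known pairs
proto = np.mean([offset("houston", "us"), offset("osaka", "japan")], axis=0)

cand = offset("kyoto", "japan")
cos = cand @ proto / (np.linalg.norm(cand) * np.linalg.norm(proto))
print(cos)  # higher values suggest kyoto relates to japan like a city to its country
```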
I am trying to find a distance formula or method that gives a non-commutative distance between two points in a feature space. Suppose there are two movies represented in an $\mathbb{R}^n$ feature space. When I find the distance/similarity between these movies using their feature vectors, I want different values depending on which movie is the reference point, i.e., Dist(Mov1, Mov2) != Dist(Mov2, Mov1). I know this is slightly vague, but I …
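One standard asymmetric measure to consider (my suggestion, not from the question): the Kullback-Leibler divergence between the movies' feature vectors treated as probability distributions, which is non-commutative by construction:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Normalize non-negative feature vectors into distributions
    p = p / p.sum()
    q = q / q.sum()
    # KL(p || q) != KL(q || p): the reference point matters
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

mov1 = np.array([0.9, 0.05, 0.05])
mov2 = np.array([0.4, 0.4, 0.2])
print(kl_divergence(mov1, mov2), kl_divergence(mov2, mov1))  # two different values
```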
Let's say I have my 300-dimensional word embedding trained with Word2Vec, containing 10,000 word vectors. I have additional data on the 10,000 words in the form of a vector (10,000 × 1) containing values between 0 and 1. Can I simply append the vector to the word embedding so that I have a 301-dimensional embedding? I am looking to calculate similarities between word vectors using cosine similarity.
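A sketch of the mechanics, with a scaling factor `alpha` as a knob I am assuming you would want: a single extra value in [0, 1] is tiny next to 300 unnormalized dimensions, so how much it moves the cosine similarity depends entirely on its relative scale:

```python
import numpy as np

emb = np.random.randn(10000, 300)          # stand-in for the trained embeddings
extra = np.random.rand(10000, 1)           # additional per-word feature in [0, 1]

alpha = 1.0                                 # weight given to the extra dimension
emb301 = np.hstack([emb, alpha * extra])    # (10000, 301)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(emb301[0], emb301[1]))  # compare against cos(emb[0], emb[1]) to see the effect
```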
I'm trying to model the relationship between a declared value from a subject and a stimulus. For example, modeling the relationship between the subject's happiness and the strength of a stimulus so that we can predict the subject's sadness from stimuli. (The happiness values are five-point ratings; the stimuli are continuous values.) Emotions like happiness are obviously autocorrelated, and I think modeling these autocorrelations might help the model make better predictions. However, we can only observe happiness (the actual value) in the training phase …
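A minimal sketch of one way to encode that autocorrelation (an assumption, since the question is cut off): add the previous rating as a lagged feature during training; at prediction time, where the true past rating is unobserved, feed the model's own previous prediction back in:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: stimulus strength and the declared 1-5 rating over time
stimulus = np.array([0.1, 0.4, 0.5, 0.9, 0.8, 0.3, 0.2, 0.7])
rating = np.array([1, 2, 3, 5, 4, 2, 2, 4])

# Features: current stimulus + previous rating (lag-1)
X = np.column_stack([stimulus[1:], rating[:-1]])
y = rating[1:]
model = LinearRegression().fit(X, y)

# At test time the true past rating is unavailable: recurse on predictions
prev = rating[-1]
for s in [0.6, 0.1]:
    prev = model.predict([[s, prev]])[0]
    print(prev)
```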
I have two vector-space models with different dimensionalities. The number of vectors in one model is the same as the number of vectors in the other, i.e., if I have a vector representation for a car in one model, I have a vector representation for a car in the other model, but the number of dimensions can be different. I want to combine these models (and then cluster using the combined model). I cannot average (BoW) or add these models together, as …
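One common combination (my assumption about a workable route, since the question is cut off): L2-normalize each model's vector for an item and concatenate them, so both spaces contribute on a comparable scale before clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

model_a = np.random.randn(1000, 100)   # stand-ins for the two vector-space models
model_b = np.random.randn(1000, 300)   # same items, different dimensionality

def l2norm(M):
    # Row-normalize so neither model dominates by raw vector magnitude
    return M / np.linalg.norm(M, axis=1, keepdims=True)

combined = np.hstack([l2norm(model_a), l2norm(model_b)])   # (1000, 400)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(combined)
```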
Long story short, I have 200 vectors as a result of running TF-IDF (Term Frequency - Inverse Document Frequency) on thousands of keywords across hundreds of vectors. The total number of unique keywords I got is 745, meaning that there are 745 dimensions/axes. Now, I was wondering: how does K-means clustering work on those 200 vectors? Is it accurate to say that K-means clusters those 200 vectors by keyword-weight similarity?
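To make the mechanics concrete, a toy sketch of the K-means loop in that 745-dimensional space ("similarity" here is really Euclidean closeness of the TF-IDF weight vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 745))                  # stand-in for the 200 TF-IDF vectors
centroids = X[rng.choice(200, size=6, replace=False)]

for _ in range(10):
    # Assignment step: each vector joins the centroid nearest in Euclidean distance
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its members
    # (keep the old centroid if a cluster ends up empty)
    centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                          else centroids[c] for c in range(6)])
```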
What is the difference between the positional vector and the attention vector used in the Transformer model? I saw a video on YouTube in which the definition of a positional vector was given as "a vector that gives context based on the position of a word in a sentence", and the definition of an attention vector was given as "for every word we can have an attention vector generated which captures the contextual relationship between words in a sentence". Capturing context information based on distance (positional vector) and attention (attention vector) …
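A compact sketch of both pieces as they appear in the original Transformer (fixed sinusoidal position encodings versus attention outputs computed from the data), with toy shapes:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Each position gets a fixed vector of sines/cosines at varying frequencies,
    # so the model can tell words apart by where they sit in the sentence.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def attention(Q, K, V):
    # Each word's output (its "attention vector") is a weighted mix of all words'
    # values, weighted by how well its query matches the other words' keys.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

x = np.random.randn(5, 8) + positional_encoding(5, 8)  # 5 words, d_model = 8
out = attention(x, x, x)   # self-attention over the position-aware embeddings
```

So the positional vector is a fixed, input-independent label for "where", while the attention vector is computed per word from the whole sentence and captures "what relates to what".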
Several neural network libraries such as TensorFlow and PyTorch offer an Embedding layer. Having implemented word2vec in the past, I understand the reasoning behind wanting a lower-dimensional representation. However, it would seem the embedding layer is just a linear layer. All other things being equal, would an embedding layer not just learn the same weights as the equivalent linear layer? If so, then what are the advantages of using an embedding layer? In the case of word2vec, the lower …
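A quick check of the equivalence in PyTorch: an embedding lookup returns exactly what multiplying a one-hot vector by the same weight matrix would, so the layer is the efficient, sparse version of that matrix product (a lookup instead of a full matmul over the vocabulary):

```python
import torch
import torch.nn.functional as F

vocab, dim = 10, 4
emb = torch.nn.Embedding(vocab, dim)

idx = torch.tensor([3, 7])
one_hot = F.one_hot(idx, num_classes=vocab).float()

lookup = emb(idx)                 # direct table lookup, no matmul
matmul = one_hot @ emb.weight     # one-hot times the same weight matrix

print(torch.allclose(lookup, matmul))  # True: identical outputs and gradients flow
```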
I don't understand how n-grams are language-independent. I've read that by using the character n-grams of a word, rather than the word itself, as the dimensions of a vector space model, we can skip language-dependent pre-processing such as stemming and stop-word removal. Can someone please provide the reasoning for this?
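A small illustration of the mechanism, assuming scikit-learn: character trigrams of inflected forms overlap heavily without any stemmer, and the same extractor runs unchanged on any language's text since it never consults a vocabulary or grammar:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["connect connected connecting", "verbinden verbunden"]  # English and German

vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vec.fit_transform(docs)

# "con", "onn", "nne", ... are shared by all the "connect*" forms, so related
# inflections land near each other with no language-specific stemming rules.
print(sorted(vec.get_feature_names_out())[:10])
```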