I'm attempting to train a spaCy model to compute semantic similarity, but I'm not getting the results I would expect. I have created two text files containing many sentences that use a new term, "PROJ123456" (for example, "PROJ123456 is on track."). I've added each to a DocBin and saved them to disk as train.spacy and dev.spacy. I'm then running:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The config.cfg file contains:

[paths]
train …
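For reference, a minimal sketch of how train.spacy / dev.spacy files of this kind are typically built with DocBin (the example sentences and the blank English pipeline are assumptions based on the description above):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assuming an English pipeline
sentences = ["PROJ123456 is on track.", "PROJ123456 slipped by a week."]  # illustrative examples

db = DocBin()
for text in sentences:
    db.add(nlp.make_doc(text))
db.to_disk("./train.spacy")  # repeat with held-out sentences for ./dev.spacy
```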
How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which is great but not perfect: a limitation arises because the embeddings are based on surrounding words, which leads to problematic results. For example, polar meanings: word embeddings may find opposites to be similar. Even though such words are semantic opposites, they can readily be interchanged given the same preceding and following words. For example, "terrible" and …
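As an illustration of the starting point described above, a minimal sketch of clustering pre-trained word vectors with k-means (the vector file path, the word list, and the number of clusters are assumptions):

```python
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# load pre-trained Google News vectors (path is an assumption)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

words = ["good", "great", "terrible", "awful", "car", "truck"]  # illustrative vocabulary
vectors = [wv[w] for w in words]

# group words into clusters of (hopefully) similar meaning
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)
```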
I'm currently working on measuring semantic proximity between sentences. I use fasttext train_unsupervised (skipgram) for this: I extract sentence embeddings and then measure the cosine similarity between them. However, I ran into the following problem: the cosine similarity between the embeddings of these sentences, "Create a documentation of product A" and "he is creating a documentation of product B", is very high (>0.9). Obviously this is because both of them are about creating documentation, but the first …
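A minimal sketch of the setup described (the corpus file name is an assumption):

```python
import fasttext
import numpy as np

# train an unsupervised skipgram model on a plain-text corpus (path is an assumption)
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = model.get_sentence_vector("Create a documentation of product A")
s2 = model.get_sentence_vector("he is creating a documentation of product B")
print(cosine(s1, s2))  # reportedly comes out very high (> 0.9)
```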
There is a problem we are trying to solve where we want to do semantic search on our set of data, i.e. we have domain-specific data (for example, sentences talking about automobiles). Our data is just a bunch of sentences, and what we want is to give a phrase and get back the sentences which are similar to that phrase, have a part that is similar to the phrase, or have a contextually similar meaning. Let …
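A minimal sketch of one way this kind of retrieval is often set up, using sentence embeddings (the model name and example sentences are assumptions, not something given in the question):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

corpus = [
    "The sedan has excellent fuel economy.",
    "Brake pads should be replaced every 50,000 km.",
    "The new SUV comes with adaptive cruise control.",
]  # illustrative domain sentences

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("fuel efficient cars", convert_to_tensor=True)

# top-k nearest sentences by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```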
I have thousands of headlines and I would like to build a semantic network using word2vec, specifically the Google News vectors. My sentences look like:

Titles
Dogs are humans’ best friends
A dog died because of an accident
You can clean dogs’ paws using natural products.
A cat was found in the kitchen

And so on. What I would like to do is find specific patterns within this data, e.g. similarity in topics on dogs and cats, using semantic networks. …
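For context, a minimal sketch of turning word2vec similarities into a small semantic network (the vector file path, the keyword list, and the similarity threshold are assumptions):

```python
import itertools
import networkx as nx
from gensim.models import KeyedVectors

# pre-trained Google News word2vec vectors (path is an assumption)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

keywords = ["dog", "cat", "puppy", "kitchen", "accident"]  # illustrative headline keywords

# connect words whose similarity exceeds an arbitrary threshold
graph = nx.Graph()
for a, b in itertools.combinations(keywords, 2):
    sim = wv.similarity(a, b)
    if sim > 0.3:
        graph.add_edge(a, b, weight=float(sim))

print(graph.edges(data=True))
```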
I am trying to understand Siamese networks. In these, a vector is computed for an object (say, an image) and a distance metric (say, Manhattan) is applied to the two vectors produced by the neural network(s). The tutorials available on the internet apply the idea mostly to images. If I compare it with Gensim semantic similarity, there we also have vectors for two objects (words or sentences) and then compute a cosine similarity to measure the difference. (remember example …
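A minimal sketch of the Siamese idea as described, with one shared encoder and a Manhattan distance on the outputs (the encoder architecture and input sizes are invented purely for illustration):

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self):
        super().__init__()
        # one shared encoder applied to both inputs
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 32)
        )

    def forward(self, x1, x2):
        v1, v2 = self.encoder(x1), self.encoder(x2)
        # Manhattan (L1) distance between the two embeddings
        return torch.sum(torch.abs(v1 - v2), dim=1)

net = SiameseNet()
a, b = torch.rand(4, 1, 28, 28), torch.rand(4, 1, 28, 28)  # dummy image batches
print(net(a, b))
```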
It seems like the new standard in text search is sentence or document similarity, using things like BERT sentence embeddings. However, these don't really have a way to consider the salience of sentences, which can make it hard to compare different searches. For example, when using concept embeddings I'd like to be able to score "Exam" <-> "Exam" as less important than "Diabetes" <-> "High blood sugar". But obviously, the former has a similarity score of 1. I've tried using …
To measure the similarity between two documents, one can use, e.g., TF-IDF / cosine similarity. Suppose that after calculating the similarity scores of Doc A against a list of documents (Doc B, Doc C, ...), we got:

Document Pair      Similarity Score
Doc A vs. Doc B    0.45
Doc A vs. Doc C    0.30
Doc A vs. ...      ...

Of course, Doc B seems to be the closest one to Doc A in terms of similarity. But what if users, as humans, think Doc …
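A minimal sketch of the TF-IDF / cosine similarity computation referred to above (the document texts are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "Doc A": "the quarterly report covers revenue and growth",
    "Doc B": "revenue growth is covered in the quarterly report",
    "Doc C": "the cafeteria menu changes every week",
}  # illustrative contents

tfidf = TfidfVectorizer().fit_transform(docs.values())
scores = cosine_similarity(tfidf[0], tfidf[1:])  # Doc A vs. the rest
for name, score in zip(list(docs)[1:], scores[0]):
    print(f"Doc A vs. {name}: {score:.2f}")
```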
My aim is to derive textual similarity using multiple features. Some of the features are textual, for which I am using the (TF Hub 2.0) Universal Sentence Encoder. Other categorical features are encoded with a one-hot encoder. For example, for a single record in my dataset, the feature vector looks like this: the text feature's embedding is a 512-dimensional vector (1 x 512); the categorical (non-ordered) feature vector is 1 x 500 (since there are 500 unique values in the feature); my …
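A minimal sketch of building such a combined feature vector (the TF Hub handle is the standard USE v4 URL; the categorical vocabulary is a small stand-in for the real 500 values):

```python
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder from TF Hub (512-dimensional sentence embeddings)
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

text_emb = use(["record description text"]).numpy()  # shape (1, 512)

categories = ["catA", "catB", "catC"]  # illustrative vocabulary (the real one has 500 values)
one_hot = np.zeros((1, len(categories)))
one_hot[0, categories.index("catB")] = 1.0  # one-hot encode this record's category

record_vector = np.concatenate([text_emb, one_hot], axis=1)  # combined feature vector
print(record_vector.shape)
```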
Problems:

1. How to find an appropriate measurement method. There are several ways to measure sentence similarity, but I have no idea how to find the appropriate method among them for my data (sentences). Related question on Stack Overflow: is there a way to check similarity between two full sentences in Python?

2. Sentence or paragraph based. If it is possible to acquire both one sentence and a paragraph which includes that sentence, which is more accurate for measuring the similarity among …
Let's say I have multiple tags that I need to compare semantically. For example:

tags = ['Python', 'Football', 'Programming', 'Handball', 'Chess', 'Cheese', 'board game']

I would like to compare these tags (and many more) semantically to find a similarity value between 0 and 1. For example, I want to get values like these:

f('Chess', 'Cheese') = 0.0      # the tags look similar, but mean very different things
f('Chess', 'board game') = 0.9  # because chess is a board game
f('Football', 'Handball') = 0.3 …
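A minimal sketch of one way to get such a function from pre-trained vectors (the spaCy model choice is an assumption; note that cosine-based scores are not guaranteed to match the target values above and can even be negative):

```python
import spacy

# medium English model ships with word vectors (install: python -m spacy download en_core_web_md)
nlp = spacy.load("en_core_web_md")

def f(tag_a, tag_b):
    # spaCy averages token vectors, so multi-word tags like 'board game' also work
    return nlp(tag_a.lower()).similarity(nlp(tag_b.lower()))

print(f("Chess", "Cheese"))
print(f("Chess", "board game"))
print(f("Football", "Handball"))
```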
I intend to calculate the accuracy of a generated caption by comparing it to a number of reference sentences. For example, the captions for one video are as follows (these captions are for the same video only; however, the reference sentences have been broken down with respect to different segments of the video).

Reference sentences (R):
A man is walking along while pushing his bicycle.
He tries to balance himself by taking support from a pole.
Then he falls on the …
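As one illustration of scoring a candidate caption against multiple references, a minimal sketch using an n-gram overlap metric (BLEU); the candidate caption here is invented, and this is only one of several possible metrics:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "A man is walking along while pushing his bicycle.".lower().split(),
    "He tries to balance himself by taking support from a pole.".lower().split(),
]
candidate = "a man pushes his bicycle and loses his balance".split()  # illustrative generated caption

# BLEU scores the candidate against all reference segments at once
score = sentence_bleu(references, candidate, smoothing_function=SmoothingFunction().method1)
print(score)
```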
Question answering is a common topic within NLP, but my problem is a little different: rather than answering a question, I have a question, an (open-ended) answer, and what I want to check is if that answer is correct. For instance, if I have the question: "Have you done X?" I would like to be able to say that "Yes, I have done X." is correct, and "Yes, I have done Y." is incorrect. Going a step further, this should …
I have a set of .pdf/.docx documents with content. I need to search for the most suitable document according to a particular sentence. For instance, for the sentence "Security in the work environment", the system should return the most appropriate document, i.e. one which contains at least the content expressed in the sentence. It should be a sort of search bar with advanced capabilities. I have a constraint: I cannot have an a priori classification, since the number of documents and the related category …
I have a project idea where I train Doc2Vec on a bunch of documents, then take a novel input doc and, ideally, be told how similar it is to the docs supplied for training as a whole, or how well it "fits" with the training docs. Is there a way to do this?
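A minimal sketch of this idea with gensim's Doc2Vec (the documents and hyperparameters are invented for illustration; what "fits as a whole" should mean is left open):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_texts = [
    "machine learning improves search relevance",
    "neural networks require large datasets",
    "the recipe calls for two cups of flour",
]  # illustrative training documents

corpus = [TaggedDocument(t.split(), [i]) for i, t in enumerate(train_texts)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# infer a vector for a new document and compare it to the training docs
new_vec = model.infer_vector("deep learning models need lots of data".split())
print(model.dv.most_similar([new_vec], topn=3))
```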
I am doing a project on semantic text analysis where my data has a Technical Skills column (so I have to train on the data to find similar words), and the entries are words, not sentences. I wish to find similar technical skills when I pass in a word. I am aware of Word2Vec and GloVe. My issue is that I may pass in, for example, "Pyton", which is actually "Python"; since the misspelled word is not among the trained words, it will …
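One way misspellings like this are often handled before the embedding lookup is simple fuzzy matching against the known vocabulary; a minimal stdlib sketch (the vocabulary and cutoff are invented):

```python
import difflib

vocabulary = ["Python", "Java", "Hadoop", "Tableau", "Kubernetes"]  # illustrative trained skills

def normalize(skill):
    # snap a possibly misspelled skill to the closest known vocabulary entry
    matches = difflib.get_close_matches(skill, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else skill

print(normalize("Pyton"))  # -> "Python"
```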
I apologize if this question lacks clarity; my mathematical background on the topic is limited and I was hoping to find some guidance. I would like to compare two distance matrices that contain pairwise semantic (cosine) similarities for a set of 33 sentences. The matrices were created from sentence embeddings, i.e., embeddings of full sentences in a vector space (I used Google's Universal Sentence Encoder, so the vector space has 512 dimensions). The sets of sentences that underlie the 2 distance …
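For concreteness, a minimal sketch of how such a pairwise similarity matrix is typically formed from sentence embeddings (random vectors stand in here for the real 512-dimensional USE embeddings):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(33, 512))  # stand-in for the real USE sentence embeddings

sim_matrix = cosine_similarity(embeddings)  # 33 x 33 pairwise cosine similarities
print(sim_matrix.shape)
```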
I've computed the cosine similarity between a & b (= x) and between b & c (= y). I could use the same embeddings to compute the similarity between a and c (call it z). However, I'm in a situation where I only have the similarity measures x and y. How can I find the similarity between a & c without the original embeddings? If I use a plane to represent this, then I will have an infinite number of solutions. Are there any approaches which …
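For what it's worth, a short sketch of the constraint the two known similarities place on z (this follows from the angle version of the triangle inequality for unit vectors; it only bounds z rather than pinning it down):

```python
import math

def z_range(x, y):
    # with cos(angle_ab) = x and cos(angle_bc) = y, the cosine between a and c
    # can only lie in [cos(angle_ab + angle_bc), cos(|angle_ab - angle_bc|)]
    spread = math.sqrt(1 - x * x) * math.sqrt(1 - y * y)
    return x * y - spread, x * y + spread

print(z_range(0.8, 0.6))  # e.g. x = 0.8, y = 0.6 -> z lies somewhere in this interval
```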
I'm using BERT to transform text into a 768-dimensional vector; it's multilingual:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

Now I want to put the model into production, but the embedding time is too long, and I want to reduce and optimize the model to cut the embedding time. What libraries would enable me to do this?
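As one illustration of the kind of optimization meant here, a sketch of dynamic int8 quantization of the transformer's linear layers for CPU inference (this is just one option, not a recommendation; any accuracy impact would need to be measured):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# model[0] is the sentence-transformers Transformer module; .auto_model is the
# wrapped Hugging Face model whose linear layers get quantized to int8
model[0].auto_model = torch.quantization.quantize_dynamic(
    model[0].auto_model, {torch.nn.Linear}, dtype=torch.qint8
)

embeddings = model.encode(["a quick test sentence"])
print(embeddings.shape)
```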
I'm storing sentences in Elasticsearch as a dense_vector field and used BERT for the embedding, so each vector is 768-dimensional. Elasticsearch offers similarity function options like Euclidean, Manhattan, and cosine similarity. I have tried them, and both Manhattan and cosine give me very similar, good results; now I don't know which one I should choose.
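For context, a minimal sketch of the kind of script_score query being compared here, over a dense_vector field (the index name, field name, and cluster URL are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
query_vector = [0.0] * 768                   # stand-in for a real BERT sentence embedding

body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # swap in l1norm(...) or l2norm(...) to compare Manhattan / Euclidean scoring
                "source": "cosineSimilarity(params.qv, 'sentence_vector') + 1.0",
                "params": {"qv": query_vector},
            },
        }
    }
}

response = es.search(index="sentences", body=body)  # 'sentences' index name is an assumption
print(response["hits"]["hits"][:3])
```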