Train a spaCy model for semantic similarity

I'm attempting to train a spaCy model to compute semantic similarity, but I'm not getting the results I would anticipate. I have created two text files that contain many sentences using a new term, "PROJ123456", for example, "PROJ123456 is on track." I've added each to a DocBin and saved them to disk as train.spacy and dev.spacy. I'm then running:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The config.cfg file contains: [paths] train …
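For reference, a minimal sketch of how such a DocBin file can be built (the sentence list here is a placeholder):

import spacy
from spacy.tokens import DocBin

# Placeholder training sentences that use the new term
sentences = ["PROJ123456 is on track.", "PROJ123456 slipped by a week."]

nlp = spacy.blank("en")            # tokenizer-only pipeline
db = DocBin()
for text in sentences:
    db.add(nlp.make_doc(text))     # add each Doc to the DocBin
db.to_disk("./train.spacy")        # same procedure for dev.spacy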
Category: Data Science

Cluster words into groups of similar meaning (synonyms)

How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which works well but not perfectly: because the word embeddings are based on surrounding words, they introduce some problematic results. For example, polar meanings: word embeddings might find opposites to be similar. Even though such words mean the opposite semantically, they can quite readily be interchanged between the same preceding and following words. For example, "terrible" and …
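One common approach, sketched below under the assumption that gensim's downloader and scikit-learn are available, is to cluster the pre-trained vectors directly; note this inherits the antonym problem described above:

import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("word2vec-google-news-300")        # pre-trained Google News vectors
words = ["good", "great", "terrible", "awful", "car", "truck"]  # placeholder vocabulary
vectors = [wv[w] for w in words]

kmeans = KMeans(n_clusters=3, n_init=10).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(label, word)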
Category: Data Science

Which model is better able to understand that two sentences are talking about different things?

I'm currently working on the task of measuring semantic proximity between sentences. I use fasttext train_unsupervised (skipgram) for this: I extract the sentence embeddings and then measure the cosine similarity between them. However, I ran into the following problem: the cosine similarity between the embeddings of the sentences "Create a documentation of product A" and "he is creating a documentation of product B" is very high (>0.9). Obviously this is because both of them are about creating documentation; however, the first …
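For concreteness, a sketch of the described setup (the corpus path is a placeholder):

import numpy as np
import fasttext

# Train skipgram embeddings on a plain-text corpus file
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")

v1 = model.get_sentence_vector("Create a documentation of product A")
v2 = model.get_sentence_vector("he is creating a documentation of product B")
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)   # reportedly > 0.9 for sentence pairs like this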
Category: Data Science

Semantic Search

There is a problem we are trying to solve where we want to do semantic search on our set of data, i.e., we have domain-specific data (for example, sentences talking about automobiles). Our data is just a bunch of sentences, and what we want is to give a phrase and get back the sentences which:

are similar to that phrase;
have a part that is similar to the phrase;
have contextually similar meanings.

Let …
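A minimal sketch of this kind of retrieval, assuming the sentence-transformers library; the model choice, corpus, and query are placeholders:

from sentence_transformers import SentenceTransformer, util

corpus = ["The sedan has great fuel economy.", "This truck tows five tons."]
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("fuel efficient car", convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query phrase
hits = util.semantic_search(query_emb, corpus_emb, top_k=5)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])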
Category: Data Science

Semantic network using word2vec

I have thousands of headlines and I would like to build a semantic network using word2vec, specifically the Google News vectors. My sentences look like the following titles:

Dogs are humans’ best friends
A dog died because of an accident
You can clean dogs’ paws using natural products
A cat was found in the kitchen

And so on. What I would like to do is find some specific patterns within this data, e.g. similarity in topics on dogs and cats, using semantic networks. …
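One possible sketch, assuming gensim's downloader and networkx, that links headline terms whose vectors are close; the term list and threshold are placeholders:

import gensim.downloader as api
import networkx as nx

wv = api.load("word2vec-google-news-300")
terms = ["dog", "cat", "puppy", "kitchen", "accident"]  # terms extracted from headlines

# Connect terms whose cosine similarity clears an arbitrary threshold
g = nx.Graph()
for i, a in enumerate(terms):
    for b in terms[i + 1:]:
        sim = wv.similarity(a, b)
        if sim > 0.4:
            g.add_edge(a, b, weight=float(sim))
print(g.edges(data=True))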
Category: Data Science

Siamese networks vs Semantic similarity (maybe gensim)

I am trying to understand Siamese networks. Here, a vector is calculated for an object (say, an image) by a neural network, and a distance metric (say, Manhattan) is applied to the two vectors produced by the network(s). In the tutorials available on the internet, the idea is applied mostly to images. If I compare it with Gensim semantic similarity, there we also have vectors for two objects (words or sentences) and then compute cosine similarity to measure the difference. (remember example …
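In both cases the comparison step has the same shape: two vectors in, one score out. A tiny illustration with placeholder vectors:

import numpy as np

# Two embedding vectors, regardless of whether they came from a Siamese
# network (images) or from gensim (words/sentences)
u, v = np.random.randn(128), np.random.randn(128)

manhattan = np.abs(u - v).sum()                            # distance: lower = closer
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))   # similarity: higher = closer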
Category: Data Science

Weighting Sentence Similarity by salience or frequency

It seems like the new standard in text search is sentence or document similarity, using things like BERT sentence embeddings. However, these don't really have a way to consider the salience of sentences, which can make it hard to compare different searches. For example, when using concept embeddings I'd like to be able to score "Exam" <-> "Exam" as less important than "Diabetes" <-> "High blood sugar". But obviously, the former has a similarity score of 1. I've tried using …
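One heuristic, shown as an assumption rather than an established method, is to scale the raw similarity by an IDF-style salience weight; the corpus counts below are invented:

import math

# Invented corpus statistics: N documents, per-term document counts
N = 100_000
doc_freq = {"exam": 20_000, "diabetes": 900, "high blood sugar": 700}

def salience(term):
    # IDF as a crude salience proxy: rarer terms count as more informative
    return math.log(N / doc_freq[term])

def weighted_similarity(raw_sim, term_a, term_b):
    # Scale the embedding similarity by the less salient term's weight
    return raw_sim * min(salience(term_a), salience(term_b))

print(weighted_similarity(1.0, "exam", "exam"))                  # downweighted despite sim = 1
print(weighted_similarity(0.8, "diabetes", "high blood sugar"))  # boosted by salience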
Category: Data Science

Document Similarity with User Preference

To measure the similarity between two documents, one can use, e.g., TF-IDF/cosine similarity. Suppose that after calculating the similarity scores of Doc A against a list of documents (Doc B, Doc C, ...), we got:

Document Pair      Similarity Score
Doc A vs. Doc B    0.45
Doc A vs. Doc C    0.30
Doc A vs. ...      ...

Of course, Doc B seems to be the closest one to Doc A in terms of similarity. But what if users, as humans, think Doc …
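For reference, baseline scores like those above could come from a sketch like this (the document texts are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "A": "machine learning for documents",
    "B": "document similarity with machine learning",
    "C": "cooking recipes and kitchen tips",
}
tfidf = TfidfVectorizer().fit_transform(docs.values())
scores = cosine_similarity(tfidf[0:1], tfidf[1:])   # Doc A vs. the rest
print(dict(zip(list(docs)[1:], scores[0])))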
Category: Data Science

Should one-hot encoded categorical features be scaled when used along with text features while deriving semantic similarity?

My aim is to derive textual similarity using multiple features. Some of the features are textual, for which I am using the (TF Hub 2.0) Universal Sentence Encoder. The other features are categorical and are encoded with a one-hot encoder. For example, for a single record in my dataset, the feature vector looks like this: the text feature's embedding is a 512-dimensional vector (1 x 512); the categorical (non-ordered) feature vector is 1 x 500 (since there are 500 unique values in the feature); my …
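A minimal sketch of the concatenation, assuming the TF Hub USE v4 model; the rescaling of the one-hot block is one possible choice, not a rule:

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
text_vec = embed(["record's text feature"]).numpy()[0]   # shape (512,)

onehot = np.zeros(500)        # 500 unique categories
onehot[42] = 1.0              # placeholder category index

# One judgment call: rescale the categorical block so its norm matches the
# text block's, preventing either block from dominating distances
onehot = onehot * (np.linalg.norm(text_vec) / np.linalg.norm(onehot))
combined = np.concatenate([text_vec, onehot])   # shape (1012,)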
Category: Data Science

How to choose similarity measurement between sentences and paragraphs

Problems

1. How to find an appropriate measurement method. There are several ways to measure sentence similarity, but I have no idea how to choose an appropriate method among them for my data (sentences). Related question on Stack Overflow: is there a way to check similarity between two full sentences in python?

2. Sentence- or paragraph-based. If it is possible to acquire both a single sentence and a paragraph which includes that sentence, which is more accurate for measuring the similarity among …
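For problem 2, one way to compare the two granularities empirically is to score both against the same query, sketched here with sentence-transformers and placeholder texts:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentence = "The model compares short texts."
paragraph = ("The model compares short texts. It was trained on sentence pairs, "
             "so very long inputs get truncated.")

query = "How does the model compare texts?"
for candidate in (sentence, paragraph):
    sim = util.cos_sim(model.encode(query), model.encode(candidate))
    print(float(sim))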
Category: Data Science

NLP: Compare tags semantically with machine learning? (finding synonyms)

Let's say I have multiple tags that I need to compare semantically. For example:

tags = ['Python', 'Football', 'Programming', 'Handball', 'Chess', 'Cheese', 'board game']

I would like to compare these tags (and many more) semantically to find a similarity value between 0 and 1. For example, I want to get values like these:

f('Chess', 'Cheese') = 0.0      # tags look similar, but mean very different things
f('Chess', 'board game') = 0.9  # because chess is a board game
f('Football', 'Handball') = 0.3 …
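A sketch of one possible f using pre-trained word vectors via gensim's downloader; note that wv.similarity returns a cosine in [-1, 1], so it would need rescaling to [0, 1], and tags missing from the vocabulary raise KeyError:

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

def f(a, b):
    # Google News vectors use underscores for multi-word phrases
    key_a, key_b = a.replace(" ", "_"), b.replace(" ", "_")
    return wv.similarity(key_a, key_b)

print(f("Chess", "Cheese"))
print(f("Football", "Handball"))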
Category: Data Science

How to Calculate semantic similarity between video captions?

I intend to calculate the accuracy of a generated caption by comparing it to a number of reference sentences. For example, the captions for one video are as follows (all captions are for the same video; however, the reference sentences have been broken down with respect to different segments of the video). Reference sentences (R): A man is walking along while pushing his bicycle. He tries to balance himself by taking support from a pole. Then he falls on the …
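One option is to embed the candidate caption and all reference sentences and take the best-matching segment, sketched here with sentence-transformers (the candidate caption is a placeholder):

from sentence_transformers import SentenceTransformer, util

references = [
    "A man is walking along while pushing his bicycle.",
    "He tries to balance himself by taking support from a pole.",
]
candidate = "A man pushes a bike and holds a pole for balance."

model = SentenceTransformer("all-MiniLM-L6-v2")
sims = util.cos_sim(model.encode(candidate), model.encode(references))
print(float(sims.max()))   # similarity to the best-matching reference segment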
Category: Data Science

NLP: Checking that answers to a question are correct

Question answering is a common topic within NLP, but my problem is a little different: rather than answering a question, I have a question and an (open-ended) answer, and what I want to check is whether that answer is correct. For instance, given the question "Have you done X?", I would like to be able to say that "Yes, I have done X." is correct, and "Yes, I have done Y." is incorrect. Going a step further, this should …
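Plain similarity would score "done X" and "done Y" as close, so a natural-language-inference model may fit this better; a sketch with sentence-transformers' CrossEncoder (the model name is one published option; check its model card for the exact label order before relying on it):

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

premise = "Yes, I have done X."
for hypothesis in ("I have done X.", "I have done Y."):
    scores = model.predict([(premise, hypothesis)])
    print(hypothesis, scores)   # per-label logits (contradiction/entailment/neutral)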
Category: Data Science

Document Content

I have a set of .pdf/.docx documents with content. I need to search for the most suitable document given a particular sentence. For instance, for the sentence "Security in the work environment", the system should return the most appropriate document, one which contains at least the content expressed in the sentence. It should be a sort of search bar with advanced capabilities. I have a constraint: I cannot have an a priori classification, since the number of documents and the related category …
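Assuming the text has already been extracted from the .pdf/.docx files, a minimal ranking sketch with sentence-transformers:

from sentence_transformers import SentenceTransformer, util

# Placeholder extracted contents, keyed by filename
documents = {
    "doc1.pdf": "workplace safety policies and risk assessments",
    "doc2.docx": "vacation request and approval process",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(list(documents.values()), convert_to_tensor=True)
query_emb = model.encode("Security in the work environment", convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_emb)[0]
best = int(scores.argmax())
print(list(documents)[best], float(scores[best]))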
Category: Data Science

Is there a way to train Doc2Vec on a corpus of docs and be able to take a novel doc and see how similar it is to the trained corpus?

I have a project idea where I train Doc2Vec on a collection of documents, then take a novel input doc and ideally be told how similar it is to the training docs as a whole, or how well it "fits" with them. Is there a way to do this?
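The workflow described maps onto gensim's Doc2Vec API roughly as follows (a sketch with placeholder documents):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["first training document", "second training document"]
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=100, epochs=40, min_count=1)

novel = "a brand new document".split()
vec = model.infer_vector(novel)                 # embed the unseen doc
print(model.dv.most_similar([vec], topn=2))     # closest training docs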
Category: Data Science

How to handle words not in the dictionary (while finding similar words)?

I am doing a project on semantic text analysis where my data has a column of technical skills (so I have to train the data to find similar words); these are single words, not sentences. I wish to find similar technical skills when I pass a word. I am aware of using Word2Vec and GloVe. My issue is that if I pass, for example, "Pyton", which is actually "Python", then since the misspelled word is not in the trained words it will …
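One commonly suggested direction for misspellings is FastText, whose character n-grams give vectors even for out-of-vocabulary words; a sketch with placeholder skill lists:

from gensim.models import FastText

skills = [["python", "pandas"], ["java", "spring"], ["python", "numpy"]]
model = FastText(sentences=skills, vector_size=50, min_count=1, epochs=20)

# FastText composes vectors from character n-grams, so a misspelling like
# "pyton" still gets a vector close to "python"
print(model.wv.most_similar("pyton", topn=3))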
Category: Data Science

Comparing the similarity structure of 2 distance matrices (computed from sentence embedding)

I apologize if this question lacks clarity; my mathematical background on the topic is limited and I was hoping to find some guidance. I would like to compare 2 distance matrices that contain pair-wise semantic (cosine) similarities for a set of 33 sentences. The matrices were created from sentence embeddings, i.e., embeddings of full sentences in a vector space (I used Google's Universal Sentence Encoder, so the vector space has 512 dimensions). The sets of sentences that underlie the 2 distance …
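One way to quantify the agreement is to correlate the unique pairwise entries of the two matrices, as in a Mantel-style test; a sketch with placeholder matrices:

import numpy as np
from scipy.stats import spearmanr

# D1, D2: 33 x 33 pairwise similarity matrices (random placeholders here)
D1, D2 = np.random.rand(33, 33), np.random.rand(33, 33)

iu = np.triu_indices(33, k=1)      # upper triangle: each pair counted once
rho, p = spearmanr(D1[iu], D2[iu])
# Note: matrix entries are not independent, so this p-value is not valid
# as-is; a Mantel test gets one by permuting rows/columns jointly.
print(rho)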
Category: Data Science

Applicability of relative similarity computation

I've computed the cosine similarity between a & b (= x) and between b & c (= y). I could use the same embeddings to compute the similarity between a and c (say it equals z), but I'm in a situation where I have only the similarity measures x and y. How can I find the similarity between a & c without the original embeddings? If I use a plane to represent this, then I will have an infinite number of solutions. Are there any approaches which …
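Without the embeddings, x and y only constrain cos(a, c) to an interval, matching the "infinite number of solutions" observation; the bound follows from the triangle inequality on angles between vectors:

import numpy as np

def cos_ac_range(x, y):
    # With cos(a,b) = x and cos(b,c) = y, the angle triangle inequality
    # bounds cos(a,c) to [x*y - s, x*y + s], s = sqrt(1-x^2)*sqrt(1-y^2).
    # No single value is recoverable from x and y alone.
    s = np.sqrt(1 - x**2) * np.sqrt(1 - y**2)
    return x * y - s, x * y + s

print(cos_ac_range(0.9, 0.8))   # e.g. x = 0.9, y = 0.8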
Category: Data Science

BERT Optimization for Production

I'm using BERT to transform text into a 768-dimensional vector, and it's multilingual:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

Now I want to put the model into production, but the embedding time is too long and I want to optimize the model to reduce it. What libraries would enable me to do this?
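One option among several (ONNX Runtime export and distillation to a smaller model are others) is dynamic int8 quantization for CPU inference; a sketch:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Quantize the Linear layers of the underlying transformer to int8
model[0].auto_model = torch.quantization.quantize_dynamic(
    model[0].auto_model, {torch.nn.Linear}, dtype=torch.qint8
)
emb = model.encode(["a test sentence"])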
Category: Data Science

Cosine vs Manhattan for Text Similarity

I'm storing sentences in Elasticsearch as a dense_vector field and used BERT for the embedding, so each vector is 768-dimensional. Elasticsearch offers similarity function options such as Euclidean, Manhattan, and cosine similarity. I have tried them, and both Manhattan and cosine give me very similar, good results; now I don't know which one I should choose.
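One relevant fact: on L2-normalized vectors, Euclidean distance is an exact monotone function of cosine similarity, so those two rank results identically; Manhattan has no such identity but often ranks similarly, which may explain the observation. A small check with placeholder vectors:

import numpy as np

a, b = np.random.randn(768), np.random.randn(768)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2-normalize

cosine = a @ b
euclidean_sq = ((a - b) ** 2).sum()   # equals 2 - 2*cosine exactly
manhattan = np.abs(a - b).sum()       # no such identity, but often ranks alike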
Category: Data Science
