Train a spaCy model for semantic similarity

I'm attempting to train a spaCy model to compute semantic similarity, but I'm not getting the results I would anticipate. I have created two text files that contain many sentences using a new term, "PROJ123456", for example, "PROJ123456 is on track." I've added each to a DocBin and saved them to disk as train.spacy and dev.spacy. I'm then running:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The config.cfg file contains: [paths] train …
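For reference, a minimal sketch of how such a DocBin file can be built (the sentence list here is a placeholder):

import spacy
from spacy.tokens import DocBin

# Placeholder training sentences that use the new term
sentences = ["PROJ123456 is on track.", "PROJ123456 slipped by a week."]

nlp = spacy.blank("en")            # tokenizer-only pipeline
db = DocBin()
for text in sentences:
    db.add(nlp.make_doc(text))     # add each Doc to the DocBin
db.to_disk("./train.spacy")        # same procedure for dev.spacy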
Category: Data Science

Cluster words into groups of similar meaning (synonyms)

How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which works well but not perfectly: because the word embeddings are based on surrounding words, they introduce some problematic results. For example, polar meanings: word embeddings might find opposites to be similar. Even though such words mean the opposite semantically, they can quite readily be interchanged between the same preceding and following words. For example, "terrible" and …
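One common approach, sketched below under the assumption that gensim's downloader and scikit-learn are available, is to cluster the pre-trained vectors directly; note this inherits the antonym problem described above:

import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("word2vec-google-news-300")        # pre-trained Google News vectors
words = ["good", "great", "terrible", "awful", "car", "truck"]  # placeholder vocabulary
vectors = [wv[w] for w in words]

kmeans = KMeans(n_clusters=3, n_init=10).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(label, word)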
Category: Data Science

Which model is better able to understand that two sentences are talking about different things?

I'm currently working on the task of measuring semantic proximity between sentences. I use fasttext train_unsupervised (skipgram) for this: I extract the sentence embeddings and then measure the cosine similarity between them. However, I ran into the following problem: the cosine similarity between the embeddings of the sentences "Create a documentation of product A" and "he is creating a documentation of product B" is very high (>0.9). Obviously this is because both of them are about creating documentation; however, the first …
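For concreteness, a sketch of the described setup (the corpus path is a placeholder):

import numpy as np
import fasttext

# Train skipgram embeddings on a plain-text corpus file
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")

v1 = model.get_sentence_vector("Create a documentation of product A")
v2 = model.get_sentence_vector("he is creating a documentation of product B")
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)   # reportedly > 0.9 for sentence pairs like this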
Category: Data Science

Semantic Search

There is a problem we are trying to solve where we want to do semantic search on our set of data, i.e., we have domain-specific data (for example, sentences talking about automobiles). Our data is just a bunch of sentences, and what we want is to give a phrase and get back the sentences which:

are similar to that phrase;
have a part that is similar to the phrase;
have contextually similar meanings.

Let …
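A minimal sketch of this kind of retrieval, assuming the sentence-transformers library; the model choice, corpus, and query are placeholders:

from sentence_transformers import SentenceTransformer, util

corpus = ["The sedan has great fuel economy.", "This truck tows five tons."]
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("fuel efficient car", convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query phrase
hits = util.semantic_search(query_emb, corpus_emb, top_k=5)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])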
Category: Data Science

Semantic network using word2vec

I have thousands of headlines and I would like to build a semantic network using word2vec, specifically the Google News vectors. My sentences look like the following titles:

Dogs are humans’ best friends
A dog died because of an accident
You can clean dogs’ paws using natural products
A cat was found in the kitchen

And so on. What I would like to do is find some specific patterns within this data, e.g. similarity in topics on dogs and cats, using semantic networks. …
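One possible sketch, assuming gensim's downloader and networkx, that links headline terms whose vectors are close; the term list and threshold are placeholders:

import gensim.downloader as api
import networkx as nx

wv = api.load("word2vec-google-news-300")
terms = ["dog", "cat", "puppy", "kitchen", "accident"]  # terms extracted from headlines

# Connect terms whose cosine similarity clears an arbitrary threshold
g = nx.Graph()
for i, a in enumerate(terms):
    for b in terms[i + 1:]:
        sim = wv.similarity(a, b)
        if sim > 0.4:
            g.add_edge(a, b, weight=float(sim))
print(g.edges(data=True))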
Category: Data Science

Siamese networks vs Semantic similarity (maybe gensim)

I am trying to understand Siamese networks. Here, a vector is calculated for an object (say, an image) by a neural network, and a distance metric (say, Manhattan) is applied to the two vectors produced by the network(s). In the tutorials available on the internet, the idea is applied mostly to images. If I compare it with Gensim semantic similarity, there we also have vectors for two objects (words or sentences) and then compute cosine similarity to measure the difference. (remember example …
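In both cases the comparison step has the same shape: two vectors in, one score out. A tiny illustration with placeholder vectors:

import numpy as np

# Two embedding vectors, regardless of whether they came from a Siamese
# network (images) or from gensim (words/sentences)
u, v = np.random.randn(128), np.random.randn(128)

manhattan = np.abs(u - v).sum()                            # distance: lower = closer
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))   # similarity: higher = closer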
Category: Data Science

Weighting Sentence Similarity by salience or frequency

It seems like the new standard in text search is sentence or document similarity, using things like BERT sentence embeddings. However, these don't really have a way to consider the salience of sentences, which can make it hard to compare different searches. For example, when using concept embeddings I'd like to be able to score "Exam" <-> "Exam" as less important than "Diabetes" <-> "High blood sugar". But obviously, the former has a similarity score of 1. I've tried using …
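One heuristic, shown as an assumption rather than an established method, is to scale the raw similarity by an IDF-style salience weight; the corpus counts below are invented:

import math

# Invented corpus statistics: N documents, per-term document counts
N = 100_000
doc_freq = {"exam": 20_000, "diabetes": 900, "high blood sugar": 700}

def salience(term):
    # IDF as a crude salience proxy: rarer terms count as more informative
    return math.log(N / doc_freq[term])

def weighted_similarity(raw_sim, term_a, term_b):
    # Scale the embedding similarity by the less salient term's weight
    return raw_sim * min(salience(term_a), salience(term_b))

print(weighted_similarity(1.0, "exam", "exam"))                  # downweighted despite sim = 1
print(weighted_similarity(0.8, "diabetes", "high blood sugar"))  # boosted by salience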
Category: Data Science

Document Similarity with User Preference

To measure the similarity between two documents, one can use, e.g., TF-IDF/cosine similarity. Suppose that after calculating the similarity scores of Doc A against a list of documents (Doc B, Doc C, ...), we got:

Document Pair      Similarity Score
Doc A vs. Doc B    0.45
Doc A vs. Doc C    0.30
Doc A vs. ...      ...

Of course, Doc B seems to be the closest one to Doc A in terms of similarity. But what if users, as humans, think Doc …
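For reference, baseline scores like those above could come from a sketch like this (the document texts are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "A": "machine learning for documents",
    "B": "document similarity with machine learning",
    "C": "cooking recipes and kitchen tips",
}
tfidf = TfidfVectorizer().fit_transform(docs.values())
scores = cosine_similarity(tfidf[0:1], tfidf[1:])   # Doc A vs. the rest
print(dict(zip(list(docs)[1:], scores[0])))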
Category: Data Science

Should one-hot encoded categorical features be scaled when used along with text features while deriving semantic similarity?

My aim is to derive textual similarity using multiple features. Some of the features are textual, for which I am using the (TF Hub 2.0) Universal Sentence Encoder. The other features are categorical and are encoded with a one-hot encoder. For example, for a single record in my dataset, the feature vector looks like this: the text feature's embedding is a 512-dimensional vector (1 x 512); the categorical (non-ordered) feature vector is 1 x 500 (since there are 500 unique values in the feature); my …
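A minimal sketch of the concatenation, assuming the TF Hub USE v4 model; the rescaling of the one-hot block is one possible choice, not a rule:

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
text_vec = embed(["record's text feature"]).numpy()[0]   # shape (512,)

onehot = np.zeros(500)        # 500 unique categories
onehot[42] = 1.0              # placeholder category index

# One judgment call: rescale the categorical block so its norm matches the
# text block's, preventing either block from dominating distances
onehot = onehot * (np.linalg.norm(text_vec) / np.linalg.norm(onehot))
combined = np.concatenate([text_vec, onehot])   # shape (1012,)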
Category: Data Science

How to choose similarity measurement between sentences and paragraphs

Problems

1. How to find an appropriate measurement method. There are several ways to measure sentence similarity, but I have no idea how to choose an appropriate method among them for my data (sentences). Related question on Stack Overflow: is there a way to check similarity between two full sentences in python?

2. Sentence- or paragraph-based. If it is possible to acquire both a single sentence and a paragraph which includes that sentence, which is more accurate for measuring the similarity among …
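For problem 2, one way to compare the two granularities empirically is to score both against the same query, sketched here with sentence-transformers and placeholder texts:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentence = "The model compares short texts."
paragraph = ("The model compares short texts. It was trained on sentence pairs, "
             "so very long inputs get truncated.")

query = "How does the model compare texts?"
for candidate in (sentence, paragraph):
    sim = util.cos_sim(model.encode(query), model.encode(candidate))
    print(float(sim))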
Category: Data Science

NLP: Compare tags semantically with machine learning? (finding synonyms)

Let's say I have multiple tags that I need to compare semantically. For example:

tags = ['Python', 'Football', 'Programming', 'Handball', 'Chess', 'Cheese', 'board game']

I would like to compare these tags (and many more) semantically to find a similarity value between 0 and 1. For example, I want to get values like these:

f('Chess', 'Cheese') = 0.0      # tags look similar, but mean very different things
f('Chess', 'board game') = 0.9  # because chess is a board game
f('Football', 'Handball') = 0.3 …
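A sketch of one possible f using pre-trained word vectors via gensim's downloader; note that wv.similarity returns a cosine in [-1, 1], so it would need rescaling to [0, 1], and tags missing from the vocabulary raise KeyError:

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

def f(a, b):
    # Google News vectors use underscores for multi-word phrases
    key_a, key_b = a.replace(" ", "_"), b.replace(" ", "_")
    return wv.similarity(key_a, key_b)

print(f("Chess", "Cheese"))
print(f("Football", "Handball"))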
Category: Data Science

How to Calculate semantic similarity between video captions?

I intend to calculate the accuracy of a generated caption by comparing it to a number of reference sentences. For example, the captions for one video are as follows (all captions are for the same video; however, the reference sentences have been broken down with respect to different segments of the video). Reference sentences (R): A man is walking along while pushing his bicycle. He tries to balance himself by taking support from a pole. Then he falls on the …
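One option is to embed the candidate caption and all reference sentences and take the best-matching segment, sketched here with sentence-transformers (the candidate caption is a placeholder):

from sentence_transformers import SentenceTransformer, util

references = [
    "A man is walking along while pushing his bicycle.",
    "He tries to balance himself by taking support from a pole.",
]
candidate = "A man pushes a bike and holds a pole for balance."

model = SentenceTransformer("all-MiniLM-L6-v2")
sims = util.cos_sim(model.encode(candidate), model.encode(references))
print(float(sims.max()))   # similarity to the best-matching reference segment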
Category: Data Science

NLP: Checking that answers to a question are correct

Question answering is a common topic within NLP, but my problem is a little different: rather than answering a question, I have a question and an (open-ended) answer, and what I want to check is whether that answer is correct. For instance, given the question "Have you done X?", I would like to be able to say that "Yes, I have done X." is correct, and "Yes, I have done Y." is incorrect. Going a step further, this should …
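Plain similarity would score "done X" and "done Y" as close, so a natural-language-inference model may fit this better; a sketch with sentence-transformers' CrossEncoder (the model name is one published option; check its model card for the exact label order before relying on it):

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

premise = "Yes, I have done X."
for hypothesis in ("I have done X.", "I have done Y."):
    scores = model.predict([(premise, hypothesis)])
    print(hypothesis, scores)   # per-label logits (contradiction/entailment/neutral)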
Category: Data Science

Document Content

I have a set of .pdf/.docx documents with content. I need to search for the most suitable document given a particular sentence. For instance, for the sentence "Security in the work environment", the system should return the most appropriate document, one which contains at least the content expressed in the sentence. It should be a sort of search bar with advanced capabilities. I have a constraint: I cannot have an a priori classification, since the number of documents and the related category …
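Assuming the text has already been extracted from the .pdf/.docx files, a minimal ranking sketch with sentence-transformers:

from sentence_transformers import SentenceTransformer, util

# Placeholder extracted contents, keyed by filename
documents = {
    "doc1.pdf": "workplace safety policies and risk assessments",
    "doc2.docx": "vacation request and approval process",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(list(documents.values()), convert_to_tensor=True)
query_emb = model.encode("Security in the work environment", convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_emb)[0]
best = int(scores.argmax())
print(list(documents)[best], float(scores[best]))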
Category: Data Science

Is there a way to train Doc2Vec on a corpus of docs and be able to take a novel doc and see how similar it is to the trained corpus?

I have a project idea where I train Doc2Vec on a collection of documents, then take a novel input doc and ideally be told how similar it is to the training docs as a whole, or how well it "fits" with them. Is there a way to do this?
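The workflow described maps onto gensim's Doc2Vec API roughly as follows (a sketch with placeholder documents):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["first training document", "second training document"]
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=100, epochs=40, min_count=1)

novel = "a brand new document".split()
vec = model.infer_vector(novel)                 # embed the unseen doc
print(model.dv.most_similar([vec], topn=2))     # closest training docs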
Category: Data Science

How to handle words not in the dictionary (while finding similar words)?

I am doing a project on semantic text analysis where my data has a column of technical skills (so I have to train the data to find similar words); these are single words, not sentences. I wish to find similar technical skills when I pass a word. I am aware of using Word2Vec and GloVe. My issue is that if I pass, for example, "Pyton", which is actually "Python", then since the misspelled word is not in the trained words it will …
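One commonly suggested direction for misspellings is FastText, whose character n-grams give vectors even for out-of-vocabulary words; a sketch with placeholder skill lists:

from gensim.models import FastText

skills = [["python", "pandas"], ["java", "spring"], ["python", "numpy"]]
model = FastText(sentences=skills, vector_size=50, min_count=1, epochs=20)

# FastText composes vectors from character n-grams, so a misspelling like
# "pyton" still gets a vector close to "python"
print(model.wv.most_similar("pyton", topn=3))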
Category: Data Science

Comparing the similarity structure of 2 distance matrices (computed from sentence embedding)

I apologize if this question lacks clarity; my mathematical background on the topic is limited and I was hoping to find some guidance. I would like to compare 2 distance matrices that contain pair-wise semantic (cosine) similarities for a set of 33 sentences. The matrices were created from sentence embeddings, i.e., embeddings of full sentences in a vector space (I used Google's Universal Sentence Encoder, so the vector space has 512 dimensions). The sets of sentences that underlie the 2 distance …
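One way to quantify the agreement is to correlate the unique pairwise entries of the two matrices, as in a Mantel-style test; a sketch with placeholder matrices:

import numpy as np
from scipy.stats import spearmanr

# D1, D2: 33 x 33 pairwise similarity matrices (random placeholders here)
D1, D2 = np.random.rand(33, 33), np.random.rand(33, 33)

iu = np.triu_indices(33, k=1)      # upper triangle: each pair counted once
rho, p = spearmanr(D1[iu], D2[iu])
# Note: matrix entries are not independent, so this p-value is not valid
# as-is; a Mantel test gets one by permuting rows/columns jointly.
print(rho)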
Category: Data Science

Applicability of relative similarity computation

I've computed the cosine similarity between a & b (= x) and between b & c (= y). I could use the same embeddings to compute the similarity between a and c (say it equals z), but I'm in a situation where I have only the similarity measures x and y. How can I find the similarity between a & c without the original embeddings? If I use a plane to represent this, then I will have an infinite number of solutions. Are there any approaches which …
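Without the embeddings, x and y only constrain cos(a, c) to an interval, matching the "infinite number of solutions" observation; the bound follows from the triangle inequality on angles between vectors:

import numpy as np

def cos_ac_range(x, y):
    # With cos(a,b) = x and cos(b,c) = y, the angle triangle inequality
    # bounds cos(a,c) to [x*y - s, x*y + s], s = sqrt(1-x^2)*sqrt(1-y^2).
    # No single value is recoverable from x and y alone.
    s = np.sqrt(1 - x**2) * np.sqrt(1 - y**2)
    return x * y - s, x * y + s

print(cos_ac_range(0.9, 0.8))   # e.g. x = 0.9, y = 0.8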
Category: Data Science

BERT Optimization for Production

I'm using BERT to transform text into a 768-dimensional vector, and it's multilingual:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

Now I want to put the model into production, but the embedding time is too long and I want to optimize the model to reduce it. What libraries would enable me to do this?
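One option among several (ONNX Runtime export and distillation to a smaller model are others) is dynamic int8 quantization for CPU inference; a sketch:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Quantize the Linear layers of the underlying transformer to int8
model[0].auto_model = torch.quantization.quantize_dynamic(
    model[0].auto_model, {torch.nn.Linear}, dtype=torch.qint8
)
emb = model.encode(["a test sentence"])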
Category: Data Science

Cosine vs Manhattan for Text Similarity

I'm storing sentences in Elasticsearch as a dense_vector field and used BERT for the embedding, so each vector is 768-dimensional. Elasticsearch offers similarity function options such as Euclidean, Manhattan, and cosine similarity. I have tried them, and both Manhattan and cosine give me very similar, good results; now I don't know which one I should choose.
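One relevant fact: on L2-normalized vectors, Euclidean distance is an exact monotone function of cosine similarity, so those two rank results identically; Manhattan has no such identity but often ranks similarly, which may explain the observation. A small check with placeholder vectors:

import numpy as np

a, b = np.random.randn(768), np.random.randn(768)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2-normalize

cosine = a @ b
euclidean_sq = ((a - b) ** 2).sum()   # equals 2 - 2*cosine exactly
manhattan = np.abs(a - b).sum()       # no such identity, but often ranks alike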
Category: Data Science
