similarity

Search one 2D distribution for the point cluster most similar to another 2D distribution

duhaime

June 4, 2022 09:55

Given a hand drawn constellation (2d distribution of points) and a map of all stars, how would you find the actual star distribution most similar to the drawn distribution? If it's helpful, suppose we can define some maximum allowable threshold of distortion (e.g. a maximum Kolmogorov-Smirnov distance) and we want to find one or more distributions of stars that match the hand-drawn distribution. I keep getting hung up on the fact that the hand-drawn constellation has no notion of scale …

Topic: pattern-recognition distribution distance similarity

Category: Data Science

How to cluster time series of ordered data?

Rúben

June 1, 2022 07:25

There are a few hundred time series of a large set of different locations (irregularly distributed) with the following properties: ordered factor (5 levels); between 5 and 25 observations per series; lots of missing values within each series; temporal and spatial autocorrelation (unknown); temporal frequency. The objective is to spatially cluster the time series based on their similarity (of observed value per point in time). What would be adequate methods? The analysis will be carried out in R.

Topic: geospatial time-series similarity r

Category: Data Science

Building a graph out of a large text corpus

kevin_was_here

May 28, 2022 10:19

I'm given a large number of documents on which I should perform various kinds of analysis. Since the documents are to be used as the foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node. One way to build a graph would be to use models such as USE to first find text embeddings, and then form a link between two nodes (texts) whose similarity is beyond …

Topic: similar-documents graphs text-mining nlp similarity

Category: Data Science

Assigning a new document to a cluster based on keywords extracted and tf-idf

Kami

May 27, 2022 05:05

I have about 40 clusters of documents defined by a combination of the k-means clustering algorithm and hand curation. For example, some of the clusters given by k-means are too noisy, so they have been further subdivided. Now I want to assign new documents to these clusters. I found that it is possible to extract keywords using tf-idf based methods as mentioned here. My approach is to extract key terms from each of these clusters using a tf-idf based method and I …

Topic: tfidf similarity clustering

Category: Data Science

What is the logic/algorithm behind 'did you mean' suggestions in search engines and command suggestions in command-line tools like git?

jarvis

May 19, 2022 14:47

For example, https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work describes the logic behind Google's "did you mean" algorithm, which is used for spelling-correction suggestions. What algorithm do other search applications use for spelling correction or for finding similar text, e.g. a music/OTT search app such as Amazon Music? Similarly, what logic is used for git commands? How does one usually work backwards from an application's behavior to the algorithm behind it? Any general ideas will also …

Topic: text nlp similarity search

Category: Data Science

Which string distance equation for fuzzy-matching person names is reliable?

Canovice

May 17, 2022 12:29

A reproducible example with a small bit of R code is available in this Stack Overflow post (linked so I don't need to retype the code). The fuzzytext library in R has the following available string methods: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). Our use case is matching (left-joining) basketball player names from 2 different sources. From the Stack Overflow post, we have the following concerns to account for when string matching names: The left join shouldn't …

Topic: jaccard-coefficient similarity r

Category: Data Science

When to use cosine similarity over Euclidean similarity

Logan

May 16, 2022 15:54

In NLP, people tend to use cosine similarity to measure document/text distances. I want to hear what people think of the following two scenarios, and which to pick: cosine similarity or Euclidean? Overview of the task set: The task is to compute context similarities of multi-word expressions. For example, suppose we were given an MWE of put up; context refers to the words on the left side of put up as well as the words on the right side …

Topic: nlp similarity clustering machine-learning

Category: Data Science
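For the cosine-versus-Euclidean question above, one relationship is worth keeping in mind: once vectors are L2-normalized, squared Euclidean distance equals 2 * (1 - cosine similarity), so the two measures rank neighbours identically. On raw vectors they can disagree, because Euclidean distance is sensitive to magnitude. A minimal Python sketch follows; the count vectors and helper names are illustrative assumptions, not taken from the question.

```python
import math

def norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector magnitude."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2_normalize(v):
    """Scale v to unit length."""
    n = norm(v)
    return [x / n for x in v]

# Hypothetical context-word count vectors: b points the same way as a
# but has twice the counts (e.g. the same context seen twice as often).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

cos_ab = cosine_similarity(a, b)    # same direction, so cosine is maximal
dist_ab = euclidean_distance(a, b)  # nonzero, because the magnitudes differ
dist_unit = euclidean_distance(l2_normalize(a), l2_normalize(b))  # ~0 after normalizing
```

If the magnitude of the context vectors carries no meaning for the task (for instance, it only reflects how frequent an expression is), normalizing first or using cosine directly avoids penalizing frequent expressions.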

High Performance Classification or Similarity Algorithm for Mixed Data Types?

CyberBully2003

May 13, 2022 14:23

I have a database holding 10-ish features that describe different breeds of dogs. They are mostly categorical features, but some provide ranges for values. Here's a demo representation of the database, showing the mixture:

|Breed|Min_Height|Max_Height|Min_Weight|Max_Weight|sub_cat|is_friendly|
|---------------------------------------------------------------------|
|Dober|20 |20 |40 |52 |sport |FALSE |
|Pood |15 |25 |35 |45 |water |TRUE |
...

As you can see, the data is mixed and the ranges have some overlap from entry to entry. Say I receive an input of:

|height|weight|sub_cat|is_friendly|
|---------------------------------|
|16 |43 …

Topic: supervised-learning classification python similarity clustering

Category: Data Science

In what ways can I find two similar sets of customers using KNN?

Maths12

May 12, 2022 14:28

I have a study where I want to find users similar to a set of users (SEED). My data looks like a pivot by customer, e.g. a sample of SEED looks like this (note I drop cust_id):

cust_id | spend_food | spend_nike | spend_harrods |
1 | 145 | 45 | 32 |
2 | 85 | 89 | 0 |
4 | 23 | 67 | 1900 |
5 | 84 | 12 | 900 |

So to find users similar …

Topic: k-nn cosine-distance similarity recommender-system machine-learning

Category: Data Science

Similarity Matching Algorithm

Keith

May 8, 2022 11:06

I am looking for help on identifying a class of algorithm. If I have a tabular training and test set I want to know the similarity of rows based on some numeric features. The training data would be labelled such that rows would be paired (or even grouped). The output for each row in the test/prediction set would be the most similar row and the probability that it would have been paired with that row. In theory there could be …

Topic: similarity machine-learning

Category: Data Science

Word similarity considering special characteristics

ahs312

May 5, 2022 08:36

I'm looking for an algorithm that computes the similarity between two strings, just like the Levenshtein distance. However, I want to consider the following. The Levenshtein distance gives me the same result for these cases:

distance("apple", "appli") #1
distance("apple", "appel") #1
distance("apple", "applr") #1

However, I want the second and third examples to have a smaller distance, for the following reasons:

second example: all the correct letters are used in the second word
third example: r is much likely to …

Topic: nlp similarity

Category: Data Science
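One way to approach the word-similarity question above is a weighted edit distance: an optimal-string-alignment variant of Levenshtein in which swapping two adjacent letters and substituting a neighbouring key both cost less than a full edit. The sketch below is only an illustration under stated assumptions. The excerpt is truncated, so reading the third example as a keyboard-adjacency slip is a guess; the QWERTY layout, the horizontal-neighbours-only rule, and the 0.5 costs (`near_cost`, `swap_cost`) are all hypothetical choices.

```python
# Rows of a QWERTY keyboard. Only horizontal neighbours count as "near"
# (a simplifying assumption made for this sketch).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def _build_adjacent_pairs():
    """Return the set of ordered key pairs that sit side by side in a row."""
    pairs = set()
    for row in QWERTY_ROWS:
        for left, right in zip(row, row[1:]):
            pairs.add((left, right))
            pairs.add((right, left))
    return pairs

ADJACENT_KEYS = _build_adjacent_pairs()

def weighted_distance(s, t, near_cost=0.5, swap_cost=0.5):
    """Optimal-string-alignment distance with cheaper near-key
    substitutions and adjacent-letter swaps (hypothetical weights)."""
    n, m = len(s), len(t)
    # d[i][j] = cost of turning s[:i] into t[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)  # delete every character of s[:i]
    for j in range(1, m + 1):
        d[0][j] = float(j)  # insert every character of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = s[i - 1], t[j - 1]
            if a == b:
                sub = 0.0
            elif (a, b) in ADJACENT_KEYS:
                sub = near_cost
            else:
                sub = 1.0
            best = min(
                d[i - 1][j] + 1.0,      # deletion
                d[i][j - 1] + 1.0,      # insertion
                d[i - 1][j - 1] + sub,  # match or substitution
            )
            # Two adjacent letters swapped, e.g. "le" versus "el"
            if i > 1 and j > 1 and a != b and a == t[j - 2] and s[i - 2] == b:
                best = min(best, d[i - 2][j - 2] + swap_cost)
            d[i][j] = best
    return d[n][m]
```

With these weights, `weighted_distance("apple", "appli")` stays at 1.0, while `"appel"` (a swap of adjacent letters) and `"applr"` (r sits next to e on a QWERTY row) both come out at 0.5. The costs would need tuning against real typing errors.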

Semantic Search

Farhaan Bukhsh

May 1, 2022 17:03

We are trying to solve a problem where we want to do semantic search on our data set, i.e. we have domain-specific data (for example, sentences talking about automobiles). Our data is just a bunch of sentences, and what we want is to give a phrase and get back the sentences that: are similar to that phrase; have a part that is similar to the phrase; have contextually similar meanings. Let …

Topic: semantic-similarity similar-documents unsupervised-learning word-embeddings similarity

Category: Data Science

Deep Learning - Find most similar images - Triplets vs Pairs

Outcast

May 1, 2022 10:01

I am working with Python, scikit-learn, Keras and with 450x540 RGB images of front-faced watches (e.g. Watch_1, Watch_2). My aim is to run an autoencoder or a Siamese Neural Network to find the most similar watches among them. However, I am not sure if I will get better results by comparing pairs of images or triplets of images. As it is defined in this research paper, triplets of images consist of one target image, one image which is (more) similar to …

Topic: deep-learning neural-network python similarity

Category: Data Science

Image similarity: Similarity of mixed vector

Gegenwind

April 28, 2022 08:03

In order to identify the similarity between images (products) I want to use a neural network approach similar to TiefVision. This pre-trained neural network basically translates the images into feature vectors and then creates a similarity measure between the images using a distance measure between the vectors. To make it more tangible, have a look at the 2D visual representation below. I want to take it one step further: When a single user "likes" multiple images, I want …

Topic: image-recognition neural-network similarity

Category: Data Science

How do I calculate a similarity matrix with a Student-t kernel?

BioMatt

April 27, 2022 08:07

As the title says, how do I calculate a similarity matrix with an un-normalized Student-t kernel? I'm attempting to calculate Kullback-Leibler divergence for different t-SNE runs, but need a Q-matrix for that. A few steps before the Q-matrix, I need the similarity matrices made using the un-normalized Student-t kernel. I'm using R; not sure if that's relevant to an answer.

Topic: tsne visualization dataset similarity r

Category: Data Science
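For the Student-t kernel question above: in the standard t-SNE formulation, the un-normalized similarity between embedded points y_i and y_j is w_ij = (1 + ||y_i - y_j||^2)^(-1) with w_ii = 0, and the Q-matrix is W divided by the sum of all its entries. A minimal sketch follows, written in Python with NumPy rather than R (the arithmetic carries over directly). The function names and the three-point embedding are illustrative assumptions.

```python
import numpy as np

def student_t_similarity(Y):
    """Un-normalized Student-t (one degree of freedom) similarities
    between the rows of a low-dimensional embedding Y."""
    Y = np.asarray(Y, dtype=float)
    sq_norms = np.sum(Y ** 2, axis=1)
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Y @ Y.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from rounding
    W = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(W, 0.0)              # t-SNE sets self-similarity to zero
    return W

def q_matrix(Y):
    """Normalize the kernel so all entries sum to 1 (the t-SNE Q-matrix)."""
    W = student_t_similarity(Y)
    return W / W.sum()

# Hypothetical 2D embedding of three points
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
W = student_t_similarity(Y)
Q = q_matrix(Y)
```

The Kullback-Leibler divergence for a run would then compare the high-dimensional P-matrix against `Q` over the off-diagonal entries.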

How to build a symmetric similarity model on top of embeddings?

Uri

April 26, 2022 18:06

I have two equal length vectors that come out of two identical embedding layers. I want to calculate their similarity, and I don't trust the embedding layer enough to just use dot product (e.g. it's plausible that different coordinates are dependent wrt overall similarity). I want to learn this using examples of good and bad pairs, without actually learning the initial embedding. What I'd like to do is to somehow combine the two vectors using another layer, and then connect …

Topic: keras word-embeddings similarity

Category: Data Science

Learning similarity of representations

user10283726

April 20, 2022 19:02

I am interested in a framework for learning the similarity of different input representations based on some common context. I have looked into word2vec, SVD and other recommender systems, which do more or less what I want. I want to know if anyone here has any experience with or resources on a more generalized version of this, where I am able to feed in representations of different objects and learn how similar they are. For example: Say we have some customers …

Topic: word2vec deep-learning similarity recommender-system

Category: Data Science

How can we perform STS (Semantic Textual Similarity) on an unsupervised dataset using deep learning?

Vishal Singh

April 19, 2022 20:01

How do you implement STS(Semantic Textual Similarity) on an unlabelled dataset? The dataset column contains Unique_id, text1 (contains paragraph), and text2 (contains paragraph). Ex: Column representation: Unique_id | Text1 | Text2 Unique_id 0 Text1 public show for Reynolds suspension of his coaching licence. portrait Sir Joshua Reynolds portrait of omai will get a public airing following fears it would stay hidden because of an export wrangle. Text2 then requested to do so by Spain's anti-violence commission. The fine was far …

Topic: unsupervised-learning deep-learning nlp similarity

Category: Data Science

Recommender system that connects users with each other: should I go for content-based or collaborative filtering?

Piyush Singhal

April 19, 2022 19:02

I am trying to build a system where a user comes onto the platform and chooses a topic (from a few predefined topics), and then we connect him with a random online user who chose the same topic. They can then have a conversation. Now I am trying to connect them smartly based on the user's historical data (users with whom he was matched earlier, along with the duration of their conversations, the rating after each conversation, etc.) and his basic profile data. How can …

Topic: similarity recommender-system statistics clustering machine-learning

Category: Data Science

Is there an algorithm or NN to match two documents that are not closely similar?

Yuriy P

April 19, 2022 06:07

Is there an algorithm or NN to match two documents? One is a claim description (e.g. a CV or product offer) and the other is a requirements description (e.g. a vacancy description or RFP). They are not similar, so basically it's not document similarity per se. Which embedding is better to use on the document corpus (Doc2vec, Word2vec, or just TF-IDF, etc.), and what kind of further NN architecture would work to basically find a matching scores vector/matrix as output on how …

Topic: deep-learning text-mining neural-network similarity machine-learning

Category: Data Science
