Search one 2D distribution for the point cluster most similar to another 2D distribution

Given a hand-drawn constellation (a 2D distribution of points) and a map of all stars, how would you find the actual star distribution most similar to the drawn one? If it helps, suppose we can define a maximum allowable distortion threshold (e.g. a maximum Kolmogorov-Smirnov distance) and we want to find one or more distributions of stars that match the hand-drawn distribution. I keep getting hung up on the fact that the hand-drawn constellation has no notion of scale …
Category: Data Science
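
One common way to sidestep the missing notion of scale is to normalize both point sets before comparing them. The sketch below is only an illustration under strong assumptions: the function name `normalize_points` and the toy coordinates are made up, the two sets are assumed to list corresponding points in the same order, and rotation is not handled.

```python
import math

def normalize_points(points):
    """Center a 2D point set on its centroid and rescale it to unit RMS radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    rms = math.sqrt(sum(x * x + y * y for x, y in centered) / n)
    if rms == 0:
        raise ValueError("all points coincide; scale is undefined")
    return [(x / rms, y / rms) for x, y in centered]

# Hypothetical data: the "stars" are the drawing scaled by 3 and shifted.
drawn = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
stars = [(10.0, 10.0), (16.0, 10.0), (10.0, 16.0)]

a = normalize_points(drawn)
b = normalize_points(stars)

# Largest distance between corresponding points after normalization.
max_gap = max(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b))
print(max_gap)
```

After normalization the two toy shapes coincide up to floating-point error, so any distance threshold can be expressed in scale-free units. Finding which stars correspond to which drawn points is a separate search problem that this sketch does not address.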

How to cluster time series of ordered data?

There are a few hundred time series of a large set of different locations (irregularly distributed) with the following properties: ordered factor (5 levels); between 5 and 25 observations per series; lots of missing values within each series; temporal and spatial autocorrelation (unknown) temporal frequency. The objective is to spatially cluster the time series based on their similarity (of observed value per point in time). What would be adequate methods? The analysis will be carried out in R.
Category: Data Science

Building a graph out of a large text corpus

I'm given a large number of documents on which I should perform various kinds of analysis. Since the documents are to be used as the foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node. One way to build a graph would be to use models such as USE to first find text embeddings, and then form a link between two nodes (texts) whose similarity is beyond …
Category: Data Science
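
The thresholding idea in the excerpt above can be sketched as follows. The embeddings here are made-up toy vectors standing in for whatever an embedding model would return, and `build_edges` and the threshold value are hypothetical names and numbers chosen for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_edges(embeddings, threshold):
    """Return (doc_i, doc_j, similarity) for every pair at or above the threshold."""
    edges = []
    for (i, u), (j, v) in combinations(embeddings.items(), 2):
        s = cosine(u, v)
        if s >= threshold:
            edges.append((i, j, s))
    return edges

# Toy embeddings (assumption: real ones would come from an embedding model).
embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
edges = build_edges(embeddings, threshold=0.8)
print(edges)
```

Comparing all pairs is quadratic in the number of documents, so for a large corpus an approximate nearest-neighbour index would likely be needed instead of this exhaustive loop.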

Assigning a new document to a cluster based on keywords extracted and tf-idf

I have about 40 clusters of documents defined by a combination of k-means clustering algorithm and hand curation. For example, some of the clusters given by k-means are too noisy so they have been further subdivided. Now I want to assign new documents to these clusters. I found that it is possible to extract keywords using tf-idf based methods as mentioned here. My approach is to extract key terms from each of these clusters using tf-idf based method and I …
Category: Data Science

What is the logic/algorithm behind 'did you mean' suggestion by search engines, command suggestion in command prompt like git?

For example, https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work explains the logic behind Google's "did you mean" algorithm, which is used for spell-correction suggestions. What algorithm do other search applications use for spell correction or for finding similar text, for example a music/OTT search app such as Amazon Music? Similarly, what logic is used for git command suggestions? How does one usually work backwards from an application's behaviour to the algorithm behind it? Any general ideas will also …
Category: Data Science

Which string distance equation for fuzzy-matching person names is reliable?

A reproducible example with a small bit of R code is available in this stackoverflow post (linked so I don't need to re-type the code). The fuzzytext library in R has the following available string methods: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). Our use case is matching (left-joining) basketball player names from 2 different sources. From the stackoverflow post, we have the following concerns to account for when string matching names: The left join shouldn't …
Category: Data Science

When to use cosine similarity over Euclidean similarity

In NLP, people tend to use cosine similarity to measure document/text distances. I want to hear what people think of the following two scenarios, and which to pick: cosine similarity or Euclidean? Overview of the task set: The task is to compute context similarities of multi-word expressions. For example, suppose we were given an MWE of put up; context refers to the words on the left side of put up as well as the words on the right side …
Category: Data Science
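
As background for the cosine-versus-Euclidean question above: for vectors scaled to unit length, the two measures carry the same information, because the squared Euclidean distance equals 2 * (1 - cosine similarity). The sketch below checks this on made-up vectors; the helper names are illustrative only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [3.0, 4.0]
v = [4.0, 3.0]

cos_uv = cosine_similarity(u, v)
dist_unit = euclidean(l2_normalize(u), l2_normalize(v))

# On unit vectors: squared Euclidean distance == 2 * (1 - cosine similarity).
print(dist_unit ** 2, 2 * (1 - cos_uv))

# Without normalization the measures diverge: scaling a vector leaves the
# cosine unchanged but increases the Euclidean distance.
scaled = [10 * x for x in u]
print(cosine_similarity(u, scaled), euclidean(u, scaled))
```

In practice this means the choice mostly matters when vector length carries meaning (for example raw counts), since Euclidean distance is sensitive to magnitude and cosine similarity is not.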

High Performance Classification or Similarity Algorithm for Mixed Data Types?

I have a database holding 10-ish features that describe different breeds of dogs. They are mostly categorical features, but some provide ranges for values. Here's a demo representation of the database, showing the mixture:

|Breed|Min_Height|Max_Height|Min_Weight|Max_Weight|sub_cat|is_friendly|
|-----|----------|----------|----------|----------|-------|-----------|
|Dober|20        |20        |40        |52        |sport  |FALSE      |
|Pood |15        |25        |35        |45        |water  |TRUE       |
...

As you can see, the data is mixed and the ranges have some overlap from entry to entry. Say I receive an input of:

|height|weight|sub_cat|is_friendly|
|------|------|-------|-----------|
|16    |43 …
Category: Data Science

In what ways can I find two similar sets of customers using KNN?

I have a study where I want to find users similar to a set of users (SEED). My data looks like a pivot by customer, e.g. a sample of SEED looks like this (note I drop cust_id):

cust_id | spend_food | spend_nike | spend_harrods |
1       | 145        | 45         | 32            |
2       | 85         | 89         | 0             |
4       | 23         | 67         | 1900          |
5       | 84         | 12         | 900           |

So to find users similar …
Category: Data Science

Similarity Matching Algorithm

I am looking for help on identifying a class of algorithm. If I have a tabular training and test set I want to know the similarity of rows based on some numeric features. The training data would be labelled such that rows would be paired (or even grouped). The output for each row in the test/prediction set would be the most similar row and the probability that it would have been paired with that row. In theory there could be …
Category: Data Science

Word similarity considering special characteristics

I'm looking for an algorithm that computes the similarity between two strings just like the levenshtein distance. However, I want to consider the following. The levenshtein distance gives me the same for these cases:

distance("apple", "appli") #1
distance("apple", "appel") #1
distance("apple", "applr") #1

However, I want the second and third example to have a smaller distance because of the following reasons:

second example: all the correct letters are used in the second word
third example: r is much likely to …
Topic: nlp similarity
Category: Data Science
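
A note on the distances quoted above: a value of 1 for "apple" versus "appel" matches a transposition-aware variant such as optimal string alignment (OSA), whereas strict Levenshtein counts the swap as two substitutions. The sketch below is a plain dynamic-programming version with a made-up `allow_transposition` flag to show the difference; it does not implement the letter-reuse or typo-likelihood weighting the question asks about.

```python
def edit_distance(a, b, allow_transposition=True):
    """Edit distance by dynamic programming.

    With allow_transposition=True this is the optimal string alignment (OSA)
    distance, where swapping two adjacent characters costs 1.
    With allow_transposition=False it is the plain Levenshtein distance.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (allow_transposition and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

for other in ("appli", "appel", "applr"):
    print(other,
          edit_distance("apple", other),
          edit_distance("apple", other, allow_transposition=False))
```

Making some substitutions cheaper than others would mean replacing the constant `cost` with a lookup into a substitution-cost table, which is an assumption about one possible design rather than something stated in the question.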

Semantic Search

There is a problem we are trying to solve where we want to do semantic search on our set of data, i.e. we have domain-specific data (example: sentences talking about automobiles). Our data is just a bunch of sentences, and what we want is to give a phrase and get back the sentences which are:

Similar to that phrase
Has a part of sentence that is similar to the phrase
Sentence which is having contextually similar meanings

Let …
Category: Data Science

Deep Learning - Find most similar images - Triplets vs Pairs

I am working with Python, scikit-learn, keras and with 450x540 rgb images of front-faced watches (e.g. Watch_1, Watch_2). My aim is to run an autoencoder or a Siamese Neural Network to find the most similar watches among them. However, I am not sure if I will get better results by comparing pairs of images or triplets of images. As it is defined in this research paper, triplets of images consist of one target image, one image which is (more) similar to …
Category: Data Science
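
To make the pairs-versus-triplets distinction above concrete, here is a minimal sketch of the two loss functions usually associated with them, applied to made-up 2D embeddings. The exact forms vary between papers (squared versus unsquared distances, a factor of 1/2), so treat these as one common variant rather than the definition used in the paper the question cites.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(x1, x2, is_similar, margin=1.0):
    """Pair-based loss: pull similar pairs together, push dissimilar
    pairs apart until they are at least `margin` away."""
    d = euclidean(x1, x2)
    if is_similar:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet-based loss: the anchor should be closer to the positive
    than to the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings of three watch images.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
near_negative = [0.5, 0.0]
far_negative = [2.0, 0.0]

print(contrastive_loss(anchor, positive, is_similar=True))
print(contrastive_loss(anchor, near_negative, is_similar=False))
print(triplet_loss(anchor, positive, near_negative))
print(triplet_loss(anchor, positive, far_negative))
```

The pair loss needs an absolute similar/dissimilar label for each pair, while the triplet loss only needs a relative ordering, which is often easier to obtain when similarity is a matter of degree.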

Image similarity: Similarity of mixed vector

In order to identify the similarity between images (products) I want to use a neural network approach similar to TiefVision. This pre-trained neural network is basically translating the images into feature vectors and then creating a similarity measure between the images using a distance measure between the vectors. To make it more tangible have a look at a 2D visual representation below. I want to take it one step further: When a single user "likes" multiple images, I want …
Category: Data Science

How do I calculate a similarity matrix with a Student-t kernel?

As the title says, how do I calculate a similarity matrix with an un-normalized Student-t kernel? I'm attempting to calculate Kullback-Leibler divergence for different t-SNE runs, but need a Q-matrix for that. A few steps before the Q-matrix, I need the similarity matrices made using the un-normalized Student-t kernel. I'm using R, not sure if that's relevant to an answer.
Category: Data Science
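
For reference, the un-normalized Student-t kernel used in t-SNE (one degree of freedom) assigns each pair of embedded points the weight 1 / (1 + squared distance), and the Q-matrix is those weights divided by their sum over all pairs with i != j. The sketch below shows this in plain Python on made-up 2D coordinates; the function names are illustrative, and the same arithmetic carries over directly to R.

```python
def student_t_similarities(points):
    """Un-normalized Student-t kernel: w[i][j] = 1 / (1 + ||y_i - y_j||^2),
    with the diagonal set to zero as in t-SNE."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = 1.0 / (1.0 + sq_dist)
    return w

def q_matrix(w):
    """Normalize the kernel weights so that all entries sum to 1."""
    total = sum(sum(row) for row in w)
    return [[value / total for value in row] for row in w]

# Hypothetical low-dimensional embedding of three points.
embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

w = student_t_similarities(embedding)
q = q_matrix(w)
print(w[0][1], w[0][2], w[1][2])
print(sum(sum(row) for row in q))
```

The Kullback-Leibler divergence for a run would then compare this Q against the high-dimensional P matrix, which has to come from the t-SNE implementation or be recomputed with the same perplexity.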

How to build a symmetric similarity model on top of embeddings?

I have two equal length vectors that come out of two identical embedding layers. I want to calculate their similarity, and I don't trust the embedding layer enough to just use dot product (e.g. it's plausible that different coordinates are dependent wrt overall similarity). I want to learn this using examples of good and bad pairs, without actually learning the initial embedding. What I'd like to do is to somehow combine the two vectors using another layer, and then connect …
Category: Data Science

Learning similarity of representations

I am interested in a framework for learning the similarity of different input representations based on some common context. I have looked into word2vec, SVD and other recommender systems, which do more or less what I want. I want to know if anyone here has any experience or resources on a more generalized version of this, where I am able to feed in representations of different objects, and learn how similar they are. For example: Say we have some customers …
Category: Data Science

How can we perform STS (Semantic Textual Similarity) on unsupervised dataset using deep learning?

How do you implement STS(Semantic Textual Similarity) on an unlabelled dataset? The dataset column contains Unique_id, text1 (contains paragraph), and text2 (contains paragraph). Ex: Column representation: Unique_id | Text1 | Text2 Unique_id 0 Text1 public show for Reynolds suspension of his coaching licence. portrait Sir Joshua Reynolds portrait of omai will get a public airing following fears it would stay hidden because of an export wrangle. Text2 then requested to do so by Spain's anti-violence commission. The fine was far …
Category: Data Science

Recommender system that connects users with each other: should I go for content-based or collaborative filtering?

I am trying to build a system where a user comes onto the platform and chooses a topic (from a few predefined topics), and then we connect them with a random online user who chose the same topic. Then they can have a conversation. Now, I am trying to connect them smartly based on the user's historical data (users with whom they were matched earlier, along with the duration of their conversation, the rating after the conversation, etc.) and their basic profile data. How can …
Category: Data Science

Is there an algorithm or NN to match two documents that are not closely similar?

Is there an algorithm or NN to match two documents? One is a claim description (e.g. a CV or product offer) and the other is a requirements description (e.g. a vacancy description or RFP). They are not similar, so basically it's not document similarity per se. Which embedding is better to use on the document corpus (Doc2vec, Word2vec, or just TF-IDF, etc.), and what kind of further NN architecture would work to basically find a matching scores vector/matrix as output on how …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.