Search one 2D distribution for point cluster most similar to another 2D distribution

Given a hand-drawn constellation (a 2D distribution of points) and a map of all stars, how would you find the actual star distribution most similar to the drawn distribution? If it's helpful, suppose we can define some maximum allowable threshold of distortion (e.g. a maximum Kolmogorov-Smirnov distance) and we want to find one or more distributions of stars that match the hand-drawn distribution. I keep getting hung up on the fact that the hand-drawn constellation has no notion of scale …
Category: Data Science

Date transformation for KNN

I have a data set with date features like 01/01/2019 and I would like to use KNN. However, I cannot find a transformation for dates that gives a meaningful distance for the last feature. For example:

f1 | 1  | 2 | 3  | 4 | 01/01/2019
f2 | 10 | 3 | 12 | 1 | 14/01/2019

Does anyone have any recommendations?
Category: Data Science
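
One possible transformation, sketched here rather than taken from the question, is to turn each date into a day count so that differences are measured in days, and then rescale it like any other numeric feature. The DD/MM/YYYY format is an assumption inferred from the value 14/01/2019, and the helper names `date_to_days` and `min_max_scale` are hypothetical.

```python
from datetime import datetime


def date_to_days(text, fmt="%d/%m/%Y"):
    """Convert a date string to its proleptic Gregorian ordinal (day 1 is 0001-01-01)."""
    return datetime.strptime(text, fmt).toordinal()


def min_max_scale(values):
    """Rescale values to [0, 1] so the date feature does not dominate the distance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# The two dates from the example rows (format assumed to be DD/MM/YYYY).
dates = ["01/01/2019", "14/01/2019"]
days = [date_to_days(d) for d in dates]
gap_in_days = days[1] - days[0]  # plain difference in days between the two rows
scaled = min_max_scale(days)     # date feature on the same 0..1 scale as other scaled features
```

A day count treats time as linear. If seasonality matters more than elapsed time, a cyclical encoding (for example sine and cosine of the day of year) may suit the data better.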

Estimating time to travel between two lat/longs

I'm trying to create an offline estimator for how long it would take to get from one lat/long to another. Two approaches I have come across are the Haversine distance and the Manhattan distance. What I'm thinking of doing is calculating both of them and then using the average between the two as the distance and then use some average speed to calculate time. Since this value will be used as an estimator for drivers in a city a straight …
Category: Data Science
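
The averaging idea described above could be sketched roughly as follows. This is an illustration under assumptions, not a validated estimator: the Manhattan distance is approximated as a north-south leg plus an east-west leg (which presumes a street grid aligned with the compass), and the default of 30 km/h is a hypothetical placeholder for the average speed.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """Grid-style distance: a north-south leg followed by an east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat2, lon1, lat2, lon2)
    return north_south + east_west


def estimate_minutes(lat1, lon1, lat2, lon2, avg_speed_kmh=30.0):
    """Average the two distances, then divide by an assumed average speed."""
    straight = haversine_km(lat1, lon1, lat2, lon2)
    grid = manhattan_km(lat1, lon1, lat2, lon2)
    blended_km = (straight + grid) / 2
    return blended_km / avg_speed_kmh * 60
```

The blended distance always lies between the straight-line and grid distances, so the assumed speed ends up carrying most of the estimation error and would need to be calibrated against real trips.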

How to estimate real distance between two detected objects in an image?

You may think this is a duplicate, but my situation is different than previously asked questions. The only information I have is the width and height of the bounding boxes of detected people. The dataset I'm working on has images captured in different environments (street, garden, mall, ...). In other words, there is no fixed object in all images I can use as scale. The angle at which each image is captured varies drastically from almost parallel to the ground …
Category: Data Science

Clustering time series based on monotonic similarity

Context: I am involved in the task of clustering 1500 time series of 500 observations each into a few clusters. The time series all share the same observed properties at different spatial locations, and respond to the same exogenous variables. However, the magnitude of the response is very different for each time series. For a reference time series $X$, I would like all series that look like $X^a$, for any $a > 0$, to be grouped in the same cluster. Tryouts …
Category: Data Science
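
One measure with the invariance asked for above is a rank-based (Spearman) distance, since $X$ and $X^a$ have identical ranks whenever $X$ is positive and $a > 0$. The sketch below is a suggestion, not something taken from the question; positivity of the series is an assumption, and the function names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for a constant series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_distance(a, b):
    """0 when the series are monotonically identical, 2 when exactly reversed."""
    return 1.0 - pearson(ranks(a), ranks(b))


x = [1.0, 3.0, 2.0, 5.0, 4.0]        # hypothetical positive series
x_cubed = [v ** 3 for v in x]        # X^a with a = 3
d_same = spearman_distance(x, x_cubed)
d_reversed = spearman_distance(x, [-v for v in x])
```

A pairwise matrix of such distances could then be passed to any clustering method that accepts precomputed dissimilarities.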

Given daily sequence of events with only event ID labels (alphanum strings), what algorithms can be used to detect sequences that are outliers?

For example, the data might be something like this:

Sequence 1: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
Sequence 2: ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"]
Sequence 3: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
...
Sequence N: ["DEF", "AAA", "ZZ123", "YYZZZ45", "AABBCC"]

Sequences 1 and 3 are the same, but sequences 2 and N are different. In the data set, there will be thousands of these sequences every day. Additional questions: How could I calculate a similarity (or difference) measure between sequences with sequences of …
Category: Data Science

Siamese vs matching network for correct image category matching

I have to find the closest match between my image and a bunch of already collected images of different classes in a folder. Which meta-learning approach should I select? I am thinking about a Siamese or a matching network. With a Siamese network, I have to match my image against all existing images in the folder to find the correct match. Do you think I could use a matching network and get a better result? What is the parameter based on which …
Category: Data Science

Distance Metric between 2 lists of sets

I have 2 lists of sets and I want to calculate a distance between them.

set1 = [ {'A', 'B', 'C'}, {'A', 'D', 'X'}, {'X', 'A'} ]
set2 = [ {'A', 'B', 'C', 'D'}, {'A', 'X'}, {'X', 'A', 'B'} ]

So if the lists of sets are equal I want the distance to be 0, and if they are unequal I want the distance to be higher than 0. The exact distance doesn't really matter as I'll ultimately be aggregating to compare …
Category: Data Science
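
One way to meet the stated requirement (0 when equal, greater than 0 otherwise) is to average the Jaccard distance over the paired sets. This is only a sketch: it assumes both lists have the same length and that sets are compared position by position, neither of which the question confirms, and `list_of_sets_distance` is a hypothetical name.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def list_of_sets_distance(xs, ys):
    """Mean Jaccard distance over sets paired by position (an assumed alignment)."""
    if len(xs) != len(ys):
        raise ValueError("both lists must contain the same number of sets")
    if not xs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in zip(xs, ys)) / len(xs)


set1 = [{'A', 'B', 'C'}, {'A', 'D', 'X'}, {'X', 'A'}]
set2 = [{'A', 'B', 'C', 'D'}, {'A', 'X'}, {'X', 'A', 'B'}]
distance = list_of_sets_distance(set1, set2)
```

If the order of the sets within each list carries no meaning, the pairs would need to be matched first (for example by an assignment step) before averaging.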

Cosine-like alternative to Mahalanobis distance

I would like to have a distance measure that takes into account how spread out the vectors in a dataset are, to weight the absolute distance from one point to another. The Mahalanobis distance does exactly this, but it is a generalization of Euclidean distance, which is not particularly suitable for high-dimensional spaces (see for instance here). Do you know of any measure that is suitable in high-dimensional spaces while also taking into account the correlation between datapoints? Thank you! :)
Category: Data Science

Can siamese model trained with euclidean distance as distance metric use cosine similarity during inference?

Suppose I have 3 embeddings (Anchor, Positive, Negative) from a Siamese model trained with triplet loss using Euclidean distance as the distance metric. Can cosine similarity be used during inference? I have noticed that if I calculate the Euclidean distance between A, P, and N from the model, the results seem fairly consistent, with matching images getting smaller distances and non-matching images getting bigger distances in most cases. If I use cosine similarity on the above embeddings, I am unable to differentiate as similarity values …
Category: Data Science
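
A fact that bears on this question: for L2-normalized vectors, squared Euclidean distance and cosine similarity are tied by the identity ||a - b||^2 = 2 - 2 cos(a, b), so both produce the same ranking. The check below is a small numerical sketch; the embedding values are hypothetical, and whether the model in the question normalizes its embeddings is not stated.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical embeddings, for illustration only.
anchor = l2_normalize([0.2, 0.9, 0.4])
positive = l2_normalize([0.25, 0.8, 0.5])

lhs = squared_euclidean(anchor, positive)
rhs = 2 - 2 * cosine_similarity(anchor, positive)  # matches lhs up to rounding
```

Without normalization the identity does not hold, because Euclidean distance also depends on vector length while cosine similarity ignores it, which is one plausible reason the two measures could disagree.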

Assessing Group Similarities and Dissimilarities Post PCA

The goal is to assess similarity and dissimilarity between 6 known groups. The original data began with the 6 known groups and 2,700+ variables all on a scale of 0 to 100. I have performed PCA to reduce the 2,700+ variables into 5 principal components using the dudi.pca function from the ade4 package in R. Here are the eigenvalues for the components:

       eigenvalue  variance.percent  cumulative.variance.percent
Dim.1  998.3274    36.635867         36.63587
Dim.2  670.1278    24.591848         61.22771
Dim.3  482.2372    17.696776         78.92449
Dim.4  352.2806    12.927728         …
Category: Data Science

When would one use Manhattan distance as opposed to Euclidean distance?

I am looking for a good argument for why one would use the Manhattan distance over the Euclidean distance in machine learning. The closest thing I have found to a good argument so far is in this MIT lecture. At 36:15 you can see on the slides the following statement: "Typically use Euclidean metric; Manhattan may be appropriate if different dimensions are not comparable." Shortly after, the professor says that, because the number of legs of a reptile varies …
Category: Data Science

Levenshtein distance vs simple for loop

I have recently begun studying different data science principles, and have had a particular interest as of late in fuzzy matching. As a preface, I'd like to include smarter fuzzy searching in a proprietary language named "4D" in my workplace, so access to libraries is pretty much nonexistent. It's also worth noting that the client side is currently single-threaded, so taking advantage of multi-threaded matrix manipulations is out of the question. I began studying the Levenshtein algorithm and got that …
Category: Data Science
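
For reference, the Levenshtein distance needs nothing beyond nested loops and two rows of integers, so it should port to a language without libraries. The sketch below is the textbook dynamic-programming formulation written in Python for illustration; it is not the asker's code, and 4D syntax will differ.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # Keep the shorter string as the row to minimise memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

The run time is proportional to len(a) * len(b) per comparison, all on a single thread, so for large candidate lists a cheap pre-filter (for example on string length) may be worth adding before calling it.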

Vectorized String Distance

I am looking for a way to calculate the string distance between two Pandas dataframe columns in a vectorized way. I tried the distance and textdistance libraries, but they require using df.apply, which is incredibly slow. Do you know of any way to compute a string distance using only column operations? Thanks
Category: Data Science

How can I use Hellinger distance on arrays of different lengths?

I have to use Hellinger distance to compare arrays that are not the same length. How do you do this correctly? Putting a zero in the missing fields for the shorter array does not sound like the best method to me. Some more info on my data: Most array dimensions are (1,58), but some others are (1,28). Arrays contain numbers from 1 to 3. Example:

Array1=[1 1 3 2 3]
Array2=[2 3 1 1]

One possible solution: newArray2=[2 3 …
Category: Data Science
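
Because Hellinger distance is defined between probability distributions, one option is to compare the relative frequencies of the values 1 to 3 in each array, which makes the array lengths irrelevant. This sketch assumes the order of the elements does not matter, which the question does not confirm; `to_distribution` is a hypothetical helper.

```python
import math
from collections import Counter


def to_distribution(values, categories=(1, 2, 3)):
    """Relative frequency of each category; assumes a non-empty array."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared_diffs = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_diffs / 2)


array1 = [1, 1, 3, 2, 3]
array2 = [2, 3, 1, 1]

p = to_distribution(array1)  # frequencies of 1, 2, 3 in array1
q = to_distribution(array2)  # frequencies of 1, 2, 3 in array2
distance = hellinger(p, q)
```

If position within the array does matter, a frequency-based comparison discards that information and a different measure would be needed.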

Clustering without information about identifier

I have a data set with different products and a binary value indicating whether each product was sold in a store or not. It looks like:

   product_id  store_1  store_2  store_3  store_4  store_5  store_6
0  A           1        0        0        1        0        1
1  B           1        1        0        0        1        0

Is there any way to cluster these products without any information about the products themselves? One thought I had was to generate distances between products and then cluster the product X product matrix. Is this …
Category: Data Science

What's an appropriate clustering quality estimate / metric for precomputed distance in HDBSCAN?

HDBSCAN supports estimation of clusters from precomputed distances. However, the Python implementation of HDBSCAN (scikit-contrib) doesn't create minimum spanning trees in the absence of raw data when precomputed distance matrices are provided as inputs. Therefore, it doesn't compute the relative_validity score or DBCV score to facilitate hyperparameter tuning in such instances. I am trying to use a Euclidean projection (square-root transform) of the Gower dissimilarity composite (without Podani's option) as a precomputed metric in HDBSCAN. Since distance-based scores like Silhouette are …
Category: Data Science

Question about Similarity vs Dissimilarity Matrix

Right now, I'm working on coming up with a similarity vs dissimilarity matrix for a set of data points for a clustering algorithm. My question is: if I want to use one of the many clustering algorithms available in R, such as the K-Medoids algorithm, does it require a similarity or a dissimilarity matrix as its parameter? What's the difference between the two? If I use the Gower distance from the daisy function in R, does it output a similarity or …
Category: Data Science
