Cluster Evaluation with Jaccard and Rand Index

I've clustered my data according to 3 criteria into 3 groups. I used k-means to obtain those clusters, so the label for each cluster is arbitrary and changes on each script run. To evaluate the consistency of my clusters I decided to use the Jaccard index, but I can't understand how to apply it properly. Let's say I have this data, where alpha, beta and gamma are the 3 methods, and the Cluster Index is the value returned by k-means for …
Category: Data Science
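
Because k-means labels are arbitrary, one common way to compare two clusterings is to count pairs of points rather than compare labels directly. The following is a minimal sketch of that idea, not the asker's code: the function names and the toy labelings `alpha`, `beta` and `gamma` are assumptions made for illustration.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Classify every pair of points by whether each clustering groups it together."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b + neither
    return (both + neither) / total if total else 1.0


def pair_jaccard(labels_a, labels_b):
    """Like the Rand index, but ignoring pairs that both clusterings separate."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0


# Hypothetical labelings of six points by three methods.
alpha = [0, 0, 1, 1, 2, 2]
beta = [2, 2, 0, 0, 1, 1]   # same partition as alpha, labels permuted
gamma = [0, 0, 0, 1, 2, 2]  # one point moved to another cluster

print(rand_index(alpha, beta), pair_jaccard(alpha, beta))
print(rand_index(alpha, gamma), pair_jaccard(alpha, gamma))
```

Both scores depend only on which points share a cluster, so relabeling the clusters between runs does not change them.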

Which string distance equation for fuzzy-matching person names is reliable?

A reproducible example with a small bit of R code is available in this Stack Overflow post (linked so I don't need to retype the code). The fuzzytext library in R offers the following string distance methods: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). Our use case is matching (left-joining) basketball player names from 2 different sources. From the Stack Overflow post, we have the following concerns to account for when string-matching names: The left join shouldn't …
Category: Data Science

Distance Metric between 2 lists of sets

I have 2 lists of sets and I want to calculate a distance between them. set1 = [ {'A', 'B', 'C'}, {'A', 'D', 'X'}, {'X', 'A'} ] set2 = [ {'A', 'B', 'C', 'D'}, {'A', 'X'}, {'X', 'A', 'B'} ] If the two lists of sets are equal, I want the distance to be 0; if they are unequal, I want the distance to be greater than 0. The exact distance doesn't really matter, as I'll ultimately be aggregating to compare …
Category: Data Science
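
One possible reading of the question above is that the sets are compared position by position. Under that assumption (which the excerpt does not confirm), a simple distance is the mean Jaccard distance over aligned pairs. The function names below are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def list_of_sets_distance(xs, ys):
    """Mean Jaccard distance between sets at the same position (assumed alignment)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes lists of equal length")
    if not xs:
        return 0.0
    return sum(jaccard_distance(x, y) for x, y in zip(xs, ys)) / len(xs)


set1 = [{'A', 'B', 'C'}, {'A', 'D', 'X'}, {'X', 'A'}]
set2 = [{'A', 'B', 'C', 'D'}, {'A', 'X'}, {'X', 'A', 'B'}]

print(list_of_sets_distance(set1, set1))  # identical lists give 0.0
print(list_of_sets_distance(set1, set2))  # (1/4 + 1/3 + 1/3) / 3
```

If the order of the sets within each list should not matter, the pairs would first have to be matched (for example by best overlap) before averaging; this sketch does not attempt that.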

Metrics - multi-class model comparisons

I am looking for a way to quantify the performance of multi-class model labelers, and thus compare them. I want to account for the fact that some classes are ‘closer’ than others (for example, a ‘car’ is ‘closer’ to a ‘truck’ than a ‘flower’ is). So if a labeler classifies a car as a truck, that is better than classifying the car as a flower. I am considering using a Jaccard similarity score. Will this do what I want?
Category: Data Science

What is the correct formula for Jaccard coefficient with integer vectors?

I understand the Jaccard index is the number of elements in common divided by the total number of distinct elements. But there seems to be some discrepancy or terminology confusion about Jaccard being applied to "binary vectors", meaning vectors with binary attributes (0, 1), or to "integer vectors", meaning any vectors with integer values (2, 5, 6, 8). Are there two formulas, depending on the type of elements in the vector? This answer comments about "binary vectors" which "they can …
Category: Data Science
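
To make the two readings concrete, here is a hedged sketch. It assumes the "integer vector" case means non-negative counts per attribute, for which a widely used generalization is the weighted Jaccard (Ruzicka) similarity, sum of minima over sum of maxima. On 0/1 vectors it reduces to the ordinary set-based formula. The function names and example vectors are illustrative only.

```python
def jaccard_binary(x, y):
    """Attributes present in both vectors / attributes present in either."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 1.0


def jaccard_weighted(x, y):
    """Weighted Jaccard: sum(min) / sum(max), assuming non-negative entries."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0


b1, b2 = [1, 0, 1, 1], [1, 1, 0, 1]
c1, c2 = [2, 5, 0, 8], [2, 3, 6, 8]

print(jaccard_binary(b1, b2))    # 2 shared / 4 present in either
print(jaccard_weighted(b1, b2))  # same value on binary input
print(jaccard_weighted(c1, c2))  # 13 / 21
```

If the integers are instead element identifiers (the vector is really a set such as {2, 5, 6, 8}), the plain set formula applies and no second formula is needed.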

Efficiently Sending Two Series to a Function For Strings with an application to String Matching (Dice Coefficient)

I am using a Dice Coefficient based function to calculate the similarity of two strings:
    def dice_coefficient(a, b):
        try:
            if not len(a) or not len(b):
                return 0.0
        except:
            return 0.0
        if a == b:
            return 1.0
        if len(a) == 1 or len(b) == 1:
            return 0.0
        a_bigram_list = [a[i:i+2] for i in range(len(a)-1)]
        b_bigram_list = [b[i:i+2] for i in range(len(b)-1)]
        a_bigram_list.sort()
        b_bigram_list.sort()
        lena = len(a_bigram_list)
        lenb = len(b_bigram_list)
        matches = i = j = 0
        while (i < lena and j …
Category: Data Science
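
The excerpt above is cut off, so the following is not the asker's function. It is a separate, minimal sketch of a bigram Dice coefficient that counts matches with `collections.Counter` instead of a sorted merge, and it assumes both inputs are plain `str` values. The pairwise helper and the sample names are hypothetical.

```python
from collections import Counter


def bigram_counts(s):
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_coefficient(a, b):
    """2 * shared bigrams / total bigrams, with duplicates counted."""
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    ca, cb = bigram_counts(a), bigram_counts(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    matches = sum((ca & cb).values())  # multiset intersection
    return 2.0 * matches / total


def pairwise_dice(left, right):
    """Score two equally long sequences of strings element by element."""
    return [dice_coefficient(x, y) for x, y in zip(left, right)]


print(dice_coefficient("night", "nacht"))
print(pairwise_dice(["night", "abc"], ["nacht", "abc"]))
```

The same element-wise helper can be fed two columns of strings converted to lists, which avoids calling the function once per row through a row-wise apply.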

What is the state-of-the-art/research metric for comparing ellipses, other than the Jaccard coefficient?

I'm looking for the metric, if there is one, for comparing ellipses with each other. Last time I had a similar dataset (malaria cells; now it's pupils) I used the Jaccard coefficient, but that was mostly because I didn't have the time to do further research on the topic. I used the Jaccard coefficient like this: - transform the multi-dimensional data into 1D to make the comparison possible at all. Even though it worked quite well, I didn't like it that …
Category: Data Science

Jaccard Similarity with Binary Data

I have 5400 rows of data and 3211 columns of attributes. The first 4 columns are ID/Name/ParentID/ObjectType; the remaining 3207 columns are the attributes to be used for similarity measures. Huge dimensionality, I know, but I wanted to (as a first step) just see how this data clusters and find similarities between all attributes. I converted all attribute values to "0" if there was no value, and "1" if there was a value. I thought …
Category: Data Science

Similarity of search results using Jaccard

I have a set of search results with ranking position, keyword and URL. I want to make a distance matrix so I can cluster the keywords (or the URLs). One approach would be to take the first n URL rankings for each keyword and use Jaccard similarity. However, I also want higher-ranked positions to be weighted more heavily than lower-ranked ones. For example, two keywords that have the same URL in positions 1 and 2 are more …
Category: Data Science
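
One way to fold rank into the comparison is a weighted Jaccard similarity in which each URL carries a weight that shrinks with its position. The sketch below assumes a weight of 1/rank, which is only one possible choice and is not taken from the question; the function names and URL labels are hypothetical.

```python
def rank_weights(urls):
    """Map each URL to 1/rank (rank starts at 1); an assumed weighting scheme."""
    weights = {}
    for rank, url in enumerate(urls, start=1):
        weights.setdefault(url, 1.0 / rank)  # keep the best rank if a URL repeats
    return weights


def rank_weighted_jaccard(urls_a, urls_b):
    """Weighted Jaccard: sum of min weights / sum of max weights over all URLs."""
    wa, wb = rank_weights(urls_a), rank_weights(urls_b)
    keys = wa.keys() | wb.keys()
    num = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    den = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0


base = ["a", "b", "c", "d"]
shared_top = ["a", "b", "x", "y"]     # agrees with base in positions 1 and 2
shared_bottom = ["x", "y", "c", "d"]  # agrees with base in positions 3 and 4

print(rank_weighted_jaccard(base, shared_top))
print(rank_weighted_jaccard(base, shared_bottom))
```

Both comparison lists share two URLs with `base`, so plain Jaccard would score them equally; with rank weights the overlap at the top scores higher. Using 1 minus the similarity gives a value suitable for a distance matrix.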

Jaccard similarity between two items

Calculating similarity between two users is rather straightforward. Consider the following example: User A = {7,3,2,4,1} User B = {4,1,9,7,5} Products in common = {1,4,7} Union of products = {1,2,3,4,5,7,9} Hence the Jaccard similarity: 3/7 = 0.429. However, it is not clear to me how to calculate similarity between two products. Let's say I want to calculate the similarity between products 7 and 1 from the previous example; how can one achieve that?
Category: Data Science
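
A common approach to the product case is to flip the data: represent each product by the set of users who bought it, then apply the same Jaccard formula to those sets. The sketch below uses the two users from the question; the helper names are assumptions for illustration.

```python
from collections import defaultdict


def jaccard(a, b):
    """|a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def buyers_by_product(baskets):
    """Invert user -> products into product -> set of users."""
    buyers = defaultdict(set)
    for user, products in baskets.items():
        for product in products:
            buyers[product].add(user)
    return buyers


baskets = {"A": {7, 3, 2, 4, 1}, "B": {4, 1, 9, 7, 5}}
buyers = buyers_by_product(baskets)

print(jaccard(baskets["A"], baskets["B"]))  # user similarity, 3/7
print(jaccard(buyers[7], buyers[1]))        # both products bought by A and B
print(jaccard(buyers[7], buyers[3]))        # product 3 bought by A only
```

With only two users, products 7 and 1 have identical buyer sets, so their similarity is 1.0; with more users the scores become more informative.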

Implementing Frequently bought together using a DB

We have a classic online shop database structure (products, customers, sales) and we want to implement a Frequently bought together feature. Our software is in ASP.NET and we do not know PHP, so we cannot reverse engineer how this is done in Magento. All we need is a simple Frequently bought together feature (without discounts like Magento offers). I understand that this is machine learning and that one of the more common approaches is the Jaccard coefficient. Is that the …
Category: Data Science
