I've clustered my data into 3 groups according to 3 criteria. I used k-means to obtain those clusters, so the label for each cluster is arbitrary and changes on each script run. To evaluate the consistency of my clusters I decided to use the Jaccard index, but I can't understand how to apply it properly. Let's say I have this data, where alpha, beta, and gamma are the 3 methods, and the Cluster Index is the value returned by k-means for …
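One label-invariant way to apply it is to compute the Jaccard index over pairs of co-clustered points rather than over the raw labels; a minimal Python sketch (my construction, not from the post):

```python
from itertools import combinations

def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the
    same points; invariant to the arbitrary IDs k-means assigns."""
    together_a, together_b = set(), set()
    for i, j in combinations(range(len(labels_a)), 2):
        if labels_a[i] == labels_a[j]:
            together_a.add((i, j))
        if labels_b[i] == labels_b[j]:
            together_b.add((i, j))
    union = together_a | together_b
    return len(together_a & together_b) / len(union) if union else 1.0

# Two runs with permuted labels but identical partitions score 1.0:
run1 = [0, 0, 1, 1, 2, 2]
run2 = [2, 2, 0, 0, 1, 1]
print(pair_jaccard(run1, run2))  # 1.0
```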
A reproducible example with a small bit of R code is available in this Stack Overflow post (linked so I don't need to re-type the code). The fuzzytext library in R has the following string methods available: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). Our use case is matching (left-joining) basketball player names from 2 different sources. From the Stack Overflow post, we have the following concerns to account for when string-matching names: The left join shouldn't …
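For intuition about the "jaccard" option in that list, here is a rough Python sketch of a q-gram Jaccard distance; the function name, the q = 2 default, and the player names are illustrative assumptions, not the R API:

```python
def qgram_jaccard_dist(a, b, q=2):
    """Jaccard distance between the q-gram sets of two strings,
    mirroring the idea behind the 'jaccard' string method."""
    grams = lambda s: {s[i:i + q] for i in range(len(s) - q + 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    union = ga | gb
    return 1 - len(ga & gb) / len(union) if union else 0.0

print(qgram_jaccard_dist("Steph Curry", "Stephen Curry"))  # small distance
print(qgram_jaccard_dist("Steph Curry", "Trae Young"))     # near 1
```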
I have 2 lists of sets and I want to calculate a distance.

set1 = [ {'A', 'B', 'C'}, {'A', 'D', 'X'}, {'X', 'A'} ]
set2 = [ {'A', 'B', 'C', 'D'}, {'A', 'X'}, {'X', 'A', 'B'} ]

If the lists of sets are equal I want the distance to be 0, and if they are unequal I want the distance to be greater than 0. The exact distance doesn't really matter, as I'll ultimately be aggregating to compare …
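Under the assumption that the lists are aligned position-by-position (which the example suggests), one simple option is the mean element-wise Jaccard distance; a sketch:

```python
def jaccard_dist(s, t):
    """1 - |s & t| / |s | t|; two empty sets count as identical."""
    union = s | t
    return 1 - len(s & t) / len(union) if union else 0.0

def list_dist(sets_a, sets_b):
    """Mean position-wise Jaccard distance (assumes equal-length,
    aligned lists): 0 exactly when the lists match set-for-set."""
    return sum(jaccard_dist(s, t) for s, t in zip(sets_a, sets_b)) / len(sets_a)

set1 = [{'A', 'B', 'C'}, {'A', 'D', 'X'}, {'X', 'A'}]
set2 = [{'A', 'B', 'C', 'D'}, {'A', 'X'}, {'X', 'A', 'B'}]
print(list_dist(set1, set1))  # 0.0
print(list_dist(set1, set2))  # ~0.306 (> 0 for unequal lists)
```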
I am looking for a way to quantify the performance of multi-class model labelers, and thus compare them. I want to account for the fact that some classes are ‘closer’ than others (for example, a car is ‘closer’ to a ‘truck’ than to a ‘flower’). So, if a labeler classifies a car as a truck, that is better than classifying the car as a flower. I am considering using a Jaccard similarity score. Will this do what I want?
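Plain Jaccard over label sets gives no partial credit for near misses; one way to encode class closeness is a hand-specified similarity matrix. A sketch (the 0.7 for car/truck is a purely illustrative assumption):

```python
import numpy as np

classes = ["car", "truck", "flower"]
# Class-similarity matrix; 1.0 on the diagonal, values assumed.
S = np.array([
    [1.0, 0.7, 0.0],
    [0.7, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

def similarity_weighted_score(y_true, y_pred):
    """Mean class similarity between true and predicted labels, so a
    car->truck error scores better than car->flower."""
    idx = {c: i for i, c in enumerate(classes)}
    return float(np.mean([S[idx[t], idx[p]] for t, p in zip(y_true, y_pred)]))

print(similarity_weighted_score(["car", "car"], ["truck", "flower"]))  # 0.35
```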
I understand the Jaccard index is the number of elements in common divided by the total number of distinct elements. But there seems to be some discrepancy, or terminology confusion, about Jaccard being applied to "binary vectors", meaning vectors with binary attributes (0, 1), versus "integer vectors", meaning vectors with arbitrary integer values (2, 5, 6, 8). Are there two formulas depending on the type of elements in the vector? This answer comments about "binary vectors" which "they can …
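There are indeed two common variants: the set formula for binary vectors, and the generalized (weighted, a.k.a. Ruzicka) Jaccard for non-negative numeric vectors, which uses the sum of element-wise minima over the sum of element-wise maxima. A small sketch contrasting them:

```python
def binary_jaccard(u, v):
    """Set-style Jaccard on 0/1 vectors: |u AND v| / |u OR v|."""
    inter = sum(1 for a, b in zip(u, v) if a and b)
    union = sum(1 for a, b in zip(u, v) if a or b)
    return inter / union if union else 1.0

def weighted_jaccard(u, v):
    """Generalized (Ruzicka) Jaccard for non-negative values:
    sum of element-wise minima over sum of element-wise maxima."""
    den = sum(max(a, b) for a, b in zip(u, v))
    return sum(min(a, b) for a, b in zip(u, v)) / den if den else 1.0

print(binary_jaccard([1, 0, 1, 1], [1, 1, 0, 1]))    # 2/4 = 0.5
print(weighted_jaccard([2, 5, 6, 8], [1, 5, 0, 8]))  # 14/21 ≈ 0.667
```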
I am using a Dice-coefficient-based function to calculate the similarity of two strings:

    def dice_coefficient(a, b):
        try:
            if not len(a) or not len(b):
                return 0.0
        except:
            return 0.0
        if a == b:
            return 1.0
        if len(a) == 1 or len(b) == 1:
            return 0.0
        a_bigram_list = [a[i:i+2] for i in range(len(a)-1)]
        b_bigram_list = [b[i:i+2] for i in range(len(b)-1)]
        a_bigram_list.sort()
        b_bigram_list.sort()
        lena = len(a_bigram_list)
        lenb = len(b_bigram_list)
        matches = i = j = 0
        while (i < lena and j …
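For reference, here is a self-contained version of the same bigram-Dice idea with the merge loop completed; a sketch of the standard sorted-merge counting, since the loop above is cut off:

```python
def dice_coefficient(a, b):
    """Dice similarity over character bigrams: 2*|A ∩ B| / (|A| + |B|),
    counted with multiplicity via a merge of the sorted bigram lists."""
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    if len(a) == 1 or len(b) == 1:
        return 0.0
    a_bigrams = sorted(a[i:i + 2] for i in range(len(a) - 1))
    b_bigrams = sorted(b[i:i + 2] for i in range(len(b) - 1))
    matches = i = j = 0
    while i < len(a_bigrams) and j < len(b_bigrams):
        if a_bigrams[i] == b_bigrams[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigrams[i] < b_bigrams[j]:
            i += 1
        else:
            j += 1
    return matches / (len(a_bigrams) + len(b_bigrams))

print(dice_coefficient("night", "nacht"))  # 0.25
```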
I understand that Jaccard and Dice follow a monotonic relation on binary datasets because the two are related as $J = \frac{S}{2 - S}$, and I guess this would be the case when micro-average is used with multi-label datasets. However, would the two metrics follow a monotonic relation when macro-average is used?
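In general it does not: the per-class map $J = \frac{S}{2 - S}$ is convex, so spreading Dice scores unevenly across classes inflates the macro-Jaccard mean relative to a uniform model. A small numeric counterexample (per-class Dice values assumed for illustration):

```python
def dice_to_jaccard(s):
    """Per-class identity on binary data: J = S / (2 - S)."""
    return s / (2 - s)

macro = lambda xs: sum(xs) / len(xs)

# Hypothetical per-class Dice scores for two models (values assumed):
model_a = [1.0, 0.0]     # one class perfect, one missed entirely
model_b = [0.52, 0.52]   # uniformly mediocre

print(macro(model_a), macro([dice_to_jaccard(s) for s in model_a]))
# 0.5 0.5
print(macro(model_b), macro([dice_to_jaccard(s) for s in model_b]))
# 0.52 ~0.351
# Macro-Dice ranks B above A, but macro-Jaccard ranks A above B,
# so the monotonic relation breaks under macro-averaging.
```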
I'm looking for the metric, if there is one, to compare ellipses with each other. Last time I had a similar dataset (malaria cells; now it's pupils) I used the Jaccard coefficient, but that was mostly because I didn't have the time to do further research on this topic. I used the Jaccard coefficient like this: - transform the multi-dimensional data into 1D to make the comparison possible at all. Even though it worked quite well, I didn't like that …
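One way to compare ellipses directly in 2D, instead of flattening to 1D, is to rasterize both onto a pixel grid and take the Jaccard index (IoU) of the masks; a sketch with assumed parameter conventions (center, semi-axes, rotation in radians):

```python
import numpy as np

def ellipse_mask(shape, cx, cy, a, b, theta):
    """Boolean mask of a rotated ellipse on a pixel grid."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    x, y = xx - cx, yy - cy
    ct, st = np.cos(theta), np.sin(theta)
    u, v = x * ct + y * st, -x * st + y * ct
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

def mask_jaccard(m1, m2):
    """Jaccard index (IoU) of two boolean masks."""
    union = np.logical_or(m1, m2).sum()
    return np.logical_and(m1, m2).sum() / union if union else 1.0

grid = (200, 200)
e1 = ellipse_mask(grid, 100, 100, 60, 30, 0.0)
e2 = ellipse_mask(grid, 105, 100, 60, 30, 0.1)
print(mask_jaccard(e1, e2))  # close to 1 for nearly identical ellipses
```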
I have 5400 rows of data and 3211 columns of attributes. The first 4 columns are ID/Name/ParentID/ObjectType; the remaining 3207 columns are the attributes to be used for similarity measures. Huge dimensionality, I know, but I wanted to (as a first step) just see how this data clusters and what similarity exists across all attributes. I converted all attribute values to "0" if there was no value and "1" if there was a value. I thought …
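With everything binarized, SciPy can produce the pairwise Jaccard matrix directly; a sketch on a toy stand-in for the 5400 x 3207 attribute block:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for the binary attribute block (shape assumed);
# rows are objects, columns are presence/absence flags.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6, 10)).astype(bool)

# Condensed pairwise Jaccard *distance* between rows, expanded to a
# square matrix; similarity is 1 minus distance.
S = 1.0 - squareform(pdist(X, metric="jaccard"))
print(S.round(2))
```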
I have a set of search results with ranking position, keyword, and URL. I want to build a distance matrix so I can cluster the keywords (or the URLs). One approach would be to take the first n URL rankings for each keyword and use Jaccard similarity. However, I also want higher ranking positions to be weighted more heavily than lower ones; for example, two keywords that have the same URL in positions 1 and 2 are more …
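One way to get that behavior is a weighted Jaccard where each URL carries a weight that decays with its rank (1/rank below is just one assumed decay); a sketch:

```python
def rank_weights(urls):
    """Map each URL to a weight that decays with 1-based rank."""
    return {u: 1.0 / (i + 1) for i, u in enumerate(urls)}

def weighted_jaccard(urls_a, urls_b):
    """Weighted Jaccard: sum of min weights over sum of max weights,
    so overlap at top positions counts more than overlap lower down."""
    wa, wb = rank_weights(urls_a), rank_weights(urls_b)
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    den = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0

kw1 = ["example.com/a", "example.com/b", "example.com/c"]
kw2 = ["example.com/a", "example.com/b", "example.com/d"]
print(weighted_jaccard(kw1, kw2))  # ~0.69; shared top ranks dominate
```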
Calculating similarity between two users is rather straightforward. Consider the following example:

User A = {7, 3, 2, 4, 1}
User B = {4, 1, 9, 7, 5}
Products in common = {1, 4, 7}
Union of products = {1, 2, 3, 4, 5, 7, 9}

Hence the Jaccard similarity is 3/7 ≈ 0.429. However, it is not clear to me how to calculate the similarity between two products. Let's say I want to calculate the similarity between products 7 and 1 from the previous example; how can one achieve that?
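One common approach is to flip the representation: describe each product by the set of users who have it, then apply the same formula. A sketch, with an extra user C added as an assumption so the two user sets actually differ:

```python
baskets = {
    "A": {7, 3, 2, 4, 1},
    "B": {4, 1, 9, 7, 5},
    "C": {7, 2, 5},  # assumed third user for illustration
}

def users_of(product):
    """Set of users whose basket contains the product."""
    return {u for u, items in baskets.items() if product in items}

def product_similarity(p, q):
    up, uq = users_of(p), users_of(q)
    union = up | uq
    return len(up & uq) / len(union) if union else 0.0

print(product_similarity(7, 1))  # {A, B, C} vs {A, B} -> 2/3
```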
We have a classic online-shop database structure (products, customers, sales) and we want to implement a "Frequently bought together" feature. Our software is in ASP.NET and we do not know PHP, so we cannot reverse-engineer how this is done in Magento. All we need is a simple "Frequently bought together" (not with discounts like Magento offers). I understand that this is machine learning, and that one of the more common approaches is the Jaccard coefficient. Is that the …
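The Jaccard coefficient is indeed a common, simple choice here: treat each product as the set of orders containing it and rank co-purchased products by the overlap. A minimal Python sketch with a toy order table (schema assumed; the same logic ports directly to C#/SQL):

```python
from collections import defaultdict

# Toy stand-in for the sales table: each order is a set of product IDs.
orders = [
    {"p1", "p2", "p3"},
    {"p1", "p2"},
    {"p2", "p3"},
    {"p1", "p4"},
]

# Index: for each product, the set of orders containing it.
orders_with = defaultdict(set)
for oid, basket in enumerate(orders):
    for p in basket:
        orders_with[p].add(oid)

def jaccard(p, q):
    """Of the orders containing either product, the fraction with both."""
    a, b = orders_with[p], orders_with[q]
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# 'Frequently bought together' candidates for p1, best first:
print(sorted(((q, jaccard("p1", q)) for q in orders_with if q != "p1"),
             key=lambda t: t[1], reverse=True))
```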