Jaccard Similarity with Binary Data

Question

JessicaRabi

August 17, 2019, 10:37

I have 5400 rows of data and 3211 columns of attributes.

The first 4 columns are ID/Name/ParentID/ObjectType; the remaining 3207 columns are the attributes to be used for similarity measures.

Huge dimensionality, I know, but as a first step I just wanted to see how this data clusters and how similar the rows are across all attributes.

I converted every attribute value to "0" if there was no value and "1" if there was a value. I thought converting the values to binary would be an easy first step toward a clustering visual and a similarity metric.

Jaccard similarity seems to be a good measure for binary data, but I'm stumped as to how to implement it (in Python) when I don't have any lists for comparison. Am I supposed to hard-code each variable into the algorithm (3207 variables)?

I'm not sure where to start. Also, if there's a better way of doing this, I'm all ears. I've been researching how best to tackle this problem, and there are so many similarity metrics that could be used, but I'm stuck on how to start since so many columns need to be used.

Topic jaccard-coefficient visualization python similarity data-cleaning

Category Data Science

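Tasos · Accepted Answer · August 17, 2019, 10:37

scikit-learn's DBSCAN clustering algorithm accepts Jaccard distance as a built-in metric option, so you can pass the whole attribute matrix at once instead of hard-coding each column: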

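from sklearn.cluster import DBSCAN

# X: binary attribute matrix (one row per object); a boolean dtype suits Jaccard.
# eps and min_samples are left at scikit-learn's defaults and usually need tuning.
db = DBSCAN(metric="jaccard").fit(X)
labels = db.labels_  # one cluster label per row; -1 marks noise

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)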

Here X is your dataset restricted to the columns you want to use, i.e. the 3207 binary attribute columns without ID/Name/ParentID/ObjectType.
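If you also want the similarity values themselves rather than only cluster labels, the following is a minimal sketch (not part of the original answer) that computes the pairwise Jaccard similarity between all rows with SciPy. It assumes the attributes are already a 0/1 matrix; the helper name jaccard_similarity_matrix and the toy array X_demo are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def jaccard_similarity_matrix(X):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix.

    Hypothetical helper for illustration; X is assumed to be binary.
    """
    X = np.asarray(X, dtype=bool)
    # pdist computes the Jaccard *distance* for every pair of rows;
    # squareform expands the condensed vector into a square matrix.
    dist = squareform(pdist(X, metric="jaccard"))
    # Jaccard similarity = 1 - Jaccard distance.
    return 1.0 - dist


# Toy stand-in for the real 5400 x 3207 attribute matrix.
X_demo = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
])

sim = jaccard_similarity_matrix(X_demo)
# Rows 0 and 1 share 1 attribute out of 2 set in either row: 1/2 = 0.5.
# Rows 0 and 2 share no attributes, so their similarity is 0.
print(sim)
```

With the real data you would pass only the attribute columns, for example df.iloc[:, 4:] if the table is loaded into a pandas DataFrame called df (both names are assumptions here). For 5400 rows the full similarity matrix has 5400 x 5400 entries, roughly 230 MB as float64, which is something to keep in mind on a small machine.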
