Jaccard Similarity with Binary Data
I have 5400 rows of data and 3211 columns of attributes.
The first 4 columns are ID/Name/ParentID/ObjectType; the remaining 3207 columns are the attributes to be used for the similarity measure.
Huge dimensionality, I know, but as a first step I just wanted to see how this data clusters and how similar the rows are to each other across all attributes.
I converted all attribute values to "0" if there was no value and "1" if there was a value. I thought converting the values to binary would be an easy first step toward a clustering visual and a similarity metric.
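To make the conversion concrete, here is a rough sketch of the kind of binarization I mean. The sample rows and the helper name `binarize` are made up for illustration, and I'm assuming that `None` and empty strings count as "no value"; the real data may need a different rule.

```python
# Hypothetical layout: 4 leading metadata columns, then the attribute columns.
N_META = 4  # ID, Name, ParentID, ObjectType


def binarize(row, n_meta=N_META):
    """Split a row into (metadata, bits); bits[i] is 1 if attribute i has a value."""
    meta, attrs = row[:n_meta], row[n_meta:]
    # Assumption: None and blank strings mean "no value"; anything else is a value.
    bits = [0 if v is None or str(v).strip() == "" else 1 for v in attrs]
    return meta, bits


# Made-up sample rows with only 3 attribute columns instead of 3207.
rows = [
    ["1", "pump", "0", "Asset", "12.5", "", "red"],
    ["2", "valve", "1", "Asset", "", "", "blue"],
]

binary_rows = [binarize(r)[1] for r in rows]
print(binary_rows)
```

The slice `row[n_meta:]` picks up every attribute column by position, so nothing depends on the column names or on how many attributes there are.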
Jaccard similarity seems to be a good measure for binary data, but I'm stumped as to how to implement it (in Python) when I don't have any lists for comparison. Am I supposed to hard-code each of the 3207 variables into the algorithm?
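As far as I understand it, the Jaccard similarity of two binary rows is the number of attributes that are 1 in both rows divided by the number that are 1 in at least one of them. A minimal sketch of that idea, treating each row as one list so no column is named individually, might look like the following. The object names and values are invented, and returning 1.0 for two all-zero rows is my own assumption, since the ratio is undefined in that case.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # 1 in both rows
    either = sum(1 for x, y in zip(a, b) if x or y)   # 1 in at least one row
    # Assumption: two rows with no values at all are treated as identical.
    return both / either if either else 1.0


# Made-up binary rows keyed by a hypothetical object ID.
binary_rows = {
    "obj1": [1, 0, 1, 1],
    "obj2": [1, 1, 0, 1],
    "obj3": [0, 0, 0, 1],
}

# Compare every pair of rows once; the columns are never named explicitly.
similarities = {
    (i, j): jaccard(binary_rows[i], binary_rows[j])
    for i, j in combinations(binary_rows, 2)
}

for pair, score in similarities.items():
    print(pair, round(score, 3))
```

For 5400 rows this pure-Python loop would cover roughly 14.6 million pairs, so a vectorized route is probably more practical: `scipy.spatial.distance.pdist` accepts `metric="jaccard"` on a boolean array, though it returns the Jaccard distance (1 minus the similarity) rather than the similarity itself.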
I'm not sure where to start. Also, if there's a better way of doing this, I'm all ears. I've been researching how best to tackle this problem, and there are so many similarity metrics that could be used, but I'm stuck on how to begin with this many columns.
Topic jaccard-coefficient visualization python similarity data-cleaning
Category Data Science