What is the correct formula for the Jaccard coefficient with integer vectors?
I understand the Jaccard index to be the number of elements in common divided by the total number of distinct elements. But there seems to be some discrepancy or terminological confusion about whether Jaccard applies to binary vectors, meaning vectors with binary attributes (0, 1), or to integer vectors, meaning vectors with arbitrary integer values (2, 5, 6, 8).
Are there two formulas, depending on the type of elements in the vector?
This answer discusses binary vectors, which can be interpreted as sets of the indices with value 1. However, there are examples where the Jaccard coefficient is calculated with integer vectors, so that seems to be valid too. Besides, scikit-learn seems to define 3 cases:
Binary vectors
y_true = np.array([[0, 1, 1],
[1, 1, 0]])
y_pred = np.array([[1, 1, 1],
[1, 0, 0]])
Multilabel cases
(the scikit-learn documentation does not define what a multilabel case is)
Multiclass problems are binarized and treated like the corresponding multilabel problem
y_pred = [0, 2, 1, 2]
y_true = [0, 1, 2, 2]
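To make "binarized" concrete, here is a minimal sketch in plain Python of how I read that sentence: each class label is turned into the set of sample positions that carry it, and the set-based Jaccard index is computed per class. The helper names `jaccard_of_sets` and `per_class_jaccard` are my own and not part of scikit-learn, and this is an illustration of the idea rather than the library's implementation.

```python
def jaccard_of_sets(a, b):
    """Set-based Jaccard index |A n B| / |A u B|; NaN if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else float("nan")


def per_class_jaccard(y_true, y_pred):
    """Binarize each class into the set of positions holding that label,
    then compare the true and predicted position sets class by class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_idx = {i for i, y in enumerate(y_true) if y == label}
        pred_idx = {i for i, y in enumerate(y_pred) if y == label}
        scores[label] = jaccard_of_sets(true_idx, pred_idx)
    return scores


y_pred = [0, 2, 1, 2]
y_true = [0, 1, 2, 2]
scores = per_class_jaccard(y_true, y_pred)
print(scores)  # class 0 -> 1.0, class 1 -> 0.0, class 2 -> 1/3
```

Under this reading the integers are only class names: the values 0, 1, 2 are never compared as set elements, only the positions where they occur.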
An additional test with an R library that uses the formula
TP / (TP + FP + FN)
returns NaN:
library(ClusterR)
pc <- c(0, 1, 2, 5, 6, 8, 9)
tc <- c(0, 2, 3, 4, 5, 7, 9)
external_validation(pc, tc, method = "jaccard_index")
[1] NaN
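One possible explanation for the NaN, assuming the library interprets both vectors as cluster labels and counts pairs of samples (I have not verified this against the ClusterR source): every value in `pc` and in `tc` is distinct, so no two samples share a cluster in either vector, and TP, FP and FN are all zero. The sketch below is a hypothetical pair-counting implementation in Python, not ClusterR's code; the function name `pair_counting_jaccard` is my own.

```python
from itertools import combinations


def pair_counting_jaccard(true_labels, pred_labels):
    """Jaccard index over sample pairs: TP / (TP + FP + FN).

    A pair is a TP if both labelings put the two samples in the same cluster,
    an FN if only the true labeling does, and an FP if only the predicted one does.
    Assumes both label sequences have the same length.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fn += 1
        elif same_pred:
            fp += 1
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")  # 0/0 is undefined


pc = [0, 1, 2, 5, 6, 8, 9]
tc = [0, 2, 3, 4, 5, 7, 9]
all_distinct = pair_counting_jaccard(tc, pc)

# With repeated labels the same function gives a finite value.
with_repeats = pair_counting_jaccard([0, 0, 1, 1], [0, 0, 1, 2])
print(all_distinct, with_repeats)
```

If this reading is right, the NaN says nothing about integer vectors as sets; it only reflects that the inputs describe seven singleton clusters each.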
Is using the set based formula only suitable for binary vectors?
$$ J(A,B) = {{|A \cap B|}\over{|A \cup B|}} = {{|A \cap B|}\over{|A| + |B| - |A \cap B|}} $$
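For comparison, here is a small Python sketch that applies the set-based formula directly, once to the integer vectors from the R test (treated as sets of values) and once to a pair of binary vectors (treated as sets of indices with value 1). The function names are my own, and treating the integer vectors as sets is an assumption about what "Jaccard on integer vectors" is meant to compute.

```python
def jaccard_sets(a, b):
    """J(A, B) = |A n B| / |A u B|, with NaN when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else float("nan")


def jaccard_binary(u, v):
    """Interpret binary vectors as the sets of indices whose value is 1."""
    ones_u = {i for i, x in enumerate(u) if x == 1}
    ones_v = {i for i, x in enumerate(v) if x == 1}
    return jaccard_sets(ones_u, ones_v)


pc = [0, 1, 2, 5, 6, 8, 9]
tc = [0, 2, 3, 4, 5, 7, 9]
integer_score = jaccard_sets(pc, tc)              # |{0, 2, 5, 9}| / 10
binary_score = jaccard_binary([0, 1, 1], [1, 1, 1])  # |{1, 2}| / |{0, 1, 2}|
print(integer_score, binary_score)
```

Both calls use the same formula; the only difference is how each vector is mapped to a set before the intersection and union are taken.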