What is the correct formula for Jaccard coefficient with integer vectors?

I understand the Jaccard index is the number of elements in common divided by the total number of distinct elements. But it seems to be some discrepancy or terminology confusion about Jaccard being applied to binary vectors, meaning a vector with binary attributes (0, 1), or, integer vectors, meaning any vector with integer values (2, 5, 6, 8).

There are two formulas depending on the type of elements in the vector?

This answer comments about binary vectors which they can be interpreted as sets of indices with value 1. However there are examples where Jaccard Coefficient is calculated with an integer vectors, so it seems to be valid. Besides, scikit-learn seems to define 3 cases:

Binary vectors

y_true = np.array([[0, 1, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 1, 1],
                   [1, 0, 0]])

Multilabel cases

(what is a Multilabel case is not defined in the scikit-learn documentation)

Multiclass problems are binarized and treated like the corresponding multilabel problem

y_pred = [0, 2, 1, 2]
y_true = [0, 1, 2, 2]

An additional test with a R library which uses the equation form

TP / (TP + FP + FN)

results in an undefined behavior:

library(ClusterR)
pc - c(0, 1, 2, 5, 6, 8, 9)
tc - c(0, 2, 3, 4, 5, 7, 9)
external_validation(pc, tc, method = jaccard_index)
[1] NaN

Is using the set based formula only suitable for binary vectors?

$$ J(A,B) = {{|A \cap B|}\over{|A \cup B|}} = {{|A \cap B|}\over{|A| + |B| - |A \cap B|}} $$

Topic metric jaccard-coefficient similarity clustering

Category Data Science


The Jaccard coefficient (or Jaccard similarity) is defined on two sets $A$ and $B$:

$$ J(A,B) = {{|A \cap B|}\over{|A \cup B|}} = {{|A \cap B|}\over{|A| + |B| - |A \cap B|}} $$

There is a single definition for this coefficient, but note that Jaccard is a general similarity measure, it is not specifically designed as an evaluation measure. So assuming one wants to use it as an evaluation measure, they need to decide (design) how. More precisely confusion may arise from the following questions:

  • Which pair of sets does one want to compare?
  • Which data structure does one use to represent a set?

Let's start with the second question: a binary vector is a common and convenient way to represent (encode) a set. Given a set of possible elements (universe) $U=\{x_1, ...,x_n\}$, any subset $s\subseteq U$ can be represented as a binary vector of length $n$ where the $i^{th}$ boolean is 1 if $x_i\in s$, 0 otherwise. It follows from this representation that the Jaccard coefficient is the number of indexes where both vectors contain 1 divided by the the number of indexes where at least one of the two vectors contain 1.

Example from scikit documentation:

>>> y_true = np.array([[0, 1, 1],
...                    [1, 1, 0]])
>>> y_pred = np.array([[1, 1, 1],
...                    [1, 0, 0]])
>>> jaccard_score(y_true[0], y_pred[0])
0.6666...

The vectors compared are [0, 1, 1] and [1, 1, 1] (meaning for instance sets $\{b,c\}$ and $\{a,b,c\}$), so the result is 2/3.

Multi-label classification is when an instance can be assigned any number of classes. In the linked scikit example they apply the Jaccard coefficient to the two sets of classes for every instance (optionally aggregating the results across instances). This is an example of a design choice as mentioned in my first question above. Note: in the case of the multiclass problem, I don't even understand how they obtain this result for the third value so I can't comment about it.

The formula $TP / (TP + FP + FN)$ is another example of what does one choose as sets. This formula only makes sense for a binary classification task: TP is the intersection of true and predicted positive instances, $TP+FP+FN$ is the union of all positive cases (whether true or predicted). This is why it doesn't work with integer vectors, since these cannot represent the result of binary classification.

Conclusion: one must define which sets they want to compare. The most simple way with integer vectors is to consider them as sets of integers, in this case they could be represented as binary vectors as follows:

[0, 1, 2, 5, 6, 8, 9] -> [1,1,1,0,0,1,1,0,1,1]
[0, 2, 3, 4, 5, 7, 9] -> [1,0,1,1,1,1,0,1,0,1]

But going through the binary representation is not a must, it's only needed if one wants to use a function which takes this format as input. One could as well define their own function using a set as data structure for instance.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.