Similarity of search results using Jaccard

I have a set of search results with ranking position, keyword and URL. I want to make a distance matrix so I can cluster the keywords (or the URLs). One approach would be to take the first n URL rankings for each keyword and use Jaccard similarity.

However, I also want higher-ranked positions to be weighted more heavily than lower-ranked ones - for example, two keywords that share a URL in positions 1 and 2 are more similar than two keywords that share a URL in positions 39 and 40.

I saw it suggested that I could add higher-ranked URLs multiple times, for example:

keyword_1 = [1,1,1,1,1,2,2,2,2,3,3,3,4,4,5]

where each number is the id of a URL that ranks in positions 1-5 for keyword_1.

This means that I can't use, for example, scikit-learn's Jaccard implementation, because it assumes sets.

I had a go at implementing this myself, and intuitively the results seem to make sense, but I would like it to run faster, as I could use data for rankings down to position 100. There is a lot of looping involved - is there a way of using numpy to make this code more efficient?

Alternatively, is there a different approach that I haven't found to use already built algorithms?

import numpy as np


def jaccard_sim_with_dupes(item1, item2):
    # Count, with multiplicity, the elements of each list that appear
    # anywhere in the other list.
    intersection = 0
    for i in item1:
        if i in item2:
            intersection += 1
    for j in item2:
        if j in item1:
            intersection += 1
    union = len(item1) + len(item2)
    return intersection / float(union)


def make_jaccard_distance_matrix(X):
    shape = (len(X), len(X))
    jaccard_distance_matrix = np.zeros(shape)
    for i in range(shape[0]):
        for j in range(shape[1]):
            jaccard_distance_matrix[i][j] = 1 - jaccard_sim_with_dupes(X[i], X[j])
        if i % 100 == 0:
            print("\r{0:.0%}".format(float(i) / shape[0]))
    return jaccard_distance_matrix

Here X is a list of keywords, where each keyword is represented as a list of URL ids, with higher-ranked URLs added multiple times.

X = [[1,1,1,1,1,2,2,2,2,3,3,3,4,4,5], [6,6,6,6,6,1,1,1,1,7,7,7,8,8,9]]

make_jaccard_distance_matrix(X)

returns

array([[ 0. ,  0.7],
       [ 0.7,  0. ]])
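
(The only shared URL id here is 1, which appears five times in keyword_1's list and four times in keyword_2's, so the similarity is (5 + 4) / (15 + 15) = 0.3 and the distance is 1 - 0.3 = 0.7.)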



I realised that if I used the format

X = np.array([[5,4,3,2,1,0,0,0,0],[4,0,0,0,0,5,3,2,1]])

where each row is a keyword, each column is a URL id, and each value is the weight of that URL's rank, then with a sparse matrix from the scipy module I could use efficient matrix calculations. This ran in about 2 seconds for 7k samples and 400 features.
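
As an aside, here is a minimal sketch of how a weight matrix in this format might be built from the ranked URL ids (the rankings_to_weight_matrix helper and its 1-based URL ids are my assumptions, not part of the original code):

import numpy as np

def rankings_to_weight_matrix(rankings, n_urls):
    # Hypothetical helper: each row is a keyword, each column a URL id
    # (1-based), and each value the rank weight: position 1 gets the
    # largest weight, the last position gets weight 1.
    X = np.zeros((len(rankings), n_urls), dtype=int)
    for row, urls in enumerate(rankings):
        for pos, url in enumerate(urls):
            X[row, url - 1] = len(urls) - pos
    return X

# rankings_to_weight_matrix([[1, 2, 3, 4, 5], [6, 1, 7, 8, 9]], n_urls=9)
# reproduces the array above.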

from scipy.sparse import csr_matrix
import numpy as np


def jaccard_sim_matrix(X):
    """X is an integer array of rank weights."""

    sparseX = csr_matrix(X)

    # Make a binary copy of the matrix (the copy() matters: without it,
    # setting binX.data would also overwrite sparseX's weights).
    binX = sparseX.copy()
    binX.data[:] = 1

    # Mirrors the double counting in jaccard_sim_with_dupes: each row's
    # weights are summed over the features present in the other row.
    intersection = (sparseX * binX.T) + (binX * sparseX.T)

    rowwise_sum = np.sum(sparseX, axis=1)
    union = (np.repeat(rowwise_sum, intersection.shape[0], axis=1)
             + np.repeat(rowwise_sum.T, intersection.shape[0], axis=0))

    return intersection / union

Then of course I got the distance by

1 - similarity
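
Putting it together with the hypothetical rankings_to_weight_matrix helper sketched above:

X = rankings_to_weight_matrix([[1, 2, 3, 4, 5], [6, 1, 7, 8, 9]], n_urls=9)
distance_matrix = 1 - jaccard_sim_matrix(X)
# reproduces the 0.7 off-diagonal distance from the first example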


You can use a Counter to store the number of times each element appears. That class lets you compute the intersection and union of two multisets with the & and | operators, and sum(c.values()) returns the number of items in the result, counting multiplicities. This lets you implement jaccard_sim_with_dupes more efficiently (in linear time rather than quadratic time).
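
For instance, a minimal linear-time sketch along those lines (note that & and | give the standard multiset Jaccard, which weights repeats slightly differently from the double-membership count in jaccard_sim_with_dupes above):

from collections import Counter

def jaccard_sim_counter(item1, item2):
    # & keeps the minimum count per element (multiset intersection),
    # | keeps the maximum count per element (multiset union)
    c1, c2 = Counter(item1), Counter(item2)
    intersection = sum((c1 & c2).values())
    union = sum((c1 | c2).values())
    return intersection / union

# jaccard_sim_counter([1, 1, 2], [1, 2, 2]) == 2 / 4 == 0.5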
