Similarity of search results using Jaccard
I have a set of search results with ranking position, keyword and URL. I want to make a distance matrix so I can cluster the keywords (or the URLs). One approach would be to take the first n URL rankings for each keyword and use Jaccard similarity
.
However, I also want higher position ranks to be weighted more highly than lower position ranks - for example two keywords that have the same URL in positions 1 and 2 are more similar than two keywords that have the same URL ranking in positions 39 and 40.
I saw suggested that I could add in higher ranks multiple times, for example:
keyword_1 = [1,1,1,1,1,2,2,2,2,3,3,3,4,4,5]
where the number is the id of the URLs that rank in positions 1-5 of keyword_1
.
This means that I can't use for example sklearn
Jaccard implementation because sets are assumed.
I had a go at implementing this myself and intuitively the results seem to make sense, but I would like it to run faster, as I could use data for rankings up to 100. There is a lot of looping involved - is there a way of using numpy better to make this code more efficient?
Alternatively, is there a different approach that I haven't found to use already built algorithms?
def jaccard_sim_with_dupes(item1, item2):
intersection = 0
for i in item1:
if i in item2:
intersection += 1
for j in item2:
if j in item1:
intersection += 1
union = len(item1) + len(item2)
return intersection / float(union)
def make_jaccard_distance_matrix(X):
shape = (len(X), len(X))
jaccard_distance_matrix = np.zeros(shape)
for i in range(shape[0]):
for j in range(shape[1]):
jaccard_distance_matrix[i][j] = 1 - JaccardDistanceMatrix.jaccard_sim_with_dupes(X[i], X[j])
if (i % 100 == 0):
print ("\r{0:.0%}".format(float(i) / shape[0]))
return jaccard_distance_matrix
Here X is a list of each keyword, where each keyword is represented as a list of URLs, with higher ranked keywords added in multiple times.
X = [[1,1,1,1,1,2,2,2,2,3,3,3,4,4,5], [6,6,6,6,6,1,1,1,1,7,7,7,8,8,9]]
make_jaccard_distance_matrix(X)
returns
array([[ 0. , 0.7],
[ 0.7, 0. ]])
Topic numpy jaccard-coefficient python
Category Data Science