Jaccard similarity between two items

Calculating similarity between two users is rather straightforward.

Consider the following example:

User A = {7,3,2,4,1}
User B = {4,1,9,7,5}

Products in common = {1,4,7}
Union of products = {1,2,3,4,5,7,9}

Hence the Jaccard similarity: 3/7 = 0.429
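The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `jaccard` and the choice to return 0.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)


user_a = {7, 3, 2, 4, 1}
user_b = {4, 1, 9, 7, 5}

print(sorted(user_a & user_b))            # [1, 4, 7]
print(round(jaccard(user_a, user_b), 3))  # 0.429
```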

However, it is not clear to me how to calculate the similarity between two products. Say I want to calculate the similarity between products 7 and 1 from the previous example. How can I do that?



In any commerce setting, the concept of item similarity is not straightforward. Two users who usually buy the same kinds of products can be considered similar, but we cannot say the same about two items bought by the same user.

There are two different concepts of item similarity for recommendation purposes. One is whether the two items are physically similar, for example Blue Reebok Shoes and Red Reebok Shoes. The other is whether they depend on each other functionally, for example Reebok Shoes and Reebok Socks. To find physically similar items, one can create a dictionary of attributes describing each product and compute the Jaccard similarity on those attributes. For example:

Item A = {color: Blue, size: 10, material: Cotton, brand: Reebok}

Item B = {color: Red, size: 10, material: Cotton, brand: Reebok}

Thus, the intersection of the two sets is the set of attributes whose values match, i.e.

Intersection(A,B) = {size, material, brand}

Union(A,B) = {color, size, material, brand}

Jaccard Index = 3/4 = 0.75
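A rough sketch of this attribute comparison in Python follows. The function name `attribute_jaccard` is made up for illustration, and the code assumes each item is a plain dict of attribute name to value.

```python
def attribute_jaccard(item_a, item_b):
    """Share of attributes whose values match, out of all attributes seen."""
    all_attrs = set(item_a) | set(item_b)
    if not all_attrs:
        return 0.0  # assumed convention for items with no attributes
    matching = {
        attr for attr in all_attrs
        if attr in item_a and attr in item_b and item_a[attr] == item_b[attr]
    }
    return len(matching) / len(all_attrs)


item_a = {"color": "Blue", "size": 10, "material": "Cotton", "brand": "Reebok"}
item_b = {"color": "Red", "size": 10, "material": "Cotton", "brand": "Reebok"}

print(attribute_jaccard(item_a, item_b))  # 0.75
```

Note that this counts attributes, as in the example above. Taking the Jaccard index over (attribute, value) pairs instead would put both color values in the union and give 3/5, so the two conventions should not be mixed.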

To find behaviorally dependent items, a commonly used proxy is co-purchase: the more often two items are bought together in the same session, the more they are assumed to depend on each other, and the more valuable the recommendation. For this setting, one can create a matrix of the products each user buys in a single session. For m users and n products this is a sparse m x n matrix. Read column-wise, each column gives the set of users who bought that item in a session.

Thus,

Item A = {Ua, Ub, Uc}

Item B = {Ub, Ud}

Jaccard Index = 1/4 = 0.25
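Reading the matrix column-wise amounts to inverting a user-to-products mapping into a product-to-users mapping. The sketch below assumes purchases are held in a dict of sets; the helper names `jaccard` and `invert` are hypothetical. It reproduces the 0.25 above and then applies the same idea to products 7 and 1 from the question.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def invert(purchases):
    """Turn {user: set of items} into {item: set of users}."""
    item_users = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            item_users[item].add(user)
    return dict(item_users)


# The example from this answer: A bought by Ua, Ub, Uc; B bought by Ub, Ud.
sessions = {"Ua": {"A"}, "Ub": {"A", "B"}, "Uc": {"A"}, "Ud": {"B"}}
by_item = invert(sessions)
print(jaccard(by_item["A"], by_item["B"]))  # 0.25

# The data from the question: two users and their products.
question = {"A": {7, 3, 2, 4, 1}, "B": {4, 1, 9, 7, 5}}
by_product = invert(question)
print(jaccard(by_product[7], by_product[1]))  # 1.0, both users bought both
print(jaccard(by_product[7], by_product[2]))  # 0.5, only user A bought 2
```

With only two users the scores are coarse; the measure becomes more informative as the number of users grows.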


There are various ways to do it. One way is item-item collaborative filtering. Say you have n users and 100 songs, stored as an n x 101 matrix: one user column plus 100 song columns. Each user has liked x of the 100 songs.

Now, disregard the user column and build a 100 x 100 song-to-song matrix. For each song, i.e. each column, calculate the cosine similarity with the other 99 songs (the cosine of the n-row column for song1 with the n-row column for song2, and so on). This gives a similarity value for every pair of songs. Basically, songs that are liked together by many of the same users get a higher similarity. Going with the assumption that a listener usually likes a certain kind of song, a user may like only 10 songs of a similar kind, and we have such inputs from all n users. In the end, the similarity of each pair of songs is based on the inputs from n users.

You can then sort the similarity scores for each song and recommend the songs with the highest values to the user.
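The steps above can be sketched in plain Python. The toy `likes` matrix (4 users, 4 songs) is invented purely for illustration, and the helper names `cosine` and `most_similar` are assumptions, not part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical data: rows are users, columns are songs, 1 means "liked".
likes = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]

n_songs = len(likes[0])
# One vector per song, holding that song's value for every user.
columns = [[row[j] for row in likes] for j in range(n_songs)]
# Song-to-song similarity matrix.
sim = [[cosine(columns[i], columns[j]) for j in range(n_songs)]
       for i in range(n_songs)]


def most_similar(song, k=2):
    """Return the k other songs with the highest similarity to `song`."""
    scores = [(other, sim[song][other]) for other in range(n_songs) if other != song]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Song 1 ranks first for song 0: they share two of song 0's three listeners.
print(most_similar(0, k=1))
```

For real data the matrix is large and sparse, so a sparse-matrix library would normally replace the nested lists used here.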

Hope this helps!
