Item Based Collaborative Filtering with No Ratings

I am building a recommender for web pages. For each web page in our data set, we wish to generate a list of web pages that other users have also visited.

Our data only shows whether or not a user has visited a page. Users do not provide any ratings of our web pages. This is a good task for item-based recommendation. However, most of the algorithms (such as the one in Mahout) require rating data.

The first solution I came up with was to use a graph database and write a query which does the following:

For each page we want recommendations for, we find all the users who have viewed that page. Then, for each of those users, we look up all the other pages they have viewed. We then count the number of users who have viewed each page in this set, and use the pages with the highest counts as our recommendations.
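The counting logic described above can be sketched outside a graph database. This is a minimal in-memory illustration, assuming the visit log is available as (user, page) pairs; the function name `co_visited` and the `top_n` parameter are made up for the example and are not part of any library.

```python
from collections import Counter, defaultdict


def co_visited(visits, top_n=3):
    """Return {page: [(other_page, shared_user_count), ...]} from (user, page) pairs."""
    # Group the pages each user has visited; a set ignores repeat visits.
    pages_by_user = defaultdict(set)
    for user, page in visits:
        pages_by_user[user].add(page)

    # For every page, count how many users also visited each other page.
    counts = defaultdict(Counter)
    for pages in pages_by_user.values():
        for page in pages:
            for other in pages:
                if other != page:
                    counts[page][other] += 1

    # Highest counts first; ties are broken by page name for stable output.
    return {
        page: sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for page, c in counts.items()
    }
```

The inner double loop is quadratic in the number of pages per user, so a very active user (or a crawler) dominates the cost; capping or sampling such users is a common workaround, but whether it is acceptable depends on the data.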

While this works pretty well, our data set has grown substantially and scaling the graph database is difficult. The queries become slower as the number of page views in our data set increases. We would like to consider a different implementation before we commit to moving to a distributed graph database.

In a more traditional item-based recommender (like Mahout's), is there a good way to 'fake' the rating data, or is there a popular open source implementation which does not require rating data?

Topic apache-mahout recommender-system open-source

Category Data Science


You could try using other metrics to measure interest. For an article, one example is "time on page"; if you can measure scroll depth, even better. You could give a "5" rating if the user spent more than n seconds on the page (where n is the average time it takes to read the article) or if the user scrolled all the way to the bottom.

A "3" could be given if the user didn't show a lot of interest in the article, for example.
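As a rough sketch of that mapping, assuming you log the seconds spent on the page and the scroll depth as a fraction between 0 and 1: the function name, parameter names, and the exact cut-offs below are illustrative assumptions and would need tuning against your own data.

```python
def implicit_rating(seconds_on_page, scroll_fraction, read_time_seconds):
    """Turn engagement signals into a pseudo-rating (5 = engaged, 3 = little interest)."""
    # Spent longer than the average read time, or reached the bottom of the page.
    if seconds_on_page > read_time_seconds or scroll_fraction >= 1.0:
        return 5.0
    # Visited the page but showed little sign of reading it.
    return 3.0
```

The resulting values can then be fed to a recommender that expects explicit ratings.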


There are different ways to do this, such as treating views or clicks as implicit ratings.

But basically, you can consider a rating of 1.0 for each user-item pair you have.

This way, your predictions will fall between 0 and 1, which you can treat much like a predicted click probability:

  • 0 meaning the user is not expected to click on the item
  • 1 meaning the user is expected to click on the item.

You can also set a threshold and drop any recommendation whose predicted score falls below it.

For example, say a user alpha has the following recommendations, with a threshold of 0.7:

  • i1, 0.87
  • i2, 0.75
  • i3, 0.6

You can drop the i3 recommendation since you don't consider it good enough.
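The thresholding step is a simple filter. Here is a minimal sketch, assuming predictions arrive as (item, score) pairs; the function name `filter_recommendations` is made up for the example.

```python
def filter_recommendations(scored_items, threshold=0.7):
    """Keep only the (item, score) pairs whose score reaches the threshold."""
    return [(item, score) for item, score in scored_items if score >= threshold]


# The example for user alpha: i3 scores below 0.7 and is dropped.
alpha = [("i1", 0.87), ("i2", 0.75), ("i3", 0.6)]
kept = filter_recommendations(alpha)
```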

Nevertheless, the parameters of your recommendation engine must of course be tuned with the help of your evaluation metrics.

As for software, I use Apache Spark MLlib with Scala as the base for my recommendation engine algorithms. With it you can easily compute item cosine similarity, for example with an in-house implementation or with the approximation provided by the DIMSUM algorithm.
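To make the cosine similarity computation concrete, here is a minimal plain-Python sketch. It is not Spark MLlib and not DIMSUM; it computes the exact value for every pair of pages, which only works for small data. It assumes every visit counts as a rating of 1.0, in which case the cosine between two pages reduces to the number of shared visitors divided by the square root of the product of their visitor counts. The function name is made up for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_cosine_similarities(visits):
    """Return {(page_a, page_b): cosine} for pages sharing at least one visitor."""
    # Each page is a binary vector over users, stored as the set of its visitors.
    users_by_page = defaultdict(set)
    for user, page in visits:
        users_by_page[page].add(user)

    sims = {}
    for p, q in combinations(sorted(users_by_page), 2):
        shared = len(users_by_page[p] & users_by_page[q])
        if shared:
            # Binary vectors: dot product = shared visitors, norm = sqrt(visitor count).
            sims[(p, q)] = shared / sqrt(len(users_by_page[p]) * len(users_by_page[q]))
    return sims
```

Unlike the raw co-visit count, this normalises by page popularity, so a page that nearly everyone visits does not automatically top every list.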

I hope this helps!
