Collaborative filtering using graph and machine learning

What are the advantages and disadvantages of collaborative-filtering-based recommendation using a machine learning approach versus a graph-based approach?

Say I have user purchase data (user_name, user_location, user_company_name, product_name, product_price, product_ingredients) and would like to recommend products to a user based on what other users from the same location or company are buying, as well as on product price, ingredients, etc.

How do I decide which of them is suitable for a given use case? I would like to evaluate Neo4j (graph database) and Mahout (machine learning) for collaborative filtering.

Topic apache-mahout graphs neo4j machine-learning

Category Data Science


One advantage of many ML-based recommendation techniques is that they let you work in a lower-dimensional space. Matrix factorization techniques, for example, let you view a user or an item in terms of a learned latent-variable space. Training the model is often very expensive, but once it is trained, computations such as scoring an item for a user become cheap.
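
To make the latent-space idea concrete, here is a minimal sketch of matrix factorization trained with stochastic gradient descent in plain NumPy. It is not Mahout's implementation; the toy `ratings` matrix, the hyperparameters, and the convention that 0 means "not rated" are all assumptions chosen for illustration.

```python
import numpy as np

def factorize(ratings, n_factors=2, n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Factor a user x item matrix into two low-rank matrices with SGD.

    Entries equal to 0 are treated as unobserved (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
    item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
    observed = np.argwhere(ratings > 0)  # (user, item) index pairs
    for _ in range(n_epochs):
        rng.shuffle(observed)  # visit the observed ratings in random order
        for u, i in observed:
            p_u = user_factors[u].copy()
            err = ratings[u, i] - p_u @ item_factors[i]
            # Gradient step on squared error with L2 regularization.
            user_factors[u] += lr * (err * item_factors[i] - reg * p_u)
            item_factors[i] += lr * (err * p_u - reg * item_factors[i])
    return user_factors, item_factors

# Hypothetical toy data: 4 users x 4 items, 0 = not rated.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

user_factors, item_factors = factorize(ratings)

# After training, scoring every item for every user is a single matrix product.
predicted = user_factors @ item_factors.T
```

Each user and each item ends up as a vector of only `n_factors` numbers, which is the lower-dimensional space mentioned above. The expensive part is the training loop; the final scoring step is cheap.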

Some of the best results on the MovieLens recommendation dataset have been achieved by autoencoders. These also have the benefit of reducing the dimensionality of the problem.

Furthermore, learned representations of the data likely capture combinations of features (possibly non-linear ones, in the case of autoencoders), whereas navigating the edges of a graph is likely to focus on one feature at a time.

If your idea for using Neo4j came from here, one thing to remember is that the data you're talking about is not just ratings/likes data (common in collaborative filtering) but also content-based data. You might want to read about hybrid recommender systems, which leverage both content-based and collaborative-filtering-based recommendation.


I can only speak about graphs:

Advantages:

  • Using graphs, you can easily find products bought or rated by users who bought or liked an item, or users who have similar "taste" to another user. In my experience, the traversal is fast enough.
  • Closely matched products are easy to find, depending on how you model your graph, e.g. (userA)-[]->(basketA)-[with]->(productA)<-[with]-(basketB)<-[]-(userB). Finding similar products by basket is computationally cheap here.
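
The basket traversal above can be sketched without a graph database at all. The snippet below uses plain Python dictionaries as a stand-in for the graph, so it shows the shape of the query rather than anything Neo4j-specific; the `baskets` data and the `also_bought` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: basket id -> products in that basket.
baskets = {
    "basketA": {"bread", "butter", "jam"},
    "basketB": {"bread", "butter"},
    "basketC": {"bread", "milk"},
    "basketD": {"tea", "milk"},
}

# Reverse index (product -> baskets containing it), i.e. the incoming
# "with" edges of each product node.
product_to_baskets = defaultdict(set)
for basket_id, products in baskets.items():
    for product in products:
        product_to_baskets[product].add(basket_id)

def also_bought(product, top_n=3):
    """Walk product -> baskets -> other products and count co-occurrences."""
    counts = Counter()
    for basket_id in product_to_baskets.get(product, ()):
        for other in baskets[basket_id]:
            if other != product:
                counts[other] += 1
    return counts.most_common(top_n)

bread_neighbours = also_bought("bread")
```

The work done is proportional to the number of baskets containing the product and their sizes, which is why this kind of two-hop lookup tends to be cheap.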

Disadvantages:

  • Depending on how you rank your results, you can easily fall into the trap of constantly recommending the same products. Users who bought A also bought B, so you suggest B. People then buy or like B, and the next time you run the query, B keeps coming out on top.
  • Your results will rarely uncover new products. You'd have to find smart ways around this, e.g. finding other products within the same price range or in the same category.
  • If you have very large data sets, you'll have to use a distributed graph database that performs well (this is not as easy as it sounds, unless you're willing to pay large sums of money).
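
One way to read the price-range and category workaround mentioned in the list above is as an attribute-based fallback that does not depend on purchase history at all. The sketch below is only an illustration: the `Product` fields, the sample `catalog`, and the 25% price tolerance are assumptions, not part of any particular system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    name: str
    price: float
    category: str

# Hypothetical catalog.
catalog = [
    Product("oat soap", 4.0, "soap"),
    Product("lavender soap", 4.5, "soap"),
    Product("luxury soap", 15.0, "soap"),
    Product("mint shampoo", 4.2, "shampoo"),
    Product("honey soap", 3.8, "soap"),
]

def similar_by_attributes(seed, catalog, already_bought, price_tolerance=0.25):
    """Return same-category products priced within +/- price_tolerance of seed."""
    low = seed.price * (1 - price_tolerance)
    high = seed.price * (1 + price_tolerance)
    matches = [
        p for p in catalog
        if p.name != seed.name
        and p.name not in already_bought
        and p.category == seed.category
        and low <= p.price <= high
    ]
    # Closest price first.
    return sorted(matches, key=lambda p: abs(p.price - seed.price))

seed = catalog[0]
suggestions = similar_by_attributes(seed, catalog, already_bought={"lavender soap"})
```

Because the filter looks only at product attributes, it can surface items that nobody has bought yet, which the co-purchase queries cannot do.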

In my limited experience with ML for collaborative filtering, when your data grows large (50 GB+), building a model takes a considerable amount of time (hours or days), and you're not likely to get good recommendations for new products. Having to update your model becomes a huge problem too. Based on that experience, I lean towards graphs for small use cases.

Note that in both cases, newly added products that nobody has bought or liked yet are never recommended. I mention this because the goal of recommendations is to help users discover new products. I wrote a blog post about modelling graphs, and I discuss recommendations in some of the examples here, so you can skip to that section to get an idea of how to model this kind of problem.

I advise you to read up on Amazon's paper on their approach to collaborative filtering, which is pretty simple in theory and yields good results for them. I think you'd always want to combine a few different approaches to tackle different parts of the problem.
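
As far as I understand the published description of Amazon's item-to-item approach, the core idea is to compare items through the customers who bought them, for example with cosine similarity, and then recommend items similar to what a user already owns. The following is a rough NumPy sketch of that idea under those assumptions, not a reproduction of Amazon's system; the `purchases` matrix and both helper functions are made up for illustration.

```python
import numpy as np

# Hypothetical purchase matrix: rows = users, columns = items A, B, C, D.
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

def item_similarity(purchases):
    """Cosine similarity between item columns, with self-similarity zeroed."""
    norms = np.linalg.norm(purchases, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items nobody bought
    normalized = purchases / norms
    sim = normalized.T @ normalized
    np.fill_diagonal(sim, 0.0)
    return sim

def recommend(user_index, purchases, sim, top_n=2):
    """Rank items the user does not own by summed similarity to owned items."""
    owned = purchases[user_index] > 0
    scores = sim[:, owned].sum(axis=1)
    scores[owned] = -np.inf  # never re-recommend something already bought
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked[:top_n] if scores[i] > 0]

sim = item_similarity(purchases)
recommended_for_first_user = recommend(0, purchases, sim)
```

The similarity table can be computed offline, so serving a recommendation only needs a lookup over the items a user already has. Note that an item with no co-purchases, like item D here, never gets recommended, which is the same new-product limitation described above.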

Good luck.
