Using Spark for finding similar users to a user?

I read about Mahout's co-occurrence algorithms on Spark (https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html) but couldn't find a Spark library that implements them.

I have a columnar dataset of strings covering around 15-20 million users, with their show_watched, times_watched, genre, channel and some more columns. I need to calculate lookalikes for a single user (or for a batch of 100k users).

How do I find lookalikes for them in less time?

I have tried indexing the data in Solr and then using Solr MLT (MoreLikeThis) to find similar users, but that takes a lot of time. MLT also ranks by TF-IDF, whereas I need users whose times_show_watched is close to that user's times_show_watched.

Can anyone recommend a better approach for this, maybe using any other framework for faster processing?

I also tried clustering the users with Spark MLlib and then looking up which cluster a user belongs to, so that the search space is smaller, but I couldn't get this approach finished.

I am open to any approaches which would be efficient.

Thanks!

Topic similar-documents apache-mahout apache-spark

Category Data Science


Mahout PMC member here. We're in the middle of a site re-org at the moment, and things are... well, they're a mess.

Here's a link to something I think is more useful: a tutorial on co-occurrence in Spark.

http://mahout.apache.org/docs/latest/tutorials/cco-lastfm/
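To give a rough feel for what "co-occurrence" means before you dive into the tutorial, here is a toy sketch in plain Python. It is not Mahout code and not how Mahout is implemented: the `watched` data, the user and show names, and the `cooccurrence_counts` and `lookalikes` helpers are all made up for illustration. It only counts how many shows two users share, and it pairs users instead of items because the question asks for lookalike users. As far as I understand it, the Mahout version also runs distributed on Spark and filters the counts statistically, and none of that is shown here.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: user -> set of shows watched.
watched = {
    "u1": {"show_a", "show_b", "show_c"},
    "u2": {"show_a", "show_b"},
    "u3": {"show_c", "show_d"},
    "u4": {"show_d"},
}

def cooccurrence_counts(watched):
    """Count, for every pair of users, how many shows both have watched."""
    # Invert to show -> viewers so only users who share a show get paired.
    viewers = defaultdict(set)
    for user, shows in watched.items():
        for show in shows:
            viewers[show].add(user)

    counts = Counter()
    for users in viewers.values():
        # Sort so each pair is always stored in the same (a, b) order.
        for a, b in combinations(sorted(users), 2):
            counts[(a, b)] += 1
    return counts

def lookalikes(user, counts, top_n=3):
    """Return up to top_n (other_user, shared_shows) pairs, most shared first."""
    scored = []
    for (a, b), shared in counts.items():
        if a == user:
            scored.append((b, shared))
        elif b == user:
            scored.append((a, shared))
    # Ties are broken by user id to keep the output deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]

counts = cooccurrence_counts(watched)
print(lookalikes("u1", counts))  # [('u2', 2), ('u3', 1)]
```

Note that this ignores times_watched entirely: it only looks at whether a show was watched at all, so treat it as an illustration of the counting idea and not as a solution to the "close times_show_watched" requirement.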

Re "a Spark library": well, Mahout IS the Spark library.

To use Mahout (Scala only, sorry if you're a Python-phile, though the syntax, especially for Mahout, is very pleasant), you can download Mahout and run ./mahout spark-shell from the bin/ directory. Or, if you like GUIs, notebooks and Apache Zeppelin, check out this tutorial for setting up Mahout+Spark on Zeppelin:

http://mahout.apache.org/docs/latest/tutorials/misc/mahout-in-zeppelin/

(If you are compiling a Jar, you just add Mahout as a dependency.)
