Nearest-neighbor search for very high-dimensional data

I have a big sparse matrix of users and the items they like (on the order of 1M users and 100K items, with very low density). I'm exploring ways to perform kNN search on it. Given the size of my dataset and some initial tests I performed, my assumption is that the method I use will need to be either parallel or distributed. So I'm considering two classes of possible solutions: ones that are available (or implementable reasonably easily) on a single multicore machine, and ones that run on a Spark cluster, i.e. as MapReduce programs. Here are three broad ideas I have considered:

  • Assuming a cosine similarity metric, perform the full multiplication of the normalized matrix by its transpose (implemented as a sum of outer products; see the sketch after this list)
  • Using locality-sensitive hashing (LSH)
  • Reducing first the dimensionality of the problem with a PCA
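
For the first idea, here is a minimal single-machine sketch with SciPy. The sizes and parameters are toy values of my own choosing, not the real 1M × 100K matrix, which would need blocking over item columns or a distributed implementation:

```python
# Sketch of idea 1: item-item cosine similarity via a product of the
# L2-normalized sparse matrix with its transpose. Toy sizes only.
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

n_users, n_items, k = 10_000, 2_000, 10
user_item = sp.random(n_users, n_items, density=0.01, format="csr", random_state=0)

# L2-normalize each item column so that dot products become cosine similarities
item_norm = normalize(user_item.tocsc(), norm="l2", axis=0)

# item-item cosine similarity matrix (n_items x n_items, sparse)
sim = (item_norm.T @ item_norm).tocsr()

# top-k neighbors for one item
row = sim.getrow(0).toarray().ravel()
row[0] = -np.inf                      # exclude the item itself
top_k = np.argpartition(-row, k)[:k]  # indices of the k most similar items
print(top_k)
```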

I'd appreciate any thoughts or advice about other possible ways to tackle this problem.

Topic: map-reduce, dimensionality-reduction, distributed, machine-learning

Category: Data Science


You should use PySparNN, a recent Python implementation by Facebook built for sparse data. It is very fast and also easy to use.
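
A minimal sketch of building an index with it (toy sizes, not the real 1M × 100K matrix; the API shown follows the project's README, so verify it against the version you install):

```python
# Sketch: indexing a sparse user-item matrix with PySparNN (pip install pysparnn)
import numpy as np
import scipy.sparse as sp
import pysparnn.cluster_index as ci

n_users, n_items = 10_000, 5_000
user_item = sp.random(n_users, n_items, density=0.01, format="csr", random_state=0)

# index users by their item vectors; the payload here is just the user id
index = ci.MultiClusterIndex(user_item, np.arange(n_users))

# approximate nearest users (cosine distance by default) for the first 100 users
queries = user_item[:100]
neighbors = index.search(queries, k=5, k_clusters=2, return_distance=False)
print(neighbors[0])  # ids of the users most similar to user 0
```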


If you are working on collaborative filtering, you should pose the problem as a low-rank matrix approximation, in which both the users and the items are co-embedded into the same low-dimensional space. Similarity search is then much simpler. I recommend using LSH, as you suggested. Another fruitful avenue for dimensionality reduction not yet mentioned is random projection.
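
A hedged sketch of the random-projection route with scikit-learn (toy sizes and parameter values are illustrative, not taken from the question):

```python
# Sketch: reduce dimensionality with a sparse random projection, then run kNN
# in the reduced space. The Johnson-Lindenstrauss lemma bounds the distortion
# of pairwise distances under such projections.
import scipy.sparse as sp
from sklearn.random_projection import SparseRandomProjection
from sklearn.neighbors import NearestNeighbors

user_item = sp.random(10_000, 5_000, density=0.01, format="csr", random_state=0)

# project the 5,000-dimensional item space down to a few hundred dimensions
proj = SparseRandomProjection(n_components=256, random_state=0)
reduced = proj.fit_transform(user_item)

# exact kNN in the reduced space; at full scale you would swap this step
# for an LSH or other approximate index
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(reduced)
distances, indices = nn.kneighbors(reduced[:100])
print(indices[0])
```

SparseRandomProjection keeps the projected matrix sparse, which matters at this scale; a dense Gaussian projection would blow up memory long before the kNN step.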


I hope the following resources give you some additional ideas for solving the problem:

1) Research paper "Efficient K-Nearest Neighbor Join Algorithms for High Dimensional Sparse Data": http://arxiv.org/abs/1011.2807

2) Class project paper "Recommendation System Based on Collaborative Filtering" (Stanford University): http://cs229.stanford.edu/proj2008/Wen-RecommendationSystemBasedOnCollaborativeFiltering.pdf

3) Project for the Netflix Prize Competition (k-NN-based): http://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/index.html

4) Research paper "Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data" on the curse of dimensionality phenomenon and its relation to machine learning, in general, and k-NN algorithm, in particular: http://jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf

5) Software for sparse k-NN classification (free, but appears not to be open source - might clarify with authors): http://www.autonlab.org/autonweb/10408.html

6) Several relevant discussion threads on StackOverflow.

7) Pay attention to GraphLab, an open-source parallel framework for machine learning (http://select.cs.cmu.edu/code/graphlab), which supports parallel clustering via the MapReduce model: http://select.cs.cmu.edu/code/graphlab/clustering.html

You might also check my answer on sparse regression here on Data Science StackExchange for links to relevant R packages and CRAN Task View pages: https://datascience.stackexchange.com/a/918/2452.
