Nearest-neighbor search for very high-dimensional data
I have a big sparse matrix of users and the items they like (on the order of 1M users and 100K items, with very low density). I'm exploring ways to perform kNN search on it. Given the size of my dataset and some initial tests I performed, I assume the method I use will need to be either parallel or distributed. So I'm considering two classes of possible solutions: those that are available (or reasonably easy to implement) on a single multicore machine, and those that run on a Spark cluster, i.e. as a MapReduce program. Here are three broad ideas that I considered:
- Assuming a cosine similarity metric, computing the full product of the row-normalized matrix with its transpose, implemented as a sum of outer products (a blocked sketch follows the list)
- Using locality-sensitive hashing (LSH) (a toy random-hyperplane sketch follows the list)
- Reducing the dimensionality of the problem first, e.g. with PCA (a truncated-SVD sketch follows the list)
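To make the first idea concrete, here is a rough single-machine sketch using scipy/scikit-learn (my own library choice, with toy sizes standing in for the real ~1M × 100K matrix): it L2-normalizes the rows and multiplies one block of rows at a time against the transpose, keeping only the top-k entries per row so the full user-by-user similarity matrix is never materialized. `block_size` and `k` are placeholder values I would still need to tune.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.preprocessing import normalize

def cosine_knn_blocked(X, k=10, block_size=1000):
    """Top-k cosine neighbors for each row of X, computed block by block."""
    Xn = normalize(X, norm="l2", axis=1)           # unit-norm rows
    n = Xn.shape[0]
    all_idx = np.zeros((n, k), dtype=np.int64)
    all_sim = np.zeros((n, k))
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sims = (Xn[start:stop] @ Xn.T).toarray()   # one block of the full product
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf  # mask self-similarity
        top = np.argpartition(-sims, k, axis=1)[:, :k]        # unordered top-k per row
        rows = np.arange(stop - start)[:, None]
        order = np.argsort(-sims[rows, top], axis=1)          # sort just those k
        all_idx[start:stop] = top[rows, order]
        all_sim[start:stop] = sims[rows, top[rows, order]]
    return all_idx, all_sim

# Toy example; the real matrix would be on the order of 1M x 100K
X = sparse_random(5000, 1000, density=0.01, format="csr", random_state=0)
idx, sim = cosine_knn_blocked(X, k=5)
```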
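For the LSH idea, a toy random-hyperplane (SimHash-style) version for cosine similarity might look like the sketch below: each row is hashed into a bucket according to the signs of a few random projections, and only candidates that share a bucket in at least one table are re-ranked with exact cosine similarity. The numbers of tables and bits are arbitrary placeholders, and this is not a tuned implementation; on the Spark side I could presumably use the LSH primitives in `pyspark.ml.feature` instead.

```python
import numpy as np
from collections import defaultdict
from scipy.sparse import random as sparse_random
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Toy stand-in for the real user-item matrix, rows normalized for cosine similarity
X = normalize(sparse_random(10000, 1000, density=0.01, format="csr", random_state=0))

n_tables, n_bits = 8, 16              # arbitrary placeholder settings
d = X.shape[1]
powers = 1 << np.arange(n_bits)       # pack the sign bits into one integer key

tables = []
for _ in range(n_tables):
    planes = rng.standard_normal((d, n_bits))          # random hyperplanes
    signs = (X @ planes) > 0                            # sign pattern per row
    keys = np.asarray(signs, dtype=np.int64) @ powers   # one bucket id per row
    buckets = defaultdict(list)
    for row, key in enumerate(keys):
        buckets[key].append(row)
    tables.append((keys, buckets))

def query(i, k=10):
    """Approximate top-k cosine neighbors of row i: union of its buckets, re-ranked exactly."""
    candidates = set()
    for keys, buckets in tables:
        candidates.update(buckets[keys[i]])
    candidates.discard(i)
    cand = np.fromiter(candidates, dtype=np.int64)
    sims = (X[i] @ X[cand].T).toarray().ravel()          # exact cosine, candidates only
    return cand[np.argsort(-sims)[:k]]

neighbors_of_0 = query(0)
```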
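For the dimensionality-reduction route, plain PCA would require centering the matrix (which destroys sparsity), so the sketch below uses truncated SVD instead, followed by an exact kNN search on the much smaller dense matrix. The number of components and the toy matrix size are placeholders.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

# Toy stand-in for the real ~1M x 100K user-item matrix
X = sparse_random(20000, 2000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)   # placeholder rank
Z = normalize(svd.fit_transform(X))    # dense reduced matrix, unit-norm rows

# On unit vectors, Euclidean and cosine rankings coincide, so a standard
# Euclidean kNN index returns cosine neighbors.
nn = NearestNeighbors(n_neighbors=11, n_jobs=-1).fit(Z)
dist, idx = nn.kneighbors(Z)
neighbors = idx[:, 1:]                 # drop each row's self-match
```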
I'd appreciate any thoughts or advice on other possible ways to tackle this problem.
Topic map-reduce dimensionality-reduction distributed machine-learning
Category Data Science