scalable tools to build kNN graph over sparse data

I'm looking for scalable tools to build kNN graph over sparse data points.

The dimension and number of data points can be both up to millions.

What I have tried already:

  • sklearn.neighbors.kneighbors_graph: which does brute-force search for sparse data, giving quadratic time.
  • flann: only supports dense arrays
  • pysparnn: the running time is not very satisfatory (maybe because it's written in Python)
  • knn search in mlpack: which only supports dense data
  • scipy.spatial.KDTree: which converts the sparse data to dense one
  • SparseLSH: which is implemented in Python, so I'm not quite sure about the scalability
  • elasticsearch: it seems to only support indexing documents, instead of sparse features.
    • the reason I thought of elasticsearch is: knn over sparse data can be framed as retrieving the top-k "documents" in IR.

Thanks for any comments/answers :)

Topic k-nn search-engine data-mining machine-learning

Category Data Science


I think what you might be looking for is,

L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning

They have multiple runtime options specifically for different kinds of datasets (including sparse data). The link for the same is : http://glaros.dtc.umn.edu/gkhome/node/1162

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.