Best method for similarity searching on 10,000 data points with 8,000 features each in Python?

As mentioned in the title, I am attempting to search through 10,000 vectors with 8,000 features each, all in Python. Currently I have the vectors saved in their own directories as pickled NumPy arrays. The features were pulled from this deep neural network.

I am new to this space, but I have heard about M-Trees, R-Trees, inverted tables, and hashing. Is any one of them better for such a large number of features?

This implementation needs to be done quite rapidly and is just a prototype, so simplicity is valuable.

Thank you for your help.

Tags: search-engine, python, search

Category: Data Science


There are two main paths:

  1. Load all vectors into memory. If you can load all of the vectors into memory, then you might be able to search the space with "clever" brute force. One such method is described in this paper. (See the brute-force sketch after this list.)

  2. Keep vectors on disk. If you follow this path, then you have to index the vectors; you are essentially building a search engine. Common open-source search engines are Apache Solr and Elasticsearch.
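For path 1: at 10,000 × 8,000 float32 values, the whole matrix is only about 320 MB, so plain in-memory brute force is realistic for a prototype. Below is a minimal sketch (not the method from the linked paper) that assumes your pickled arrays have already been stacked into one matrix and that cosine similarity is the measure of "closeness"; the random placeholder data simply stands in for your real vectors.

```python
import numpy as np

# Assumption: all 10,000 vectors have been loaded from their pickles and
# stacked into one (10000, 8000) float32 matrix. Random data used here as a stand-in.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 8_000), dtype=np.float32)

# Normalise the rows once, so a dot product equals cosine similarity.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / np.clip(norms, 1e-12, None)

def top_k_similar(query, k=10):
    """Return indices and cosine similarities of the k most similar vectors."""
    q = query.astype(np.float32)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = unit_vectors @ q                    # (10000,) cosine similarities
    top = np.argpartition(scores, -k)[-k:]       # unsorted top-k candidates
    top = top[np.argsort(scores[top])[::-1]]     # sort candidates, best first
    return top, scores[top]

indices, sims = top_k_similar(vectors[0], k=5)
print(indices, sims)
```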

It also depends on what kind of search is needed; a working definition of a "close" vector is required. The most common definitions of "close" vectors can be found here.
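To illustrate why the definition matters, the two most common choices, Euclidean distance and cosine similarity, can disagree about how "close" two vectors are. The vectors below are made up purely for the example.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

print(euclidean(a, b))    # ~3.74 -> clearly "far" by Euclidean distance
print(1 - cosine(a, b))   # 1.0   -> identical by cosine similarity
```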

Faiss is a library for efficient similarity search of vectors. It is designed for large collections (> 1,000,000 vectors), each vector being relatively small (tens to hundreds of dimensions). It may or may not scale to your problem. If it does not scale because of the high dimensionality, you can reduce the dimensionality with Principal Component Analysis (PCA) or t-SNE.
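If you want to try Faiss, a minimal sketch could look like the following. It assumes the `faiss-cpu` package is installed, uses random placeholder data in place of your pickled vectors, and picks an arbitrary PCA target of 256 dimensions; the flat index is exact brute force, which keeps the prototype simple.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 8_000                       # original dimensionality
n = 10_000                      # number of database vectors
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d), dtype=np.float32)  # placeholder for your vectors
xq = xb[:3]                                         # placeholder queries

# Optional: reduce dimensionality first (here to an assumed 256 dims with Faiss's PCA).
pca = faiss.PCAMatrix(d, 256)
pca.train(xb)                   # training the PCA on 8,000-dim data takes a moment
xb_small = pca.apply_py(xb)
xq_small = pca.apply_py(xq)

# Exact (brute-force) L2 index on the reduced vectors.
index = faiss.IndexFlatL2(xb_small.shape[1])
index.add(xb_small)

D, I = index.search(xq_small, 5)  # distances and indices of the 5 nearest neighbours
print(I)
```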
