Question on embedding similarity / nearest neighbor methods [ScaNN Paper]

In https://arxiv.org/abs/2112.04426 the DeepMind team writes:

For a database of T elements, we can query the approximate nearest neighbors in O(log(T)) time. We use the SCaNN library [https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html]

Could someone provide an intuitive explanation for this O(log T) time complexity of approximate nearest-neighbour (ANN) search?

Thanks!

A very Happy New Year, Earthlings!

Topic: deepmind, word-embeddings, deep-learning
Category: Data Science


It is a shame they don't give a better reference for their O(log(N)) claim, isn't it?

The obvious starting point is the paper the ScaNN library is based on: https://arxiv.org/abs/1908.10396

What they say there (section 4) is:

After we quantize a database of n points, we can calculate the dot product of a query vector q with all quantized points in O(kd + n) time. This is much better than the O(nd) time required for the original unquantized database.
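To make that O(kd + n) concrete, here is a minimal product-quantization scoring sketch in NumPy (my own illustration with made-up sizes, not ScaNN's actual code). The query is compared against the k codebook centers once, costing O(kd); after that, each of the n database points is scored with a handful of table lookups instead of a full d-dimensional dot product:

```python
import numpy as np

# Toy product quantization / asymmetric distance computation.
# Illustrative only -- sizes and codebooks are made up, not ScaNN's.
d, m, k, n = 128, 8, 256, 10_000   # dims, subspaces, centers per subspace, points
ds = d // m                        # dimensions per subspace

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(m, k, ds))   # k centers in each subspace
codes = rng.integers(0, k, size=(n, m))   # each point stored as m small codes
query = rng.normal(size=d)

# Step 1: O(k*d) -- partial dot products between the query and every center,
# one table of k values per subspace.
tables = np.einsum('jkd,jd->jk', codebooks, query.reshape(m, ds))

# Step 2: O(n*m) -- score every quantized point with m table lookups,
# never touching the raw d-dimensional database vectors.
scores = tables[np.arange(m), codes].sum(axis=1)

top10 = np.argsort(-scores)[:10]
print(top10)
```

Treating the number of subspaces m as a constant, step 2 is O(n), giving the quoted O(kd + n).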

The next useful place to read is https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md

They partition large datasets using a tree, so there is an obvious O(log(N)) search complexity in the tree itself. For the 2 trillion token dataset in the 2112.04426 paper, the square-root recommendation on that GitHub page implies around 1.4 million partitions (√(2×10¹²) ≈ 1.4×10⁶).
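As a rough sketch of the partition idea (an IVF-style toy in NumPy, assuming a single level of partitions picked naively; ScaNN's actual tree and training are more sophisticated): with √N partitions of roughly √N points each, a query first scores the partition centers and then searches only a few of the closest buckets.

```python
import numpy as np

# Toy single-level partitioned search -- an illustration of the idea
# behind ScaNN's tree partitioning, not its actual implementation.
rng = np.random.default_rng(0)
n, d = 40_000, 32
data = rng.normal(size=(n, d)).astype(np.float32)

n_partitions = int(np.sqrt(n))        # the sqrt(N) rule of thumb -> 200
# Cheap stand-in for learned k-means centers: random database points.
centers = data[rng.choice(n, n_partitions, replace=False)]
# Assign each point to its best-matching center (max dot product, MIPS-style).
assign = np.argmax(data @ centers.T, axis=1)
buckets = [np.flatnonzero(assign == c) for c in range(n_partitions)]

def ann_search(query, n_probe=4, top_k=10):
    # O(sqrt(N) * d): score only the partition centers.
    probe = np.argsort(-(centers @ query))[:n_probe]
    # O(n_probe * sqrt(N) * d) on average: search only the chosen buckets.
    cand = np.concatenate([buckets[c] for c in probe])
    scores = data[cand] @ query
    return cand[np.argsort(-scores)[:top_k]]

print(ann_search(rng.normal(size=d).astype(np.float32)))
```

In ScaNN the exhaustive scoring inside a bucket is replaced by the quantized scoring above, so the two tricks compose.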

That GitHub page also references the original product-quantization paper for approximate nearest-neighbour search: https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf

Table II there compares the time to find the k smallest distances under SDC and ADC (symmetric vs. asymmetric distance computation; I am assuming ADC is the same as the AH, asymmetric hashing, used in ScaNN). Unfortunately Table II is badly formatted, and two of the cells run into each other (?).

However, I tracked down what is basically the same paper by the same authors, https://hal.inria.fr/inria-00410767/document/ , which has a Table 2 giving the complexity as n + k log n.
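That n + k log n term looks like the standard cost of selecting the k smallest of n distances with a heap (my reading; the paper doesn't spell it out): O(n) to heapify, then k pops at O(log n) each. For example:

```python
import heapq
import random

# Selecting the k smallest of n distances via a heap:
# O(n) to heapify, then k pops at O(log n) each -> O(n + k log n).
n, k = 1_000_000, 10
distances = [random.random() for _ in range(n)]

heap = list(distances)
heapq.heapify(heap)                                    # O(n)
k_smallest = [heapq.heappop(heap) for _ in range(k)]   # O(k log n)
print(k_smallest)
```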

That doesn't resemble anything in Table II of the other paper, and it also isn't O(kd + n). A bit unsatisfying.

But, overall, I assume the O(log(N)) claim is based on the tree search over the partitions. And then, once you are searching inside a partition, the quantized scoring is effectively O(kd + √N), which is presumably considered small enough to drop from the stated complexity?
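Putting rough numbers on that back-of-envelope (treating the 2 trillion tokens as N items, as above):

```python
import math

N = 2e12                    # database size from the RETRO paper (tokens)
partitions = math.sqrt(N)   # the GitHub page's sqrt(N) rule of thumb
print(f"partitions:        {partitions:,.0f}")      # ~1.4 million
print(f"points/partition:  {N / partitions:,.0f}")  # also ~1.4 million
# A balanced binary tree over the partitions is only ~20 levels deep:
print(f"tree depth:        {math.log2(partitions):.0f}")
```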
