It is a shame they don't give a better reference for their O(log(N)) claim, isn't it?
You asked for papers, and the obvious starting point is the one the ScaNN library is based on: https://arxiv.org/abs/1908.10396
What they say there (section 4) is:
After we quantize a database of n points, we can calculate the dot product of a query vector q with all quantized points in O(kd + n) time. This is much better than the O(nd) time required for the original unquantized database.
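To make that O(kd + n) concrete: with a single codebook of k centres in d dimensions, you compute the query's dot product with every centre once (the kd term), and then each quantized database point only costs a table lookup (the n term). Here is a minimal numpy sketch of that idea, using plain vector quantization rather than the anisotropic loss the paper actually trains; the variable names are mine, not ScaNN's.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 10_000, 128, 256

    database = rng.standard_normal((n, d)).astype(np.float32)
    codebook = rng.standard_normal((k, d)).astype(np.float32)   # k centres, d dims

    # Offline quantization: assign each point to its Euclidean-nearest centre.
    sq_dists = (
        (database ** 2).sum(axis=1, keepdims=True)
        - 2.0 * database @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    codes = np.argmin(sq_dists, axis=1)          # shape (n,)

    def quantized_scores(query):
        table = codebook @ query                 # O(kd): query against every centre
        return table[codes]                      # O(n): one lookup per database point

    query = rng.standard_normal(d).astype(np.float32)
    approx = quantized_scores(query)             # O(kd + n)
    exact = database @ query                     # O(nd), for comparison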
The next useful place to read is https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md
They partition large datasets using a tree, so there is an obvious O(log(N)) search complexity. For the 2 trillion token dataset in the 2112.04426 paper, the square-root recommendation on that GitHub page implies roughly 1.4 million partitions.
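For what it's worth, here is the arithmetic behind that estimate, treating the 2 trillion figure directly as the number of indexed items (that reading is my assumption):

    import math

    # Back-of-envelope for the sqrt(N) leaf-count rule in the ScaNN docs,
    # taking N = 2e12 at face value as the number of indexed items.
    N = 2 * 10**12
    partitions = math.isqrt(N)                  # ~1.41 million leaves
    points_per_partition = N / partitions       # ~1.41 million points per leaf
    print(f"{partitions:,} partitions, ~{points_per_partition:,.0f} points each")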
That page also gives a reference for product quantization for approximate nearest neighbour search: https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
Table II compares the time to find the k smallest distances between SDC and ADC. (I am assuming the ADC there is the same as the AH, asymmetric hashing, used in ScaNN.) Unfortunately Table II is badly formatted, and two of the cells run into each other (?).
However, I tracked down basically the same paper by the same author, https://hal.inria.fr/inria-00410767/document/ , which has a Table 2 that says the complexity is n + k log n. That doesn't resemble anything in Table II in the other paper, and it also isn't O(kd + n). A bit unsatisfying.
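My reading of that n + k log n (an interpretation on my part, not something the paper spells out) is the classic heap-selection cost: you compute n approximate distances, heapify them in O(n), and then popping the k smallest costs O(k log n). In Python terms:

    import heapq
    import random

    # Stand-in distances; in the real pipeline these would be the n approximate
    # (ADC) distances already computed for the query.
    n, k = 1_000_000, 10
    distances = [random.random() for _ in range(n)]

    heap = [(dist, idx) for idx, dist in enumerate(distances)]
    heapq.heapify(heap)                                    # O(n)
    k_smallest = [heapq.heappop(heap) for _ in range(k)]   # O(k log n)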
But, overall, I assume that the O(log(N)) claim is based on the tree search over the partitions, and that once you are searching inside each partition, the cost is effectively O(kd + √N), which is small enough to throw away?
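Putting that reading together as raw numbers (all of this is my guess at how the pieces compose, not anything the ScaNN docs state):

    import math

    # Rough cost model for the reading above: a log-depth tree routes the query
    # to one partition, then an O(kd + sqrt(N)) asymmetric-hashing style scan
    # runs inside it. d and k are illustrative values, not the paper's.
    N = 2 * 10**12     # indexed items
    d = 128            # vector dimensionality
    k = 256            # codebook size

    partitions = math.sqrt(N)                 # sqrt rule from the ScaNN docs
    tree_cost = math.log2(partitions)         # ~20 comparisons if the tree is log-depth
    leaf_cost = k * d + N / partitions        # O(kd + sqrt(N)) in the chosen partition

    print(f"tree: ~{tree_cost:.0f} steps, per-partition scan: ~{leaf_cost:,.0f} ops")

Running it gives roughly 20 tree steps against about 1.4 million operations inside the chosen partition, which at least makes it clear which term the O(log(N)) is keeping and which it is dropping.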