What is the difference between Okapi BM25 and NMSLIB?

I was trying to build a search system and came across Okapi BM25, which is a ranking function in the same family as TF-IDF: you build an index of your corpus and can later retrieve the documents most relevant to a query.

I used the Python library rank_bm25 to build a search system, and the results were satisfying.

Then I came across something called the Non-Metric Space Library (nmslib). As I understand it, it is a similarity search library, much like the kNN algorithm.

I saw an example where someone built a smart search system using nmslib. He did the following:

  • tokenized the documents
  • passed the tokens into a fastText model to create word vectors
  • combined those word vectors with BM25 weights
  • passed the combination into nmslib
  • performed the search
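The combination step above (weighting word vectors by BM25 importance before averaging them into a document vector) can be sketched roughly like this. This is a minimal illustration with made-up toy vectors and weights standing in for real fastText output and real BM25/IDF values:

```python
import numpy as np

# Toy stand-ins for fastText word vectors (in practice, model.wv[token]).
word_vectors = {
    "apple":  np.array([1.0, 0.0]),
    "pie":    np.array([0.0, 1.0]),
    "recipe": np.array([1.0, 1.0]),
}

# Toy per-token importance weights (in practice, taken from the BM25 index,
# e.g. the IDF component of rank_bm25's BM25Okapi).
bm25_weights = {"apple": 2.0, "pie": 1.0, "recipe": 0.5}

def doc_vector(tokens):
    """Average of word vectors, each weighted by its BM25/IDF importance."""
    weighted = np.array([word_vectors[t] * bm25_weights[t] for t in tokens])
    total_weight = sum(bm25_weights[t] for t in tokens)
    return weighted.sum(axis=0) / total_weight

print(doc_vector(["apple", "pie"]))  # pulled toward "apple", the heavier token
```

The resulting document vectors are what gets fed into nmslib, so the index searches a dense-vector space rather than the raw BM25 term space.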


It was quite fast, but the results were not satisfying: even when I copy-pasted an exact query from a document, that document was not returned. The search system I built with rank_bm25, on the other hand, gave great results. So my conclusion was:

BM25 gave good results and nmslib gave faster results.

My questions are:

  • How do the two (BM25, nmslib) differ?
  • How can I pass bm25 weights to nmslib to create a better and faster search engine?
  • In short, how can I combine the goodness of both bm25 and nmslib?

Topic search-engine python-3.x nlp information-retrieval

Category Data Science


Note that I don't know nmslib and I'm not familiar with search optimization in general. However, I do know Okapi BM25 weighting.

How do they both (bm25, nmslib) differ?

These are two completely different things:

  • Okapi BM25 is a weighting scheme with a better theoretical basis than the well-known TF-IDF weighting scheme. Both methods score words according to how "important" they are in the context of a document collection, mostly by giving more weight to words which appear rarely. As a weighting scheme, Okapi BM25 only provides a representation of the documents/queries; what you do with that representation is up to you.
  • nmslib is an optimized similarity search library. I assume it takes as input any set of vectors for the documents and the query, so one could provide it with vectors made of raw frequencies, TF-IDF weights, or anything else. What it does is compute (as fast as possible) the most similar documents to a query, using whatever representation of the documents is provided.

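To make the first point concrete, here is a small self-contained sketch of the Okapi BM25 scoring function (the standard formula with the usual k1 and b defaults; the corpus and tokens are toy examples, not from the question):

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a query.

    corpus is a list of tokenized documents (lists of tokens)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    score = 0.0
    for q in query_tokens:
        n_q = sum(1 for d in corpus if q in d)       # document frequency of q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # rare words weigh more
        f = doc_tokens.count(q)                      # term frequency in this doc
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        )
    return score

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cats", "and", "dogs"]]
print(bm25_score(["cat"], corpus[0], corpus))  # larger: "cat" is rare in the corpus
print(bm25_score(["the"], corpus[0], corpus))  # smaller: "the" appears in most docs
```

This is essentially what rank_bm25 computes for you over the whole corpus at once; the point is that BM25 is a scoring formula, not a search data structure.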
How can I pass bm25 weights to nmslib to create a better and faster search engine?

Since you mention that the results based on BM25 alone are satisfying, the loss of quality must come from the nmslib search optimizations. There is no magic: the only way to make search fast is to do fewer comparisons, and sometimes that means mistakenly discarding a good candidate. So the problem is not about passing the BM25 weights; it is about understanding and tuning the parameters of nmslib. There are certainly parameters which let the user select an appropriate trade-off between speed and quality.
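For nmslib's default HNSW method, the speed/quality trade-off is controlled mainly by the graph-construction parameters (M, efConstruction) and the query-time parameter efSearch. A hedged sketch of where those knobs go (the parameter values below are illustrative starting points, not tuned recommendations):

```python
import nmslib          # assumes nmslib is installed (pip install nmslib)
import numpy as np

# Stand-in data: 1000 document vectors of dimension 100 (e.g. the
# BM25-weighted fastText vectors built earlier).
data = np.random.rand(1000, 100).astype(np.float32)

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)

# Higher M / efConstruction -> denser, better-connected graph:
# slower to build, but higher recall at search time.
index.createIndex({'M': 32, 'efConstruction': 200}, print_progress=False)

# Higher efSearch -> more candidates examined per query:
# slower queries, but fewer good documents mistakenly discarded.
index.setQueryTimeParams({'efSearch': 200})

ids, dists = index.knnQuery(data[0], k=10)
```

If an exact query fails to return its own document, raising efSearch (and rebuilding with larger M/efConstruction) is the first thing to try; at the extreme, a large enough efSearch approaches exact brute-force quality at brute-force speed.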
