Semantic Search

We are trying to do semantic search over our own domain-specific data (for example, sentences talking about automobiles).

Our data is just a collection of sentences. Given a phrase, we want to get back the sentences that:

  1. Are similar to that phrase
  2. Contain a part that is similar to the phrase
  3. Have a contextually similar meaning

Here is an example. If I search for the phrase "Buying Experience", I should get sentences like:

I never thought car buying could take less than 30 minutes to sign and buy.

I found a car that i liked and the purchase process was straightforward and easy

I absolutely hated going car shopping, but today i’m glad i did

I want to emphasize that we are looking for contextual similarity and not just a brute-force word search.

Even if a sentence uses different words from the phrase, the search should still find it.

Things that we have already tried:

  1. Open Semantic Search (https://www.opensemanticsearch.org/): the problem we faced here was generating an ontology from the data we have, or, for that matter, finding an existing ontology for the domains we are interested in.

  2. Elasticsearch (BM25 + tf-idf vectors): it returned a few sentences, but precision was not great and accuracy was poor as well. Evaluated against a human-curated dataset, it retrieved only around 10% of the sentences.

  3. We tried different embeddings, like the ones mentioned in https://github.com/UKPLab/sentence-transformers, and also went through the example https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py. Evaluated against our human-curated set, that also had very low accuracy.

  4. ELMo (https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604): this was better, but accuracy was still lower than we expected, and there is a cognitive load in deciding the cosine value below which we shouldn't consider a sentence. This also applies to point 3.
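
The cosine cut-off mentioned in points 3 and 4 does not have to be chosen by eye. Since a human-curated set is already available, one option is to sweep candidate thresholds and keep the one that scores best on that set. Below is a minimal sketch of that idea, not a definitive recipe. The scores and labels are made-up illustration data, and using F1 as the selection criterion is an assumption; precision at a fixed recall may suit the use case better.

```python
def precision_recall_f1(scores, labels, threshold):
    """Treat sentences with cosine >= threshold as retrieved and score the result."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores, labels):
    """Try every observed score as the cut-off and keep the one with the highest F1."""
    best_f1, best_t = 0.0, None
    for t in sorted(set(scores)):
        _, _, f1 = precision_recall_f1(scores, labels, t)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1


# Hypothetical cosine scores for one query, paired with human relevance labels.
scores = [0.82, 0.74, 0.61, 0.55, 0.40, 0.33]
labels = [True, True, False, True, False, False]

threshold, f1 = best_threshold(scores, labels)
print(f"chosen cut-off: {threshold}, F1 at that cut-off: {f1:.3f}")
```

In practice the sweep would run over scores pooled from many queries, with a held-out part of the curated set kept aside to check that the chosen cut-off generalizes.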

Any help will be appreciated. Thanks a lot in advance.

Topic semantic-similarity similar-documents unsupervised-learning word-embeddings similarity

Category Data Science


Regarding the first requirement, "Similar to that phrase":

You can try Phrase-BERT for phrase embeddings.

The Phrase-BERT paper also mentions related previous work, e.g. Sentence-BERT and SpanBERT.
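
To make the suggestion concrete, here is a rough sketch of ranking sentences against a phrase by cosine similarity. Several things in it are assumptions rather than facts from this thread: the model identifier `whaleloops/phrase-bert`, loading it through the `sentence-transformers` package, and the helper names `rank_sentences`, `load_phrase_bert_encoder` and `toy_encode`. The model import is deferred so the ranking logic can be tried without downloading anything; `toy_encode` is only a hand-made stand-in and says nothing about how Phrase-BERT itself would score these sentences.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_sentences(query, sentences, encode, top_k=3):
    """Return the top_k (score, sentence) pairs by cosine similarity to the query."""
    vectors = encode([query] + sentences)
    query_vec, sentence_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), s) for vec, s in zip(sentence_vecs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def load_phrase_bert_encoder(model_name="whaleloops/phrase-bert"):
    """Assumed setup: requires the sentence-transformers package and a model download."""
    from sentence_transformers import SentenceTransformer  # deferred import
    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()


def toy_encode(texts):
    """Hypothetical stand-in for a real model: two hand-made features per text."""
    buying_words = {"buying", "purchase", "shopping"}
    service_words = {"service", "repair"}
    vectors = []
    for text in texts:
        words = set(text.lower().replace(".", "").split())
        vectors.append([len(words & buying_words), len(words & service_words)])
    return vectors


sentences = [
    "The purchase process was straightforward and easy",
    "The repair service took two weeks",
]
results = rank_sentences("Buying Experience", sentences, toy_encode, top_k=1)
print(results)
```

To try the real model, pass `load_phrase_bert_encoder()` in place of `toy_encode`. Because Phrase-BERT is trained on phrases, it may also be worth encoding shorter spans of each sentence and keeping the best-scoring span, which lines up with the second requirement in the question.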
