Semantic Search

Question

Farhaan Bukhsh

May 1, 2022 17:03

We are trying to build semantic search over our own data set, i.e., domain-specific data (for example, sentences about automobiles).

Our data is just a collection of sentences. Given a phrase, we want to get back the sentences which are:

Similar to that phrase
Similar to the phrase in part (some span of the sentence matches it)
Contextually similar in meaning

For example, if I search for the phrase "Buying Experience", I should get sentences like:

I never thought car buying could take less than 30 minutes to sign and buy.

I found a car that i liked and the purchase process was straightforward and easy

I absolutely hated going car shopping, but today i’m glad i did

I want to emphasize that we are looking for contextual similarity, not just a brute-force word search.

A sentence should still be found even if it uses different words from the phrase.

Things that we have already tried:

1. Open Semantic Search (https://www.opensemanticsearch.org/): the problem we faced here was generating an ontology from our data, or for that matter finding an existing ontology for the domains we are interested in.
2. Elasticsearch (BM25 + tf-idf vectors): this returned a few sentences, but precision was poor and so was accuracy. Evaluated against a human-curated dataset, it retrieved only around 10% of the sentences.
3. Different embeddings, such as the ones mentioned in https://github.com/UKPLab/sentence-transformers. We also went through the example https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py and evaluated against our human-curated set; accuracy was again very low.
4. ELMo (https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604): this was better, but accuracy was still lower than we expected, and there is a cognitive load in deciding the cosine similarity value below which sentences should be discarded. The same applies to point 3.
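The cutoff problem described above can be made less of a judgment call by letting the human-curated set pick the cosine threshold. The following is a minimal sketch, not a definitive procedure: it assumes one similarity score per sentence for a given query and a set of indices the curators marked as relevant. The names `precision_recall` and `best_threshold` are hypothetical helpers, and maximizing F1 is just one reasonable selection criterion.

```python
from typing import Sequence, Set, Tuple


def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Precision and recall of retrieved sentence indices against curated ones."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def best_threshold(scores: Sequence[float], relevant: Set[int]) -> Tuple[float, float]:
    """Try every observed score as a cutoff; return (threshold, F1) with the highest F1.

    scores[i] is the similarity between the query and sentence i (assumed input).
    """
    best = (0.0, -1.0)
    for threshold in sorted(set(scores)):
        retrieved = {i for i, s in enumerate(scores) if s >= threshold}
        p, r = precision_recall(retrieved, relevant)
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Illustrative numbers only, not measurements from the question.
scores = [0.9, 0.8, 0.4, 0.3, 0.75]
relevant = {0, 1, 4}
print(best_threshold(scores, relevant))
```

A threshold chosen this way is only as representative as the curated queries behind it, so it is worth checking on queries held out from the selection step. Reporting precision and recall separately also shows whether a low score comes from missed sentences or from noisy ones.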

Any help would be appreciated. Thanks a lot in advance.

Topic semantic-similarity similar-documents unsupervised-learning word-embeddings similarity

Category Data Science

Franck Dernoncourt · Accepted Answer · November 18, 2021 23:46

Regarding "Similar to that phrase":

You can try Phrase-BERT for phrase embeddings.

Paper: Wang, Shufan, Laure Thompson, and Mohit Iyyer. "Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration." EMNLP 2021.
Code.

The paper also mentions related previous work, e.g. SentBERT and SpanBERT.
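One way phrase embeddings could be applied to the "part of the sentence is similar" requirement is to embed every short word span of a sentence and score the sentence by its best-matching span. The sketch below illustrates that idea under stated assumptions and is not taken from the Phrase-BERT paper. `toy_embed` is a hypothetical stand-in (hashed bag-of-words) so that the code runs without downloading a model; it only matches shared words, so contextual matching requires swapping in a real phrase encoder. `DIM`, `word_spans`, and `rank_by_best_span` are likewise illustrative names.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

DIM = 256  # size of the toy vectors; a real encoder fixes its own dimension


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in encoder: hashed bag-of-words counts."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_spans(sentence: str, max_len: int) -> List[str]:
    """All contiguous word spans of length 1..max_len."""
    words = sentence.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def rank_by_best_span(
    query: str,
    sentences: Sequence[str],
    embed: Callable[[str], Sequence[float]] = toy_embed,
    max_len: int = 4,
) -> List[Tuple[float, str]]:
    """Score each sentence by its most query-like span, highest score first."""
    q = embed(query)
    scored = [
        (max((cosine(q, embed(span)) for span in word_spans(s, max_len)), default=0.0), s)
        for s in sentences
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentences = ["The buying experience was great", "I hated the weather today"]
for score, sentence in rank_by_best_span("Buying Experience", sentences):
    print(round(score, 3), sentence)
```

With the sentence-transformers package installed, `embed` could be something like `lambda text: model.encode(text).tolist()`, where `model` is a loaded `SentenceTransformer`; which checkpoint to load (Phrase-BERT or another) is left open here. Embedding spans one at a time is slow on a large corpus, so batching the spans into a single `encode` call would be the natural next step.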
