How can we perform STS (Semantic Textual Similarity) on an unsupervised dataset using deep learning?

How do you implement STS (Semantic Textual Similarity) on an unlabelled dataset? The dataset columns are Unique_id, text1 (a paragraph), and text2 (a paragraph).

Ex: Column representation: Unique_id | Text1 | Text2

Unique_id 0

Text1 public show for Reynolds suspension of his coaching licence. portrait Sir Joshua Reynolds portrait of omai will get a public airing following fears it would stay hidden because of an export wrangle.

Text2 then requested to do so by Spain's anti-violence commission. The fine was far less than the expected amount of about £22 000 or even the suspension of his coaching license.

Unique_id 1

Text1 Groening. Gervais has already begun writing the script but is keeping its subject matter a closely guarded secret. he will also write a part for himself in the episode. I've got the rough idea but this is the most intimidating project of my career.

Text2 Philadelphia said they found insufficient evidence to support the woman's allegations regarding an alleged incident in January 2004. The woman reported the allegations to Canadian authorities last month. Cosby's lawyer Walter M. Phillips Jr. said the comedian was pleased with the decision.

In the above problem, I have to compare the two paragraphs of text (Text1 and Text2) and measure the semantic similarity between them. If they are semantically similar the output should be '1', otherwise '0'.

Any reference implementation links or suggestions are welcome!

Thanks in advance!

Topic unsupervised-learning deep-learning nlp similarity

Category Data Science


Try the Google Universal Sentence Encoder (USE). Check out the Colab notebook for USE and just replace the example queries there with your two texts; it will give you a similarity score between any two sentences. It is one of the best options for calculating semantic textual similarity.
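As a minimal sketch of this approach: assuming some model that maps each text to a fixed-length vector (e.g. USE loaded via `tensorflow_hub`, shown only in comments here), the similarity score is typically the cosine between the two embeddings, and the 0.5 threshold for the 1/0 label is an assumption you would tune on your data:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sts_label(emb1, emb2, threshold=0.5):
    """Map the similarity score to the 1/0 label the question asks for.
    The 0.5 cutoff is a placeholder assumption, not a recommended value."""
    return 1 if cosine_similarity(emb1, emb2) >= threshold else 0

# With USE, the embeddings would come from something like:
#   import tensorflow_hub as hub
#   embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
#   emb1, emb2 = embed([text1, text2]).numpy()
#   print(sts_label(emb1, emb2))
```

The cosine step is what the Colab example computes internally; only the threshold turns the continuous score into the binary output the question describes.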


You could take a pretrained embedder and compute distances between the embeddings. There is LASER from Facebook; an unofficial PyPI package replaces some of the internal tools used for tokenization and BPE encoding. I have used it extensively and it works just fine. It encodes your text as a 1024-element numerical vector, and you can then compute a distance metric between the embeddings, e.g. Euclidean.
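A rough sketch of that workflow follows. The distance computation is plain NumPy; the `laserembeddings` import in the comments is an assumption about which unofficial package is meant, so treat it as illustrative:

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors; smaller = more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

# Assumed usage with the unofficial LASER wrapper (each text -> 1024-dim vector):
#   from laserembeddings import Laser   # package name is an assumption
#   laser = Laser()
#   emb1, emb2 = laser.embed_sentences([text1, text2], lang='en')
#   print(euclidean_distance(emb1, emb2))
```

Unlike cosine similarity, Euclidean distance has no fixed upper bound, so you would pick a threshold for the 1/0 decision empirically on your own data.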
