Document ranking on a web-scraped dataset without any labelled data

I want to create a document ranking model that returns similar rows in the dataset for a sample query. The text in this corpus is standard English but has no labels (i.e. no query/relevant-document structure). Is it possible to take a model pretrained on a large corpus (like BERT or word2vec), apply it directly to the scraped dataset without any evaluation, and get decent results? If not, is training a model on the MS MARCO dataset and applying it to this corpus worth exploring?

Topic bert similar-documents text-mining nlp information-retrieval

Category Data Science


It depends on the type of ranking you want to achieve. For example, if the unlabeled scraped data can be ranked by sentiment, you can use a transfer learning model to give each document a sentiment score. If you return the predicted probability instead of a hard "positive" or "negative" tag, that score can serve as the rank.
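As a rough illustration of ranking by a score rather than a hard tag, here is a minimal sketch. The `positive_probability` function and its word lists are hypothetical placeholders standing in for whatever pretrained sentiment model you would actually use; only `rank_by_score` reflects the idea described above.

```python
import math
import re

# Hypothetical stand-in for a pretrained sentiment classifier.
# A real model would replace this toy lexicon; the word lists are assumptions.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def positive_probability(text):
    """Return a pseudo-probability in (0, 1) that the text is positive."""
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1.0 / (1.0 + math.exp(-raw))  # logistic squashing of the raw count


def rank_by_score(documents, score_fn):
    """Sort documents by a continuous score, highest first."""
    scored = [(score_fn(doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "The service was terrible and the food was bad",
    "An excellent product and I love it",
    "It arrived on Tuesday",
]

for score, doc in rank_by_score(docs, positive_probability):
    print(f"{score:.3f}  {doc}")
```

Because the score is continuous, documents with no clear sentiment land in the middle of the ranking instead of being forced into one of two tags.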

Transfer learning models usually give good results, but how well they work depends on your criteria for ranking the documents. You should also pay attention to the quality of the scraped data, since it heavily affects the results of a pretrained model.
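If the criterion is similarity to a query, as in the question, the usual pattern is to embed the query and every document and then sort by cosine similarity. The sketch below assumes that pattern; its `embed` function is a hypothetical bag-of-words placeholder, not a pretrained model, so that the example runs without any downloads. With a real pretrained encoder, only `embed` (and the vector type it returns) would need to change.

```python
import math
import re
from collections import Counter


def embed(text):
    """Placeholder embedding: sparse word counts (an assumption for this sketch)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, corpus, top_k=3):
    """Return up to top_k (document, similarity) pairs, most similar first."""
    q = embed(query)
    scored = [(doc, cosine(q, embed(doc))) for doc in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


corpus = [
    "how to train a neural network",
    "best pasta recipes for dinner",
    "neural network training tips",
]

for doc, score in rank_similar("train neural network", corpus):
    print(f"{score:.3f}  {doc}")
```

Even without labels, it is worth reading the top results for a handful of queries by hand, since noisy scraped text can make the scores misleading.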

Since you mentioned the MS MARCO dataset, I'm assuming your documents may be related to question answering. In that case, you should also take a look at the Stanford Question Answering Dataset (SQuAD).
