Document ranking on a web-scraped dataset without any labelled data
I want to build a document ranking model that returns the rows in the dataset most similar to a sample query. The text in this corpus is standard English, but it has no labels (i.e. no query/relevant-document pairs). Is it possible to take a model pretrained on a large corpus (such as BERT or word2vec), apply it directly to the scraped dataset without any evaluation, and still get decent results? If not, is it worth exploring training a model on the MS MARCO dataset and applying it to this corpus?
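To make the first option concrete, here is a minimal sketch of the label-free pipeline I have in mind: embed every row, embed the query, and sort rows by cosine similarity. The names `embed`, `cosine`, and `rank`, and the example rows, are all made up for illustration. In particular, `embed` is only a bag-of-words stand-in so the snippet runs on its own; it is not BERT or word2vec, and a pretrained encoder would be swapped in at that point.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def embed(text):
    # ASSUMPTION: placeholder for a pretrained encoder (BERT, word2vec, ...).
    # Here it is just a sparse bag-of-words count vector.
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, rows, top_k=3):
    # Score every row against the query and return (row_index, score)
    # pairs, best first. No labels are used anywhere.
    q = embed(query)
    scored = [(cosine(q, embed(r)), i) for i, r in enumerate(rows)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:top_k]]


# Hypothetical rows standing in for the scraped dataset.
rows = [
    "How to train a neural network for image classification",
    "Best pasta recipes for a quick dinner",
    "Fine-tuning neural network models for text ranking",
]

results = rank("neural network ranking", rows, top_k=2)
for index, score in results:
    print(index, round(score, 3), rows[index])
```

With a real encoder only `embed` (and possibly `cosine`, for dense vectors) would change; the open question is whether the resulting ranking is good enough without any labelled pairs to check it against.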
Topic bert similar-documents text-mining nlp information-retrieval
Category Data Science