Learning to Rank with an Unlabelled Dataset

I have a folder of about 60k PDF documents that I would like to rank against queries so that the most relevant results are surfaced first, very much like a search engine. I understand that Learning to Rank is a supervised approach that requires features generated from query-document pairs. The problem, however, is that none of the documents are labelled. How many queries would I need to even begin training the model?

Tags: learning-to-rank, search-engine, xgboost, ranking, nlp

There are different ways to look at this:

  • You can apply a totally unsupervised method, such as computing a TF-IDF vector for the query and ranking documents by its similarity (e.g. cosine) to each document vector (see the first sketch after this list). This requires no training at all, but without labels you cannot even evaluate how well it works.
  • You can use an off-the-shelf search engine such as Elasticsearch, which handles indexing and BM25 ranking out of the box.
  • You can train a supervised ranking model with any number of samples, but it will obviously work much better with a large one (see the XGBoost sketch after this list). The first difficulty is to generate a sample of queries that is as representative as possible. The second is to find a way to select the top document(s) for every query: done manually, the annotator would have to sift through the 60k documents for each query (ouch!). That is without even getting into the subjectivity and potential ambiguity of a query.
  • You could try some form of semi-supervised or active learning, for instance progressively refining the model with user feedback (e.g. clicks on returned results), if that works for your use case.
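
To make the first option concrete, here is a minimal sketch with scikit-learn, assuming the text of the PDFs has already been extracted into plain strings; the `documents` list and the `rank` helper are illustrative placeholders, not part of any existing pipeline.

```python
# Minimal unsupervised ranking sketch (assumption: PDF text already extracted).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# In practice this list would hold the extracted text of the 60k PDFs.
documents = [
    "learning to rank with gradient boosted trees",
    "an introduction to information retrieval",
    "deep neural networks for image classification",
]

# Fit TF-IDF on the document collection only; queries are transformed later.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)

def rank(query, top_k=10):
    """Return (document index, score) pairs sorted by cosine similarity."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    order = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]

print(rank("learning to rank", top_k=3))
```

The scores are only as good as the term overlap between queries and documents, which is exactly why you cannot measure quality without at least a small set of judged queries.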

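For the supervised route, and since the question is tagged xgboost, this is roughly what the training data has to look like once you do obtain relevance judgements; the feature values and labels below are random placeholders purely to show the expected layout (query-document feature rows, graded labels, and per-query group sizes).

```python
# Supervised learning-to-rank sketch with XGBoost (labels are placeholders).
import numpy as np
from xgboost import XGBRanker

rng = np.random.default_rng(0)

# Each row is one query-document pair described by features such as a BM25
# score, TF-IDF cosine similarity, or document length (random numbers here).
X = rng.random((8, 3))

# Graded relevance judgements: 0 = irrelevant, 1 = partial, 2 = relevant.
y = np.array([2, 1, 0, 0, 2, 0, 1, 0])

# `group` lists how many candidate documents were judged for each query:
# the first 4 rows belong to query 1, the next 4 rows to query 2.
group = [4, 4]

model = XGBRanker(objective="rank:pairwise", n_estimators=50)
model.fit(X, y, group=group)

# At query time, compute the same features for the candidate documents and
# sort them by the predicted score.
candidates = rng.random((4, 3))
scores = model.predict(candidates)
print(scores.argsort()[::-1])
```

Each group corresponds to one judged query, so the practical answer to "how many queries do I need" is: as many groups as you can realistically annotate, since the pairwise objective only learns from documents judged under the same query.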