Learning to Rank with an Unlabelled Dataset

I have a folder of about 60k PDF documents that I would like to rank against queries so that the most relevant results are surfaced first, very much like a search engine. I understand that Learning to Rank is a supervised approach that requires features generated from query-document pairs. The problem, however, is that none of the documents are labelled. How many queries would I need to even begin training the model?

Tags: learning-to-rank, search-engine, xgboost, ranking, nlp

There are different ways to look at this:

  • You can apply a totally unsupervised method, such as computing a TF-IDF vector for the query and ranking documents by its similarity (e.g. cosine) to each document vector (see the first sketch after this list). This requires no training at all, but without labels you cannot even evaluate how well it works.
  • You can use an off-the-shelf search engine such as Elasticsearch, which handles indexing and BM25 ranking out of the box.
  • You can train a supervised ranking model with any number of samples, but it will obviously work much better with a large one (see the XGBoost sketch after this list). The first difficulty is to generate a sample of queries that is as representative as possible. The second is to find a way to select the top document(s) for every query: done manually, the annotator would have to sift through the 60k documents for each query (ouch!). That is without even getting into the subjectivity and potential ambiguity of a query.
  • You could try some form of semi-supervised or active learning, for instance progressively refining the model with user feedback (e.g. clicks on returned results), if that works for your use case.
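
To make the first option concrete, here is a minimal sketch with scikit-learn, assuming the text of the PDFs has already been extracted into plain strings; the `documents` list and the `rank` helper are illustrative placeholders, not part of any existing pipeline.

```python
# Minimal unsupervised ranking sketch (assumption: PDF text already extracted).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# In practice this list would hold the extracted text of the 60k PDFs.
documents = [
    "learning to rank with gradient boosted trees",
    "an introduction to information retrieval",
    "deep neural networks for image classification",
]

# Fit TF-IDF on the document collection only; queries are transformed later.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)

def rank(query, top_k=10):
    """Return (document index, score) pairs sorted by cosine similarity."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    order = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]

print(rank("learning to rank", top_k=3))
```

The scores are only as good as the term overlap between queries and documents, which is exactly why you cannot measure quality without at least a small set of judged queries.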

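For the supervised route, and since the question is tagged xgboost, this is roughly what the training data has to look like once you do obtain relevance judgements; the feature values and labels below are random placeholders purely to show the expected layout (query-document feature rows, graded labels, and per-query group sizes).

```python
# Supervised learning-to-rank sketch with XGBoost (labels are placeholders).
import numpy as np
from xgboost import XGBRanker

rng = np.random.default_rng(0)

# Each row is one query-document pair described by features such as a BM25
# score, TF-IDF cosine similarity, or document length (random numbers here).
X = rng.random((8, 3))

# Graded relevance judgements: 0 = irrelevant, 1 = partial, 2 = relevant.
y = np.array([2, 1, 0, 0, 2, 0, 1, 0])

# `group` lists how many candidate documents were judged for each query:
# the first 4 rows belong to query 1, the next 4 rows to query 2.
group = [4, 4]

model = XGBRanker(objective="rank:pairwise", n_estimators=50)
model.fit(X, y, group=group)

# At query time, compute the same features for the candidate documents and
# sort them by the predicted score.
candidates = rng.random((4, 3))
scores = model.predict(candidates)
print(scores.argsort()[::-1])
```

Each group corresponds to one judged query, so the practical answer to "how many queries do I need" is: as many groups as you can realistically annotate, since the pairwise objective only learns from documents judged under the same query.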