How to determine the "total number of relevant documents" when calculating Recall (in Precision and Recall) if it's not known? Can it be estimated?

On Wikipedia there is a practical example of calculating Precision and Recall:

When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3, which tells us how valid the results are, while its recall is 20/60 = 1/3, which tells us how complete the results are.

I absolutely don't understand how one can use Precision and Recall in a real-life scenario if the total number of relevant documents is needed.

For example, in my scenario, I have a set of about 9000 collected documents and I am building a recommender system with several algorithms (tf-idf, Doc2Vec, LDA, ...). It has to recommend the top 20 most similar articles based on one selected article. Since I am not going to manually count the number of relevant articles among 9000 documents for every recommender query, what is a reasonable way to estimate the total number of relevant articles so that I can calculate Recall and then proceed to calculate Average Precision?
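For context, the tf-idf variant of the recommender is essentially this kind of thing (a simplified sketch using scikit-learn, not my exact code):

```python
# Minimal sketch of the tf-idf variant: recommend the k most similar
# articles to a selected article by cosine similarity over tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(documents, query_index, k=20):
    """documents: list of raw article texts; query_index: the selected article."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
    sims = cosine_similarity(tfidf[query_index], tfidf).ravel()
    sims[query_index] = -1.0            # exclude the query article itself
    return sims.argsort()[::-1][:k]     # indices of the top-k most similar articles
```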

The only information I found about this problem is in these lecture notes, where they suggest creating a pool of relevant records:

There are several ways of creating a pool of relevant records: one method is to use all the relevant records found from different searches, another is to manually scan several journals to identify a set of relevant papers.

But I'm trying to find more information on this pooling method elsewhere.
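As far as I understand it, pooling means taking the union of the top results returned by several different systems for the same query, judging only those pooled documents, and treating the judged-relevant ones as the set of relevant documents. A rough sketch of that idea, assuming each of my recommenders can be called as a function returning document ids (the names here are placeholders):

```python
# Sketch of pooling: union the top-k results of several recommenders for the
# same query article, judge only that pool, and use the judged-relevant
# subset as the "relevant documents" set when computing recall.

def build_pool(query_id, recommenders, k=20):
    """recommenders: dict of name -> function(query_id, k) returning doc ids."""
    pool = set()
    for name, recommend in recommenders.items():
        pool.update(recommend(query_id, k))   # top-k from each system
    return pool

def judged_relevant(pool, judge):
    """judge: function(doc_id) -> True/False, e.g. a manual assessment."""
    return {doc_id for doc_id in pool if judge(doc_id)}
```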

Common sense tells me that this could be a valid approach: take, say, 50 random documents, manually count the number of relevant documents in that random sample, and estimate the total number of relevant documents from that. Can this be a valid approach? I imagine I could do this for a few recommendation results (although it would be a bit time-consuming) or have some test users do the judging.
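Roughly, the estimate I have in mind would be something like the following sketch (the judge function stands for a manual relevance decision and is hypothetical):

```python
import random

def estimate_total_relevant(all_doc_ids, judge, sample_size=50, seed=0):
    """Estimate the number of relevant documents in the whole collection
    from a hand-judged random sample.
    judge: function(doc_id) -> True/False (manual relevance decision)."""
    random.seed(seed)
    sample = random.sample(all_doc_ids, sample_size)
    relevant_in_sample = sum(judge(doc_id) for doc_id in sample)
    fraction_relevant = relevant_in_sample / sample_size
    return round(fraction_relevant * len(all_doc_ids))
```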

Topics: learning-to-rank, ranking, evaluation, information-retrieval, recommender-system

Category: Data Science


I think the answer to my question is the "at k" ("@k") variants of the above-mentioned metrics: precision@k, recall@k, etc. I need to set the threshold to, say, the top 20 (k = 20) recommendations and then evaluate precision and recall on those (by hand myself, or by test users who decide whether each recommendation is relevant or irrelevant). I found good practical examples at queirozf.com for anyone interested in the same problem.

For example:

Recall@8 = true_positives@8 / (true_positives@8 + false_negatives@8)
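A minimal sketch of how these can be computed, assuming I have the ordered list of recommended article ids and a hand-judged set of relevant ids for that query (the ids below are made up):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are judged relevant."""
    top_k = recommended[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all judged-relevant documents that appear in the top k."""
    top_k = recommended[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

# Example: 8 recommendations, 5 judged-relevant documents overall
recommended = ["d3", "d7", "d1", "d9", "d4", "d8", "d2", "d6"]
relevant = {"d1", "d3", "d4", "d5", "d10"}
print(precision_at_k(recommended, relevant, 8))  # 3 hits / 8 = 0.375
print(recall_at_k(recommended, relevant, 8))     # 3 hits / 5 = 0.6
```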
