How to estimate missing values when calculating NDCG

I would like to compare recommendations methods using NDCG metric on MovieLens dataset.

In ranking problem, the goal is to rank items based on their relevance for user. Ranking models can be learned based on ratings matrix, where each user rates small subset of all items. Ratings for other items are unknown.

Collaborative Filtering methods can be used to create model which generalize training datasets and predict ratings for unrated items.

Let's consider following example on dataset consisted of 5 movies. User A rated only 3 movies:

  • movie 1 - 5 stars
  • movie 3 - 3 stars
  • movie 4 - 2 stars

Model predict following results

  • movie 1 - 5 stars
  • movie 2 - 4 stars
  • movie 3 - 3 stars
  • movie 4 - 2 stars
  • movie 5 - 1 stars

How NDCG@3 should be calculated in this example ? Movie 2 get second best score but it has not been rated by user although it's highly relevant for user A based on other user ratings. Giving movie 2 1 star rating as ground true penalized model because it predicted highly relevant movie which was not rated by user.

Many papers measure model performance on MovieLens using NDCG, but I have not found details how NDCG is calculated. What is the best practice for solving this problem? Is it good idea to estimate unknown rating value based on movie ratings median or average ?

Topic movielens ndcg evaluation recommender-system

Category Data Science


Movie 2 get second best score but it has not been rated by user although it's highly relevant for user A based on other user ratings.

The issue with this is that you're injecting your opinion into what items are relevant to user A.

How NDCG@3 should be calculated in this example ?

You cannot evaluate your model on items the user has not rated. Generally with explicit feedback data you want to ignore missing values. Imputation may be possible, but it is more challenging and not commonly used. The correct modeling pipeline is to first make a training & testing set. Train your model on the training set and then for each user rank their known items that are present in the testing set (source: p.g. 24). Then you can compute various cutoff-sensitive metrics.

In your example, you should not be predicting ratings for movie 2 or 5. If your training set was movie 1 & movie 3 then you would make predictions for movie 4 -- obviously in a real application you should have more data so that there is more than one example in your test set.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.