Which metrics for evaluating a recommender system with implicit data?

I am currently building a recommender system. It uses a neural network and then searches for the nearest neighbors to produce recommendations for a user. The data is implicit: I only know which products a user has bought. On the basis of this data, I create the recommendations.

  • What are the best metrics to evaluate this recommender system with implicit data?

  • Can I evaluate the model and then the search algorithm? If so, are the metrics different? Which metrics do I have to use for what?

Offline evaluation is very tricky due to all kinds of bias. The most prominent type is position bias. I recommend the following paper (https://arxiv.org/pdf/1608.04468.pdf), which contains metrics I have used myself for monitoring and developing recommenders for a large sports fashion company. The idea is to apply a counterfactual approach and debias your estimators by weighting with the inverse propensity of the documents.

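For example, take the sum of the ranks of the relevant results, written here in the notation of the linked paper for a ranking $\mathbf{y}$ of query $x_i$ with relevance $r_i$:

$$\Delta(\mathbf{y} \mid x_i, r_i) = \sum_{y \in \mathbf{y}} \operatorname{rank}(y \mid \mathbf{y}) \cdot r_i(y)$$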

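Because you are using implicit feedback for the relevance r_i, items y that were ranked higher for some query x are more likely to have been observed by the user. There is therefore an implicit probability of observing an item given its position (its propensity). This probability can be used to reweight the metric as follows, summing only over the items whose relevance was actually observed ($o_i(y) = 1$):

$$\hat{\Delta}_{IPS}(\mathbf{y} \mid x_i, \bar{\mathbf{y}}_i, o_i) = \sum_{y \,:\, o_i(y) = 1} \frac{\operatorname{rank}(y \mid \mathbf{y}) \cdot r_i(y)}{Q(o_i(y) = 1 \mid x_i, \bar{\mathbf{y}}_i, r_i)}$$

Here $\bar{\mathbf{y}}_i$ is the ranking that was shown when the feedback was logged, and $Q$ is the propensity, i.e. the probability that item $y$ was observed in that logged ranking.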

It can be shown that this estimator is unbiased. The same trick can be used for any metric you wish (NDCG, MAP, etc.). You can also apply it to the counterfactual estimation of key performance indicators such as expected conversion, click-through rate, add-to-cart rate, etc.
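The inverse-propensity reweighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, and the propensity model (observation probability proportional to `(1 / rank) ** eta` in the logged ranking) is an assumed position-based model, not something prescribed by the answer. In practice the propensities would have to be estimated from your own logs.

```python
def position_propensity(logged_rank, eta=1.0):
    """Assumed position-based model: P(item observed) = (1 / rank) ** eta."""
    return (1.0 / logged_rank) ** eta


def ips_rank_sum(new_ranking, logged_ranking, purchased, eta=1.0):
    """IPS-weighted sum of ranks of the purchased items.

    new_ranking    -- ranking produced by the system being evaluated
    logged_ranking -- ranking that was shown when the purchases were logged
    purchased      -- items with observed positive feedback (all must
                      appear in logged_ranking)
    """
    logged_rank = {item: pos for pos, item in enumerate(logged_ranking, start=1)}
    total = 0.0
    for new_rank, item in enumerate(new_ranking, start=1):
        if item in purchased:
            # Weight each observed item by the inverse of its propensity
            # in the logged ranking.
            total += new_rank / position_propensity(logged_rank[item], eta)
    return total


logged = ["a", "b", "c", "d"]
purchased = {"a", "c"}
candidate = ["c", "a", "b", "d"]
estimate = ips_rank_sum(candidate, logged, purchased)
print(estimate)
```

Item "c" was shown at position 3 in the logged ranking, so it was less likely to be seen, and its contribution is scaled up by a factor of about 3. With small propensities these factors become large, which is where the variance comes from.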

Unfortunately, one issue with such estimators is that they are known to have a large variance due to the weighting factor. I recommend this paper for an explanation (https://arxiv.org/pdf/1801.07030.pdf).

There are two major types of evaluation - online and offline.

Online evaluation means showing the model's predictions to actual users. Since the goal of a recommender system is to sell more products, the best overall metric is the increase in sales to actual users. This is best measured by putting the model in production and running an A/B test to check whether the model increases sales. This approach is not always possible given limited resources (time or access to a production system).
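As a rough illustration of how the outcome of such an A/B test might be checked, here is a minimal sketch of a two-proportion z-test on conversion rates, using only the standard library. The function name and the example counts are hypothetical, and the z-test is just one common choice; it assumes large, independent samples of users in each arm.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and new model (B).

    Returns the z statistic and the two-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 users bought in A, 150 of 1000 in B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(z, p_value)
```

A positive z with a small p-value would suggest that the new model converts better than the control, subject to the usual caveats about test duration and multiple comparisons.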

Offline evaluation means simulating online evaluation by holding out existing data to evaluate the model.

If possible, split the data based on time. Train the model on the earlier data and test it on the later data. For a given pairing of product and user, the model predicts buy or not buy, so it acts as a binary classifier and can be evaluated like any other binary classifier. Depending on the domain, either precision or recall may be more important.
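The time-based split and the precision/recall evaluation can be sketched as follows. The data layout (tuples of user, item, timestamp), the function names, and the top-k cut-off are assumptions made for this example; precision and recall at k are a common top-k variant of the binary classification metrics mentioned above, not the only way to compute them.

```python
def time_split(interactions, cutoff):
    """Split (user, item, timestamp) tuples into train (before cutoff)
    and test (at or after cutoff)."""
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test


def precision_recall_at_k(recommended, held_out, k):
    """Precision and recall of the top-k recommendations against the
    items the user actually bought in the test period."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in held_out)
    precision = hits / k if k else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall


# Hypothetical purchase log with integer timestamps.
interactions = [
    ("u1", "p1", 1), ("u1", "p2", 2), ("u1", "p3", 5),
    ("u1", "p4", 6), ("u1", "p5", 7),
]
train, test = time_split(interactions, cutoff=5)
held_out = {item for _, item, _ in test}

# Pretend the model, trained on `train`, recommends these items to u1.
recommended = ["p3", "p9", "p4"]
precision, recall = precision_recall_at_k(recommended, held_out, k=2)
print(precision, recall)
```

In a real evaluation the metrics would be averaged over all users with at least one held-out purchase.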

However, this is often not possible, because timestamps may not be tracked or product-user pairs may be sparse. If time is not tracked, the data can be split randomly to simulate a time split. If product-user pairs are sparse, they can be clustered along latent factors.

