Temporal train-test split for recommender systems
When evaluating a collaborative filtering recommender system, it is practical to split the data temporally. However, doing so can leave some users in only one of the train and test sets. Consider the following example:
user year
0 2020
0 2020
0 2021
1 2021
1 2021
1 2021
2 2020
2 2021
2 2021
If we decide to split by year such that ratings after 2020 will be in the test set, then:
Train
user year
0 2020
0 2020
2 2020
Test
user year
0 2021
1 2021
1 2021
1 2021
2 2021
2 2021
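The split above can be reproduced with a short sketch. It assumes the ratings are stored as (user, year) pairs and that 2020 is the cutoff year. The names `ratings`, `CUTOFF_YEAR`, and `temporal_split` are illustrative and do not come from any library.

```python
# Toy data from the example above: (user, year) pairs.
ratings = [
    (0, 2020), (0, 2020), (0, 2021),
    (1, 2021), (1, 2021), (1, 2021),
    (2, 2020), (2, 2021), (2, 2021),
]

CUTOFF_YEAR = 2020  # assumption: ratings after this year go to the test set


def temporal_split(rows, cutoff_year):
    """Split (user, year) rows into train (<= cutoff) and test (> cutoff)."""
    train = [r for r in rows if r[1] <= cutoff_year]
    test = [r for r in rows if r[1] > cutoff_year]
    return train, test


train, test = temporal_split(ratings, CUTOFF_YEAR)

train_users = {user for user, _ in train}
test_users = {user for user, _ in test}

# Users that appear in the test set but never in the train set.
unseen_users = test_users - train_users

# Share of test rows that belong to those unseen users.
unseen_rows = [r for r in test if r[0] in unseen_users]
unseen_fraction = len(unseen_rows) / len(test)

print("train:", train)
print("test:", test)
print("unseen users:", sorted(unseen_users))
print("fraction of test rows from unseen users:", unseen_fraction)
```

On this toy data, user 1 is the only user missing from the train set, and that user accounts for 3 of the 6 test rows.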
With this split, user 1 is not in the train set at all. With matrix factorization or other latent factor models, no latent factors are learned for user 1, so when we multiply the latent factor matrices U and V to reconstruct the predicted rating matrix, there is no row for user 1 and we cannot predict that user's ratings. The same applies to items, although that is not shown here.
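To make the problem concrete, here is a minimal sketch of how a factorization model might index its users. It assumes the model keeps one latent-factor row per user seen during training. The names `user_index`, `U`, `N_FACTORS`, and `user_factors` are hypothetical, and the factor values are placeholders rather than learned parameters.

```python
N_FACTORS = 2  # assumption: latent dimension chosen only for illustration

train_users = [0, 0, 2]          # user column of the train set above
test_users = [0, 1, 1, 1, 2, 2]  # user column of the test set above

# Map each training user to a row of the user-factor matrix U.
user_index = {user: row for row, user in enumerate(sorted(set(train_users)))}

# Placeholder factors: one row per *training* user only.
U = [[0.0] * N_FACTORS for _ in user_index]


def user_factors(user):
    """Return the latent vector for a user, or None if the user was never trained."""
    row = user_index.get(user)
    return None if row is None else U[row]


for user in sorted(set(test_users)):
    if user_factors(user) is None:
        print(user, "no factors (unseen in train)")
    else:
        print(user, "has factors")
```

Because `U` has rows only for users 0 and 2, the lookup for user 1 returns `None`, which mirrors the missing row in the reconstructed rating matrix.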
How does one deal with this? Does one simply remove from the test set any users that are not in the train set? Wouldn't that waste a lot of data?
Topic matrix-factorisation training recommender-system machine-learning
Category Data Science