How to model a supervised recommender system with varying data
Suppose there are 2000 movies and a company wants to recommend some movies (for example, at most 5 movies) to each visitor. The objective is to learn how to predict which movie will be selected if a specific set of movies is recommended.
Sample  option-1  option-2  option-3  option-4   option-5   Selected-Movie
1       movie1    movie3    movie4    -          -          movie4
2       movie3    movie4    movie100  movie1000  movie1001  movie1001
3       movie4    movie5    movie34   -          -          movie34
Based on this data set, I want to learn that when the movies in sample 1 are suggested to a customer, they will select movie4. Because the number of features would be very high (here 2000 movies), I think one-hot encoding would not be a good option. Since at most 5 movies can be recommended, I thought it might be better to use a vector of size 5, and if fewer than 5 movies are recommended, to fill the blanks with 0.
However, with this representation the order of the movies matters. For example, (1,2,3,4,5) will be treated as different from (2,1,3,4,5), and I want both cases to be considered the same. In other words, all 5! permutations of the same set should be equivalent. Moreover, with this representation I think it will not be possible to use a Decision Tree, whereas some algorithms like CatBoost work.
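To make the size-5 representation concrete, here is a minimal sketch of what I have in mind. Sorting the IDs before padding is only one possible way to make every ordering of the same set map to the same vector; it is an assumption on my part, not a settled design. The name `encode_slate`, the integer movie IDs, and the use of 0 as padding (which assumes real IDs start at 1) are all hypothetical.

```python
from typing import Iterable, Tuple

MAX_OPTIONS = 5  # at most 5 movies are recommended at once
PAD_ID = 0       # marks an empty slot; assumes real movie IDs start at 1


def encode_slate(movie_ids: Iterable[int],
                 max_options: int = MAX_OPTIONS) -> Tuple[int, ...]:
    """Encode a recommended set as a fixed-length, order-independent tuple."""
    # Sorting gives one canonical order, so every permutation of the
    # same movies produces the same encoding.
    ids = sorted(set(movie_ids))
    if len(ids) > max_options:
        raise ValueError(f"at most {max_options} movies can be recommended")
    if ids and ids[0] <= PAD_ID:
        raise ValueError("movie IDs must be greater than the padding value")
    # Fill the remaining slots with the padding value.
    return tuple(ids + [PAD_ID] * (max_options - len(ids)))


# Sample 1 from the table: movie1, movie3, movie4 and two blanks.
sample_1 = encode_slate([1, 3, 4])

# Two orderings of the same five movies give the same vector.
same = encode_slate([1, 2, 3, 4, 5]) == encode_slate([2, 1, 3, 4, 5])
```

This only removes the dependence on order. The slots still hold arbitrary ID numbers, so a model that splits on numeric thresholds would be comparing IDs as if they were quantities.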
My preference is for algorithms that can generate rules, such as a Decision Tree. I would be thankful for any recommendations on how to represent the data and which features to use.