Dealing with missing data in SVD
I am a newbie to machine learning and I am trying to apply the SVD on the movielens dataset for movie recommendation. I have a movie-user matrix where the row is the user id, the column is the movie id and the value is the rating.
Now, I would like to perform normalization on the movie-user matrix (subtract the data by users ratings mean). Then pass the normalized matrix to Scipy.sparse svds as follow:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)
Now, I have the 2 method to do it:
Method 1.) Fill all the missing rating with 0 first, then calculate the user rating mean for normalization.
The predicted rating dataframe for method 1 by using svd is:
Method 2.) Calculate the user rating mean first and do the normalization, then replace the missing rating with 0.
The predicted rating dataframe for method 2 by using svd is:
I would like to know which method is better or there are other methods to do it. So far as I can observe from method 2, the predicted ratings for a user are quite similar. For example, user A may get 4.XX ratings for all movies. Meanwhile in method 1, there are more variation. I would like to know if there are something wrong.
Topic movielens missing-data machine-learning
Category Data Science