Dealing with missing data in SVD

I am new to machine learning and I am trying to apply SVD to the MovieLens dataset for movie recommendation. I have a user-movie matrix where each row is a user id, each column is a movie id, and each value is a rating.

Now, I would like to normalize the user-movie matrix by subtracting each user's mean rating from that user's ratings, then pass the normalized matrix to scipy.sparse.linalg.svds as follows:

from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)
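For context, here is a minimal sketch of how the output of svds is usually turned back into predicted ratings. It assumes R_demeaned is a dense NumPy array of user-centered ratings; the toy matrix R, the name user_means and the small k are illustrative assumptions, not part of the original setup.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy, fully observed ratings (made-up data, 8 users x 6 movies).
rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(8, 6)).astype(float)

# Center each user's row on that user's mean rating.
user_means = R.mean(axis=1, keepdims=True)
R_demeaned = R - user_means

# k must be smaller than min(R_demeaned.shape) for svds.
k = 3
U, sigma, Vt = svds(R_demeaned, k=k)

# sigma is a 1-D array of singular values (do not rely on their order),
# so it is expanded to a diagonal matrix for the reconstruction.
predicted = U @ np.diag(sigma) @ Vt + user_means
```

The user means are added back at the end, so the rank-k part only models each user's deviation from their own average.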

I can see two ways to do it:

Method 1.) Fill all the missing ratings with 0 first, then calculate each user's mean rating for normalization.

[Predicted rating dataframe for method 1 using SVD; output not included]

Method 2.) Calculate each user's mean rating first and normalize, then replace the missing ratings with 0.

[Predicted rating dataframe for method 2 using SVD; output not included]

I would like to know which method is better, or whether there are other ways to do it. From what I can see with method 2, the predicted ratings for a given user are all quite similar. For example, user A may get 4.XX for every movie. With method 1 there is more variation. Is something wrong?


I would recommend trying both methods to see which works better.
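One way to try both is to hide a fraction of the known ratings, fit on the rest, and compare the error on the hidden entries. The sketch below does this on synthetic data; the helper names predict and holdout_rmse, the choice of k, and the generated matrix are all assumptions made for illustration, and missing ratings are assumed to be stored as np.nan.

```python
import numpy as np
from scipy.sparse.linalg import svds

def predict(R_train, method, k):
    """Return a dense predicted matrix; R_train uses np.nan for missing ratings."""
    observed = ~np.isnan(R_train)
    zero_filled = np.where(observed, R_train, 0.0)
    if method == 1:
        # Method 1: fill with 0, then center on the mean over ALL movies.
        means = zero_filled.mean(axis=1, keepdims=True)
        centered = zero_filled - means
    else:
        # Method 2: center on the mean of observed ratings, then fill with 0.
        counts = observed.sum(axis=1, keepdims=True)
        sums = zero_filled.sum(axis=1, keepdims=True)
        global_mean = sums.sum() / max(counts.sum(), 1)
        # Users with no observed ratings fall back to the global mean.
        means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
        centered = np.where(observed, R_train - means, 0.0)
    U, sigma, Vt = svds(centered, k=k)
    return U @ np.diag(sigma) @ Vt + means

def holdout_rmse(R, method, k=2, frac=0.2, seed=0):
    """Hide a fraction of the observed ratings and report RMSE on them."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(~np.isnan(R))
    n_test = max(1, int(frac * len(obs)))
    test = obs[rng.choice(len(obs), size=n_test, replace=False)]
    rows, cols = test[:, 0], test[:, 1]
    R_train = R.copy()
    R_train[rows, cols] = np.nan
    pred = predict(R_train, method, k)
    return float(np.sqrt(np.mean((pred[rows, cols] - R[rows, cols]) ** 2)))

# Synthetic ratings with rank-2 structure, clipped to 1..5 (made-up data).
rng = np.random.default_rng(42)
true = 3 + rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
R = np.clip(np.round(true), 1, 5)
R[rng.random(R.shape) < 0.5] = np.nan  # hide roughly half of the entries

for method in (1, 2):
    print(f"method {method}: held-out RMSE = {holdout_rmse(R, method):.3f}")
```

On real MovieLens data the same comparison should be repeated over several seeds and values of k before drawing a conclusion.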

That said, in my opinion you should fill first and then normalise. If you normalise first, any observed rating close to the user's mean ends up very close to 0. When you then fill the missing values with 0, you are implicitly saying that every missing rating had the same original value as those ratings, namely the user's mean, which would be incorrect. Bear in mind that filling first has a cost of its own: the zeros are counted as ratings, so each user's mean is pulled toward 0.
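To make the difference between the two orderings concrete, here is a small sketch on a made-up 2 x 4 matrix, assuming missing ratings are stored as np.nan. The variable names are illustrative only.

```python
import numpy as np

# Toy user x movie matrix; np.nan marks a missing rating (made-up values).
R = np.array([
    [5.0, 4.0, np.nan, np.nan],
    [np.nan, 2.0, 1.0, np.nan],
])

# Method 1: fill with 0 first, then subtract the mean over all movies.
R_filled = np.nan_to_num(R, nan=0.0)
mean_1 = R_filled.mean(axis=1, keepdims=True)      # [[2.25], [0.75]]
R_demeaned_1 = R_filled - mean_1

# Method 2: subtract the mean of observed ratings, then fill with 0.
mean_2 = np.nanmean(R, axis=1, keepdims=True)      # [[4.5], [1.5]]
R_demeaned_2 = np.nan_to_num(R - mean_2, nan=0.0)

print(R_demeaned_1[0])  # [ 2.75  1.75 -2.25 -2.25]
print(R_demeaned_2[0])  # [ 0.5 -0.5  0.   0. ]
```

In method 1 the missing entries become large negative values, so they behave like very low ratings. In method 2 they sit exactly at the user's mean, which is probably why the predictions for a user in method 2 cluster around a single value when most ratings are missing.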
