Dealing with missing data in SVD

Question

David

May 18, 2022 08:00

I am new to machine learning and am trying to apply SVD to the MovieLens dataset for movie recommendation. I have a movie-user matrix where each row is a user ID, each column is a movie ID, and each value is a rating.

Now, I would like to normalize the movie-user matrix (subtract each user's mean rating from that user's ratings), then pass the normalized matrix to scipy.sparse.linalg.svds as follows:

from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)
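
For context, here is a minimal sketch of how the factors returned by svds are commonly turned back into predicted ratings. The toy matrix R, the names user_means and predicted, and the small k are all assumptions made for illustration (a 4 x 5 matrix cannot use k = 50, because svds needs k to be smaller than the smaller matrix dimension); this is not necessarily the exact pipeline used in the question.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical toy data: 4 users x 5 movies, every rating observed.
R = np.array([
    [5.0, 3.0, 4.0, 1.0, 2.0],
    [4.0, 1.0, 5.0, 2.0, 3.0],
    [1.0, 5.0, 2.0, 4.0, 3.0],
    [2.0, 4.0, 1.0, 5.0, 3.0],
])

# Subtract each user's mean rating from that user's row.
user_means = R.mean(axis=1)
R_demeaned = R - user_means[:, None]

# k must be strictly smaller than min(R_demeaned.shape).
k = 2
U, sigma, Vt = svds(R_demeaned, k=k)

# Rank-k approximation of the de-meaned matrix, with the means added back.
predicted = U @ np.diag(sigma) @ Vt + user_means[:, None]
```

The last line matters for interpreting the results: whatever the low-rank part fails to capture, the prediction falls back to the user's mean.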

I have two ways to do this:

Method 1) Fill all the missing ratings with 0 first, then calculate each user's mean rating for normalization.

The predicted ratings dataframe for method 1 using SVD is:

Method 2) Calculate each user's mean rating first and normalize, then replace the missing ratings with 0.

The predicted ratings dataframe for method 2 using SVD is:

I would like to know which method is better, or whether there are other ways to do this. As far as I can see, with method 2 the predicted ratings for a given user are all quite similar. For example, user A may get 4.XX for every movie. With method 1 there is more variation. I would like to know whether something is wrong.
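
To make the difference between the two methods concrete, here is a small sketch on a made-up 2 x 4 ratings matrix. The data and the variable names are assumptions for illustration only; NaN stands for a missing rating.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, NaN = not rated.
R = np.array([
    [5.0, np.nan, 4.0, np.nan],
    [np.nan, 2.0, np.nan, 1.0],
])

# Method 1: fill with 0 first, then subtract the per-user mean.
filled = np.nan_to_num(R, nan=0.0)
mean_1 = filled.mean(axis=1)          # the filled zeros pull the mean down
method_1 = filled - mean_1[:, None]

# Method 2: mean over observed ratings only, subtract, then fill with 0.
mean_2 = np.nanmean(R, axis=1)
method_2 = np.nan_to_num(R - mean_2[:, None], nan=0.0)

print(mean_1)    # per-user means including the filled zeros
print(mean_2)    # per-user means over observed ratings only
print(method_1)
print(method_2)
```

In this sketch, method 2 leaves every missing entry at exactly 0 after centering, which is the same as treating it as a rating equal to the user's own mean. That may be one reason the method 2 predictions for a user cluster around a single value, while method 1 treats missing entries as ratings of 0 and so produces a wider spread.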

Topic movielens missing-data machine-learning

Category Data Science

Vincent Yong · Accepted Answer · March 24, 2020 17:51

I would recommend trying both methods to see which works better.

However, in my opinion, you should fill first and then normalise. If you normalise first, some ratings may end up at or very close to 0. When you then fill the missing values with 0, you are implicitly saying that the missing values had the same original value as the ratings that were normalised to 0, which would be incorrect.
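
The collision described above can be seen in a small sketch. The single user's ratings and all variable names here are hypothetical, with NaN marking an unrated movie.

```python
import numpy as np

# Hypothetical user: rated three movies, left one unrated (NaN).
ratings = np.array([4.0, np.nan, 3.0, 5.0])
observed = ~np.isnan(ratings)

# Normalise first: the mean of the observed ratings (4, 3, 5) is 4.0.
centered = ratings - np.nanmean(ratings)

# Then fill the missing entry with 0.
filled = np.nan_to_num(centered, nan=0.0)

# Entries stored as 0 that were real ratings vs. entries that were missing.
zero_and_rated = (filled == 0.0) & observed
zero_and_missing = (filled == 0.0) & ~observed

print(filled)
print(zero_and_rated)
print(zero_and_missing)
```

After this step, the real rating of 4 and the unrated movie are both stored as 0, so the SVD has no way to tell them apart.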
