Representing user information

I need to build a feature matrix for users. I have features such as gender and age, but I also have a multi-valued feature called "movies watched", which is essentially a separate table listing the names of the movies each user watched, together with a numeric duration. The order of the movies does not matter. The number of movies watched per user ranges from about 20 to 300. What is the best way to represent "movies watched" as a feature vector?

Topic representation feature-engineering feature-construction

Category Data Science


One-Hot (Multi-Hot) Encoding

For each user, create a vector as long as your movie catalog, with a 1 at the position of every movie the user watched and a 0 for every movie the user did not watch.

This is a very naive approach, but depending on your task it might be enough.

Do keep in mind that the resulting data is very sparse, so choose algorithms that can deal with sparse input. The good thing is that for sparse, high-dimensional data a linear model (or a linear kernel) is often sufficient, and more complex kernels are rarely necessary.
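As a minimal sketch of the 0/1 vector described above, the following plain-Python example builds one vector per user. The user names, movie names and the `watched` dictionary are made-up toy data, and `multi_hot` is a hypothetical helper, not part of any library.

```python
# Toy data (assumption): user -> {movie name: minutes watched}
watched = {
    "u1": {"movie_a": 75, "movie_b": 120},
    "u2": {"movie_b": 60, "movie_c": 30, "movie_d": 10},
}

# Fix a column order: one column per movie in the catalog.
catalog = sorted({movie for movies in watched.values() for movie in movies})
column = {movie: i for i, movie in enumerate(catalog)}


def multi_hot(user_movies):
    """Return a 0/1 list with a 1 at the column of every watched movie."""
    vector = [0] * len(catalog)
    for movie in user_movies:
        vector[column[movie]] = 1
    return vector


binary_features = {user: multi_hot(movies) for user, movies in watched.items()}

for user, vector in binary_features.items():
    print(user, vector)
```

With a real catalog of thousands of movies, most entries are 0, so it is usually better to store only the non-zero positions, for example in a sparse matrix type such as `scipy.sparse.csr_matrix`, than to keep dense lists like these.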

Taking Duration into account

You could also use the same encoding but, instead of a 1, store the duration for which the person watched each movie.

However, this biases the features toward movies that are longer than others. So, you can try two normalization techniques:

  1. Movie length normalization. Take the time the person watched the movie and divide it by the actual length of the movie (e.g. the user watched 75 mins, but the movie was 90 mins, so you get 75/90 = 0.83).
  2. TF-IDF. It is usually used in NLP, but you can apply it here. This method also takes into account how many of the users watched each movie: movies watched by almost everyone get a lower weight because they say little about an individual user, while movies watched by only a few users get a higher weight.
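The two normalizations above can be sketched in plain Python as follows. The toy data, the `runtime` table and the helper names are assumptions for illustration. The IDF formula used here, `log(N / n_users_who_watched) + 1`, is one common variant; other smoothing choices are equally valid.

```python
import math

# Toy data (assumption): minutes watched per user, and each movie's runtime.
watched = {
    "u1": {"movie_a": 75, "movie_b": 120},
    "u2": {"movie_b": 60, "movie_c": 30},
}
runtime = {"movie_a": 90, "movie_b": 120, "movie_c": 100}


def completion_ratio(user_movies):
    """Movie length normalization: minutes watched / movie runtime, capped at 1."""
    return {
        movie: min(minutes / runtime[movie], 1.0)
        for movie, minutes in user_movies.items()
    }


# Count how many users watched each movie (the "document frequency").
users_per_movie = {}
for movies in watched.values():
    for movie in movies:
        users_per_movie[movie] = users_per_movie.get(movie, 0) + 1

# Inverse document frequency: rarer movies get a larger weight.
n_users = len(watched)
idf = {
    movie: math.log(n_users / count) + 1.0
    for movie, count in users_per_movie.items()
}


def tfidf(user_movies):
    """Weight each completion ratio (the 'term frequency') by the movie's IDF."""
    ratios = completion_ratio(user_movies)
    return {movie: ratio * idf[movie] for movie, ratio in ratios.items()}


ratios_u1 = completion_ratio(watched["u1"])
weights_u1 = tfidf(watched["u1"])
print(ratios_u1)
print(weights_u1)
```

Here the completion ratio plays the role of the term frequency, so the two techniques are combined rather than used as alternatives; using raw minutes as the term frequency would also be a reasonable choice.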

Other

Depending on your use case, you can get creative and group movies:

  • Group movies by genre (e.g. 10 genres) and sum the time watched in each genre.
  • Check the percentage of each movie watched and sort the movies into three groups (watched only the beginning, watched more than half, watched the full movie), then count the movies in each group for each user.
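Both grouping ideas can be sketched in a few lines of plain Python. The `genre` and `runtime` tables, the toy user and the cut-off values (more than 50% for "more than half", at least 95% for "full") are assumptions chosen for illustration; pick thresholds that fit your data.

```python
from collections import Counter, defaultdict

# Toy lookup tables (assumption): genre and runtime in minutes per movie.
genre = {"movie_a": "drama", "movie_b": "comedy", "movie_c": "drama"}
runtime = {"movie_a": 90, "movie_b": 120, "movie_c": 100}

# Toy user (assumption): movie name -> minutes watched.
user_movies = {"movie_a": 20, "movie_b": 120, "movie_c": 60}


def minutes_per_genre(movies):
    """Sum the minutes watched in each genre."""
    totals = defaultdict(float)
    for movie, minutes in movies.items():
        totals[genre[movie]] += minutes
    return dict(totals)


def completion_bucket(fraction):
    """Map a watched fraction to one of three groups (thresholds are assumptions)."""
    if fraction >= 0.95:
        return "full"
    if fraction > 0.5:
        return "more_than_half"
    return "beginning_only"


def bucket_counts(movies):
    """Count how many of the user's movies fall into each completion group."""
    return Counter(
        completion_bucket(min(minutes / runtime[movie], 1.0))
        for movie, minutes in movies.items()
    )


genre_features = minutes_per_genre(user_movies)
bucket_features = bucket_counts(user_movies)
print(genre_features)
print(bucket_features)
```

Either result gives a short, dense feature vector per user (one value per genre or per group), which avoids the sparsity of the per-movie encoding at the cost of losing information about individual movies.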
