Combine multiple duplicate categorical variables into a single one for multiple linear regression

Question

mlbrulz

June 4, 2022 12:41

I am trying to create a regression model that predicts the box office success of a movie, with one of the explanatory variables being the actors who appear in the film.

My problem is that I decided to use the first four billed actors, but the model treats them as four separate variables (Actor 1, Actor 2, Actor 3, Actor 4). For example, Jack Nicholson is the lead in As Good as It Gets, so he would be Actor 1, but in A Few Good Men he would be Actor 2, so the model doesn't recognize the two entries as the same value in its calculations.

I want the model to treat Actor 1 the same as Actor 4 as inputs, so that the order in which the actors are assigned does not affect the output. So (Tom Cruise, Brad Pitt) would be treated the same as (Brad Pitt, Tom Cruise). Is there a model or method that I could use to solve this problem? If my problem isn't clear, I can clarify.

Topic machine-learning-model regression machine-learning

Category Data Science

Erwan · Accepted Answer · June 4, 2022 12:41

The issue is just that you are treating the list of actors as ordered; if you treat it as an (unordered) set instead, it works perfectly. The regular "bag of words" representation used for text handles this directly, with each distinct actor playing the role of a "word" in the vocabulary.

The principle is simple: every actor is assigned an index $i$, for example by sorting the actors alphabetically. Every movie (instance) has a set of actors (of any size) represented as an array of boolean values, where the value at index $i$ is 1 if and only if actor $i$ is in the movie.
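As a rough illustration of the encoding described above, here is a minimal sketch in plain Python. The toy casts and the helper names `build_vocabulary` and `encode_cast` are made up for this example, and silently skipping actors that are not in the vocabulary is an assumption, not part of the original answer.

```python
# Hypothetical toy data: each movie maps to its billed actors, in any order.
movies = {
    "As Good as It Gets": ["Jack Nicholson", "Helen Hunt"],
    "A Few Good Men": ["Tom Cruise", "Jack Nicholson", "Demi Moore"],
}


def build_vocabulary(casts):
    """Assign each distinct actor an index by sorting the names alphabetically."""
    actors = sorted({actor for cast in casts for actor in cast})
    return {actor: i for i, actor in enumerate(actors)}


def encode_cast(cast, vocabulary):
    """Return a 0/1 list with a 1 at index i if and only if actor i is in the cast.

    Assumption: actors missing from the vocabulary are skipped.
    """
    row = [0] * len(vocabulary)
    for actor in cast:
        if actor in vocabulary:
            row[vocabulary[actor]] = 1
    return row


vocabulary = build_vocabulary(movies.values())
features = {title: encode_cast(cast, vocabulary) for title, cast in movies.items()}
```

Because each row depends only on which actors are present, billing order has no effect: Jack Nicholson sets the same column whether he is listed first or second. The resulting rows can be used as explanatory variables in the regression. If scikit-learn is available, `MultiLabelBinarizer` in `sklearn.preprocessing` should produce an equivalent encoding.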
