Combine multiple duplicate categorical variables into a single one for multiple linear regression
I am trying to create a regression model that predicts the box office success of a movie, with one of the explanatory variables being the actors who appear in the film.
My problem is that I decided to use the first 4 billed actors, but the model treats them as 4 separate variables (Actor 1, Actor 2, Actor 3, Actor 4). For example, Jack Nicholson is the lead in As Good as It Gets, so he would be Actor 1, but in A Few Good Men he would be Actor 2, so the model doesn't recognize the two entries as the same value in its calculations.
I want the model to treat Actor 1 the same as Actor 4 as inputs, so that the order in which the actors are assigned does not affect the output. So (Tom Cruise, Brad Pitt) would be treated the same as (Brad Pitt, Tom Cruise). Is there a model or method that I could use to solve this problem? If my problem isn't clear, I can clarify further.
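To make the goal concrete, here is a minimal sketch of the behavior I am after, assuming a tiny made-up dataset and one 0/1 indicator column per actor. The names `movies`, `actors`, and `encode` are hypothetical and only for illustration; this is one possible encoding, not necessarily the right one for the regression.

```python
# Toy casts, made up for illustration; billing order differs between rows.
movies = [
    ("Tom Cruise", "Brad Pitt"),
    ("Brad Pitt", "Tom Cruise"),
    ("Jack Nicholson", "Tom Cruise"),
]

# One column per distinct actor, regardless of which billing slot they held.
actors = sorted({name for cast in movies for name in cast})


def encode(cast):
    """Return a 0/1 vector: 1 if the actor appears anywhere in the cast."""
    present = set(cast)  # a set discards the billing order
    return [1 if name in present else 0 for name in actors]


# Each row is the order-independent representation of one movie's cast.
encoded = [encode(cast) for cast in movies]

for cast, row in zip(movies, encoded):
    print(cast, row)
```

Here the first two rows come out identical even though the billing order is swapped, which is the property I want. For a larger dataset, scikit-learn's `MultiLabelBinarizer` builds the same kind of indicator matrix.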
Topic machine-learning-model regression machine-learning
Category Data Science