Categorical feature encoding
I am building a classification model on data with both categorical and continuous features. The categorical columns include binary ones such as sex (male, female) and multi-class ones such as location.
I need to encode these as numeric values. I would use one-hot encoding and drop the first column, but that is not realistic for test data that may contain values not seen during training, so I plan to use one-hot encoding with handle_unknown='ignore'.
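As a rough illustration of what handle_unknown='ignore' does, here is a minimal sketch assuming scikit-learn's OneHotEncoder. The location values are made up for the example and are not from my data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: a single "location" column.
train = np.array([["Paris"], ["Berlin"], ["Rome"]], dtype=object)
# Hypothetical test data with a location that was never seen during fit.
test = np.array([["Berlin"], ["Madrid"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a sparse matrix by default; convert it for inspection.
encoded_test = encoder.transform(test).toarray()

# Categories are sorted, so the columns are Berlin, Paris, Rome.
# "Berlin" gets a 1 in its column; the unseen "Madrid" becomes an all-zero row.
print(encoder.categories_)
print(encoded_test)
```

The unseen value does not raise an error; it is simply encoded as a row of zeros.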
However, I am worried about the multicollinearity this introduces into the data, especially for the columns with 2 classes.
The solution I have thought of is to apply LabelEncoder only to the columns with 2 classes and a one-hot encoder to the rest. This way the effect of multicollinearity is lessened.
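To make the idea concrete, here is a small sketch on made-up data, assuming scikit-learn and NumPy. It compares the column rank of the mixed encoding (one 0/1 column for sex, one-hot for location) with one-hot encoding of both columns. The variable names and values are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data, not taken from a real dataset.
sex = np.array(["male", "female", "female", "male", "female", "male"], dtype=object)
location = np.array(
    ["Paris", "Berlin", "Rome", "Berlin", "Paris", "Rome"], dtype=object
).reshape(-1, 1)

# Binary column as a single 0/1 column (classes are sorted: female -> 0, male -> 1).
sex_codes = LabelEncoder().fit_transform(sex).reshape(-1, 1)

# Multi-class column as a full one-hot block that tolerates unseen values.
loc_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(location).toarray()

# Mixed encoding: 1 sex column + 3 location columns.
mixed = np.hstack([sex_codes, loc_onehot])

# For comparison: one-hot for both, giving 2 sex columns + 3 location columns.
sex_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(
    sex.reshape(-1, 1)
).toarray()
full = np.hstack([sex_onehot, loc_onehot])

# A rank lower than the number of columns means exact linear dependence.
mixed_rank = np.linalg.matrix_rank(mixed)
full_rank = np.linalg.matrix_rank(full)
print(mixed.shape, mixed_rank)
print(full.shape, full_rank)
```

On this toy data the mixed matrix has full column rank, while the fully one-hot matrix does not, because the two sex columns and the three location columns each sum to 1 in every row. The location block on its own still sums to 1 per row, so a model with an intercept would still see that dependency.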
Does that seem right?
Please let me know what you think. Thank you.