Categorical feature encoding

I am making a classification model. I have categorical and continuous data. The categorical columns include columns with 2 classes such as sex (male, female), and multi-class columns such as location.

I need to encode these as numeric values. I would use one-hot encoding and drop the first column, but that is not realistic for unseen test data that may contain values not present in training. So I plan to use one-hot encoding with handle_unknown='ignore'.
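For reference, here is a minimal sketch of what handle_unknown='ignore' does in scikit-learn's OneHotEncoder. The city names are made-up illustration data, not from my real dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column with three training values.
train = np.array([["london"], ["paris"], ["rome"]])
test = np.array([["paris"], ["berlin"]])  # "berlin" was never seen in training

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Convert the sparse result to a dense array for inspection.
encoded_test = encoder.transform(test).toarray()
print(encoder.categories_[0])  # learned categories, sorted alphabetically
print(encoded_test)
```

The row for "paris" has a single 1 in its column, while the row for "berlin" is all zeros.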

However, my problem is that I am afraid of the multicollinearity this presents in the data, especially for the columns with 2 classes.

The solution I have thought of is to apply LabelEncoder only to the columns with 2 classes, and one-hot encoding to the rest. This way the effect of multicollinearity is lessened.

Does that seem right?
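A rough sketch of that plan, assuming scikit-learn's ColumnTransformer. One swap to note: LabelEncoder is documented as being for target labels, so the sketch uses OrdinalEncoder, its feature-side counterpart, for the 2-class column. The data and column positions are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: column 0 is sex (2 classes), column 1 is location.
X = np.array(
    [["male", "london"],
     ["female", "paris"],
     ["female", "rome"],
     ["male", "paris"]],
    dtype=object,
)

preprocess = ColumnTransformer(
    [
        # 2-class column becomes a single 0/1 column (female=0, male=1).
        ("binary", OrdinalEncoder(), [0]),
        # Multi-class column becomes one column per category.
        ("multi", OneHotEncoder(handle_unknown="ignore"), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocess.fit_transform(X)
print(encoded)
```

The output has four columns: one for sex, then three for location.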

Please let me know what you think. Thank you.

Topic categorical-encoding one-hot-encoding encoding classification machine-learning

Category Data Science


You do not need one-hot encoding for a variable with 2 categories. You simply encode it as a single column of 1s and 0s.

As a matter of fact, if you did one-hot encode a binary variable, you would end up with multicollinearity as you said, because the two columns would carry exactly the same information: each one is just 1 minus the other.
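To make the redundancy concrete, here is a small NumPy sketch with made-up values. It builds both indicator columns for a binary variable and checks the rank of the design matrix with and without the second column, assuming the model also has an intercept.

```python
import numpy as np

# Hypothetical sample of a binary variable.
sex = np.array(["male", "female", "female", "male", "female"])

# The two one-hot columns for the same variable.
is_male = (sex == "male").astype(float)
is_female = (sex == "female").astype(float)
intercept = np.ones_like(is_male)

# With both indicators plus an intercept, one column is a linear
# combination of the others: is_male + is_female == intercept.
full = np.column_stack([intercept, is_male, is_female])
reduced = np.column_stack([intercept, is_male])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 3 columns, but rank 2
print(reduced.shape[1], rank_reduced)  # 2 columns, rank 2
```

The three-column matrix has the same rank as the two-column one, so the extra indicator adds no information.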

The same thing occurs with any other categorical variable. If you have a variable with 3 categories and you one-hot encode it, all the information you need is carried in two of the three resulting columns, because the third is 1 minus the sum of the other two. Therefore, you should also drop the first column for any categorical variable. I am not sure what you mean by unseen data or why that would matter.
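As an illustration, here is a minimal sketch of dropping the first column for a 3-category variable, assuming scikit-learn's OneHotEncoder with drop="first". The location values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical 3-category variable.
location = np.array([["london"], ["paris"], ["rome"], ["paris"]])

# drop="first" removes the column of the first (alphabetical) category,
# which then acts as the reference level encoded as all zeros.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(location).toarray()

# Names of the categories that still have their own column.
kept = [str(c) for c in encoder.categories_[0][1:]]
print(kept)
print(encoded)
```

Here "london" is the reference level, so its row is all zeros, and the two remaining columns are enough to tell the three categories apart.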
