Dummy Variables and Scaling in Regression Problems

I was wondering if having dummy variables and scaling the other variables could hurt my model. In particular, I have implemented a Random Forest Regressor using scikit-learn, and my dataset is composed of a set of dummy variables and 2 numerical variables. I approached it this way (a code sketch follows the list):

  1. Convert the categorical variables to dummy variables
  2. Separate the numerical variables
  3. Scale the numerical variables (from step 2) with scikit-learn's StandardScaler
  4. Join the dummy and numerical variables
  5. Split into train and test sets
  6. Train the model
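
Here is a minimal sketch of the pipeline described above; the DataFrame, column names, and values are made up for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one categorical column, two numeric columns,
# and a target column (all names are illustrative assumptions).
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "num1": [1.0, 2.5, 3.1, 0.7, 2.2, 1.8],
    "num2": [10.0, 20.0, 15.0, 12.0, 18.0, 11.0],
    "target": [1.2, 2.3, 3.1, 0.9, 2.0, 1.5],
})
num_cols = ["num1", "num2"]

# 1. Convert the categorical column to dummy variables
dummies = pd.get_dummies(df[["color"]])

# 2-3. Separate the numerical columns and standardize them
scaled_num = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]),
                          columns=num_cols, index=df.index)

# 4. Join the dummy and numerical columns
X = pd.concat([dummies, scaled_num], axis=1)
y = df["target"]

# 5. Split into train and test sets (note: scaling happened BEFORE
#    the split here; the answer below flags this as a leakage problem)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 6. Train the model
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
```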

Would this approach create an inappropriate bias, considering the different scales of the dummy variables and the scaled numerical ones? Or, at least, is this approach correct?

Topic dummy-variables random-forest

Category Data Science


> Would this approach create an inappropriate bias, considering the different scales of the dummy variables and the scaled numerical ones? Or, at least, is this approach correct?

"It is fine" because One-Hot-Encoded features don't have a very large scale. Issue of scale comes when features are at a very different scale e.g. 1 vs 105 etc.
Standardizing them will disturb the sparsity(i.e. log of zeroes). Normalizing them will not have any impact.
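
A small demonstration of that difference on a single one-hot column (the exact standardized values depend on the column's mean and standard deviation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A single one-hot column: sparse, containing only 0s and 1s.
one_hot = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])

# Standardizing shifts the zeros away from zero -> sparsity is destroyed.
print(StandardScaler().fit_transform(one_hot).ravel())
# approx. [ 1.22  -0.82  -0.82   1.22  -0.82]

# Min-max scaling maps the min (0) to 0 and the max (1) to 1 -> unchanged.
print(MinMaxScaler().fit_transform(one_hot).ravel())
# [1. 0. 0. 1. 0.]
```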

What is incorrect here:
You are standardizing the dataset prior to the train/test split. It must be the other way round: split first, then fit the scaler on the training set only and apply it to the test set. Otherwise, statistics from the test set leak into training.
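
A minimal sketch of the corrected order, reusing the variable names from the question's sketch but assuming X still holds the raw, unscaled numerical columns:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first (X, y, num_cols as built earlier, numerics unscaled)...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()

# ...then fit the scaler on the training rows only and apply that
# train-fitted transformation to the test rows, so no test-set
# statistics enter the training pipeline.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

In practice, wrapping the scaler and the model in a scikit-learn Pipeline enforces this ordering automatically.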

What is optional here:
You don't need scaling or one-hot encoding for a tree-based model at all. Trees split on feature thresholds, so monotonic rescalings of a feature don't change the splits. Just label-encode the categoricals, split, and train.
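
A minimal sketch of that shortcut, assuming the same toy columns as above (note that scikit-learn trees treat the integer codes as ordered values, which is usually acceptable for random forests):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Same hypothetical toy data as in the question's sketch.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "num1": [1.0, 2.5, 3.1, 0.7, 2.2, 1.8],
    "num2": [10.0, 20.0, 15.0, 12.0, 18.0, 11.0],
    "target": [1.2, 2.3, 3.1, 0.9, 2.0, 1.5],
})

X = df.drop(columns="target").copy()
y = df["target"]

# Label-encode the categorical column; no scaling anywhere.
X[["color"]] = OrdinalEncoder().fit_transform(X[["color"]])

# Split, then train directly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
```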
