Dummy Variables and Scaling in Regression Problems

I was wondering if having dummy variables and scaling the other variables could hurt my model. In particular, I have implemented a Random Forest Regressor using scikit-learn, and my dataset is composed of a set of dummy variables and 2 numerical variables. I approached it this way (a code sketch follows the list):

  1. Convert the categorical variables to dummy variables
  2. Separate the numerical variables
  3. Scale the numerical variables (from step 2) with scikit-learn's StandardScaler
  4. Join the dummy and numerical variables
  5. Split into train and test sets
  6. Train the model
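
Here is a minimal sketch of the pipeline described above; the DataFrame, column names, and values are made up for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one categorical column, two numeric columns,
# and a target column (all names are illustrative assumptions).
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "num1": [1.0, 2.5, 3.1, 0.7, 2.2, 1.8],
    "num2": [10.0, 20.0, 15.0, 12.0, 18.0, 11.0],
    "target": [1.2, 2.3, 3.1, 0.9, 2.0, 1.5],
})
num_cols = ["num1", "num2"]

# 1. Convert the categorical column to dummy variables
dummies = pd.get_dummies(df[["color"]])

# 2-3. Separate the numerical columns and standardize them
scaled_num = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]),
                          columns=num_cols, index=df.index)

# 4. Join the dummy and numerical columns
X = pd.concat([dummies, scaled_num], axis=1)
y = df["target"]

# 5. Split into train and test sets (note: scaling happened BEFORE
#    the split here; the answer below flags this as a leakage problem)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 6. Train the model
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
```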

Would this approach create an inappropriate bias, considering the different scales of the dummy variables and the scaled numerical ones? Or, at least, is this approach correct?

Topic dummy-variables random-forest

Category Data Science


> Would this approach create an inappropriate bias, considering the different scales of the dummy variables and the scaled numerical ones? Or, at least, is this approach correct?

"It is fine" because One-Hot-Encoded features don't have a very large scale. Issue of scale comes when features are at a very different scale e.g. 1 vs 105 etc.
Standardizing them will disturb the sparsity(i.e. log of zeroes). Normalizing them will not have any impact.
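
A small demonstration of that difference on a single one-hot column (the exact standardized values depend on the column's mean and standard deviation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A single one-hot column: sparse, containing only 0s and 1s.
one_hot = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])

# Standardizing shifts the zeros away from zero -> sparsity is destroyed.
print(StandardScaler().fit_transform(one_hot).ravel())
# approx. [ 1.22  -0.82  -0.82   1.22  -0.82]

# Min-max scaling maps the min (0) to 0 and the max (1) to 1 -> unchanged.
print(MinMaxScaler().fit_transform(one_hot).ravel())
# [1. 0. 0. 1. 0.]
```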

What is incorrect here:
You are standardizing the dataset prior to the train/test split. It must be the other way round: split first, then fit the scaler on the training set only and apply it to the test set. Otherwise, statistics from the test set leak into training.
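
A minimal sketch of the corrected order, reusing the variable names from the question's sketch but assuming X still holds the raw, unscaled numerical columns:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first (X, y, num_cols as built earlier, numerics unscaled)...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()

# ...then fit the scaler on the training rows only and apply that
# train-fitted transformation to the test rows, so no test-set
# statistics enter the training pipeline.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

In practice, wrapping the scaler and the model in a scikit-learn Pipeline enforces this ordering automatically.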

What is optional here:
You don't need scaling or one-hot encoding for a tree-based model at all. Trees split on feature thresholds, so monotonic rescalings of a feature don't change the splits. Just label-encode the categoricals, split, and train.
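
A minimal sketch of that shortcut, assuming the same toy columns as above (note that scikit-learn trees treat the integer codes as ordered values, which is usually acceptable for random forests):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Same hypothetical toy data as in the question's sketch.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "num1": [1.0, 2.5, 3.1, 0.7, 2.2, 1.8],
    "num2": [10.0, 20.0, 15.0, 12.0, 18.0, 11.0],
    "target": [1.2, 2.3, 3.1, 0.9, 2.0, 1.5],
})

X = df.drop(columns="target").copy()
y = df["target"]

# Label-encode the categorical column; no scaling anywhere.
X[["color"]] = OrdinalEncoder().fit_transform(X[["color"]])

# Split, then train directly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
```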
