Should I always transform data to a normal distribution?

I am trying to understand transformations, and this question seems to be on my mind and some other people's too. If we have a numeric variable, will transforming the data (log, power transforms) toward a normal distribution help the model learn better in EVERY data science case? And what about stationarity? Stationarity is a different thing from transforming data so that it has a normal distribution. Will transforming EVERY numeric variable to be stationary make EVERY model learn better too?

Topic: transformation, deep-learning, data-cleaning, data-mining

Category: Data Science


The short answer is no, you don't always need to transform your data to a normal distribution.

This depends a lot on the learning algorithm you're using. Additionally, you should treat continuous and categorical variables differently.

Continuous variables:

Tree-based models such as Decision Trees, Random Forest, Gradient Boosting, XGBoost, and others are not affected by the distribution of your data: their splits depend only on the ordering of the feature values, so a monotonic transformation like a log leaves the learned tree essentially unchanged.
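As a quick illustration (a minimal sketch with scikit-learn; the skewed feature and target here are made up), fitting the same decision tree on a feature and on its log gives identical predictions, because the log preserves the sample ordering and hence the splits:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.lognormal(size=(200, 1))                  # heavily skewed feature
y = (x[:, 0] > 1.5).astype(float) + rng.normal(scale=0.1, size=200)

tree_raw = DecisionTreeRegressor(random_state=0).fit(x, y)
tree_log = DecisionTreeRegressor(random_state=0).fit(np.log(x), y)

# The log is monotonic, so the ordering of the samples (and therefore
# the best splits) is unchanged -- predictions come out identical
assert np.allclose(tree_raw.predict(x), tree_log.predict(np.log(x)))
```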

However, algorithms like Linear Regression, Logistic Regression, KNN, or Neural Nets can be highly affected by both the distribution and the scale of your data. For these algorithms you will likely get better results, and often faster training, if you transform the data first.
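For example, here is a minimal sketch (using scikit-learn; the right-skewed feature is synthetic) of the usual options, namely a log transform, a learned power transform, and plain standardization:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(42)
x = rng.lognormal(sigma=1.0, size=(1000, 1))   # right-skewed, income-like

# Log transform: pulls in the long right tail (log1p tolerates zeros)
x_log = np.log1p(x)

# Power transform: learns an exponent that makes the result as
# Gaussian as possible; Yeo-Johnson also handles negative values
x_power = PowerTransformer(method="yeo-johnson").fit_transform(x)

# Standardization: fixes the scale (zero mean, unit variance) but not
# the shape -- often enough for KNN and neural nets
x_std = StandardScaler().fit_transform(x)
```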

Categorical variables:

Regardless of which algorithm you're using, you should one-hot-encode nominal categorical variables (this is the most common way, but there are other approaches such as Feature Hashing and bin counting that might work better if you have many categories). If they're ordinal, keep them as they are, given that they are integers; if not, convert them to integers while maintaining the implied order.
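A minimal sketch of both cases with scikit-learn (the columns and category order below are made up; note that `sparse_output` requires scikit-learn >= 1.2, older versions call it `sparse`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame: "color" is nominal, "size" is ordinal
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size":  ["small", "large", "medium", "small"],
})

# Nominal: one-hot encode, no order implied between categories
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
color_encoded = ohe.fit_transform(df[["color"]])    # shape (4, 3)

# Ordinal: integer-encode while preserving the implied order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ord_enc.fit_transform(df[["size"]])  # small=0, medium=1, large=2
```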



Extra side note:

Also, make sure not to scale the entire dataset at once, to prevent data leakage. Instead, fit the scaler on your train set only, then apply that same fitted scaler to the test set, as explained in this SO answer.
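In scikit-learn terms that looks like the following (the feature matrix here is just a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))  # placeholder feature matrix

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned on train only
X_test_scaled = scaler.transform(X_test)        # same statistics reused on test
```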
