Outlier treatment

I am working on a regression problem where several variables contain many outliers. As far as I can tell, there are three things I can do with them:

  1. Remove them (least attractive option)

  2. Transform them (log transformation, Box-Cox transformation, etc.)

  3. Do nothing and build a model including them

My question is about the second option. Is it acceptable to transform my features with one of these transformations solely to deal with outliers?
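For reference, here is a minimal sketch of what option 2 looks like in plain Python. The sample data, the helper names `box_cox` and `outlier_distance`, and the fixed lambda of 0.5 are all assumptions made for illustration; in practice lambda is usually estimated from the data by a library routine rather than fixed by hand.

```python
import math
import statistics


def box_cox(values, lam):
    """Box-Cox transform with a fixed lambda (hypothetical helper).

    Only defined for strictly positive values; lam == 0 is the log transform.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def outlier_distance(values):
    """Distance of the maximum from the median, in units of the IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return (max(values) - median) / (q3 - q1)


# Made-up feature with one extreme value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 500]

log_x = box_cox(x, 0)     # log transformation
bc_x = box_cox(x, 0.5)    # Box-Cox with an assumed lambda

print("raw:    ", round(outlier_distance(x), 2))
print("boxcox: ", round(outlier_distance(bc_x), 2))
print("log:    ", round(outlier_distance(log_x), 2))
```

Both transformations pull the extreme value closer to the rest of the data without dropping the row, which is the effect the question is asking about.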

Topic transformation feature-engineering outlier python

Category Data Science


Although it is the least attractive option, the best solution is to remove them. Keeping outliers in the dataset, even in transformed form, still changes the dataset substantially. For example, if your goal is to build a machine learning model, training on modified data distorts what the model learns and therefore gives you an unreliable result.

This is summarized by the principle "garbage in, garbage out": if you use garbage data as input, you will get garbage results. Clean data is therefore very important; it is better to have less data than more data that is unreliable.
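As a rough sketch of what removing outliers can look like in Python, the snippet below drops rows whose value falls outside the common 1.5 x IQR fences. The answer does not say how outliers should be identified, so the IQR rule, the helper name `remove_outliers_iqr`, the column name `"x"`, and the sample rows are all assumptions.

```python
import statistics


def remove_outliers_iqr(rows, column, k=1.5):
    """Keep only rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[column] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if low <= row[column] <= high]


# Made-up dataset: one feature "x" with a single extreme value.
rows = [{"x": v} for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 500]]

cleaned = remove_outliers_iqr(rows, "x")
print(len(rows), "rows before,", len(cleaned), "rows after")
```

With several variables, the same filter would have to be applied per column, and each pass shrinks the dataset further, which is the trade-off behind "better less data".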
