What if outliers still exist after variable transformation?

I have a variable with a skewed distribution.

I applied a Box-Cox transformation, and the variable now follows an approximately Gaussian distribution. However, as the boxplot in the image below shows, outliers still exist.

My question is:

Although the variable's distribution is nearly Gaussian after the transformation, should we still use this transformation if outliers remain?

Or should we use another technique, such as discretization, to capture all the outliers?
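For reference, the situation can be reproduced with a small sketch like the one below. It is not the original data: the lognormal sample, the seed, and the helper name `iqr_outliers` are assumptions made for illustration. It applies `scipy.stats.boxcox` and then counts the points that fall outside the boxplot whiskers (the usual 1.5 * IQR rule) before and after the transformation.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed, strictly positive data (an assumption, not the real variable).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# Box-Cox with lambda estimated by maximum likelihood.
x_bc, lam = stats.boxcox(x)


def iqr_outliers(values, k=1.5):
    """Boolean mask of points outside the boxplot whiskers (k * IQR rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


n_before = int(iqr_outliers(x).sum())
n_after = int(iqr_outliers(x_bc).sum())
print(f"lambda={lam:.3f}, flagged before={n_before}, flagged after={n_after}")
```

Note that the 1.5 * IQR rule flags roughly 0.7% of points even in an exactly Gaussian sample, so a handful of flagged points after the transformation is expected rather than a sign that the transformation failed.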

Topic transformation feature-engineering outlier

Category Data Science


There is no single right approach for every case. I have dealt with outliers from both a statistical and a business point of view. Some questions and options to consider:

  1. Are the outliers in a segment the business is expanding into, where it expects more people? If so, they are not outliers from the business point of view and should probably be kept.
  2. Are the outliers in a segment the business is retreating from, where it expects fewer people? If so, you may want to drop these records.
  3. Are these records outliers only in this feature, or in others as well? If they are far from the mean in just one feature, you may want to keep them.
  4. Discretizing: I am not a fan, because the model loses information. You can instead add an indicator variable and try the model. It may be worth trying several bucketing approaches, but I still think the model should see the real numbers.
  5. I have built models that always got these outliers correct, or always incorrect, regardless of whether they were in the training data, so including them was a moot point. Make sure they appear in a validation set so you can check this: build a dedicated outlier validation set.
  6. Is the business treating these people differently regardless of the model? For example, with a marketing model, the business might target high-income people no matter what the model says, so including them in the model may penalize lower-income people. Try building the model with and without them.
  7. I am sure I am missing other techniques I have used. Outliers are not automatically wrong, and they are not automatically right. They can be examined from a purely statistical view, but when building a model we should know something about the business problem and have access to subject matter experts, so we can make a better-informed decision. And always test. You might need to build multiple models.
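Points 4 to 6 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a recipe: the `income` and `spend` data are synthetic, a few extreme values are planted on purpose, plain least squares stands in for whatever model you actually use, and the helper names (`design`, `fit`, `mae`) are hypothetical. The model sees the real numbers plus an outlier indicator, half of the flagged records are held out as an outlier validation set, and the model is fitted with and without the remaining outliers.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic skewed feature with a few planted extreme values (assumption).
income = rng.lognormal(mean=10.0, sigma=0.6, size=n)
income[:10] *= 20.0
spend = 0.05 * income + rng.normal(0.0, 200.0, size=n)  # synthetic target

# Flag outliers with the 1.5 * IQR rule on the log scale.
log_income = np.log(income)
q1, q3 = np.percentile(log_income, [25, 75])
iqr = q3 - q1
is_outlier = (log_income < q1 - 1.5 * iqr) | (log_income > q3 + 1.5 * iqr)

# Hold out every second flagged record as the outlier validation set.
held_out = np.flatnonzero(is_outlier)[::2]
train_mask = np.ones(n, dtype=bool)
train_mask[held_out] = False


def design(values, flags):
    """Design matrix: intercept, raw value, outlier indicator."""
    return np.column_stack([np.ones_like(values), values, flags.astype(float)])


def fit(mask):
    """Least-squares fit on the rows selected by the boolean mask."""
    X = design(income[mask], is_outlier[mask])
    coef, *_ = np.linalg.lstsq(X, spend[mask], rcond=None)
    return coef


def mae(coef, idx):
    """Mean absolute error on the rows selected by idx."""
    pred = design(income[idx], is_outlier[idx]) @ coef
    return float(np.mean(np.abs(pred - spend[idx])))


coef_with = fit(train_mask)                   # remaining outliers kept in training
# Without outliers the indicator column is all zeros; lstsq then returns the
# minimum-norm solution, which leaves that coefficient at zero.
coef_without = fit(train_mask & ~is_outlier)

mae_with = mae(coef_with, held_out)
mae_without = mae(coef_without, held_out)
print(f"outlier validation MAE, trained with outliers:    {mae_with:.1f}")
print(f"outlier validation MAE, trained without outliers: {mae_without:.1f}")
```

If the two errors on the outlier validation set are close, including the outliers in training is probably a moot point for that model. If they differ a lot, that is the cue to talk to the subject matter experts before deciding.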
