Can I leave natural outliers in a dataset in training?

Can I leave natural outliers unedited in a dataset (that is, outliers that did not appear just because of mistyping or other mistakes in the data)? Or should I remove or change them as well?

Topic pretraining outlier statistics

Category Data Science


Yes, you should keep the natural outliers in a dataset. They represent an extreme end of the data you have and contain useful information. They also help with anomaly detection, if that is something you want to do.

But it also depends on the type of problem at hand. Take the Titanic dataset, for example, where we are classifying who survived and who didn't. There it is OK to remove the outliers, as removing them won't be detrimental to the result. The outcome is historical (the passengers are already dead), so removing the outliers won't lead to any serious loss.

On the other hand, when classifying whether a patient has a tumor or not, removing the outliers would be a bad idea, as it can lead to misclassification and ultimately to an incorrect diagnosis or treatment.

If you are certain that these outliers are the result of mistyping, then you can safely remove them, but only if you are certain. Otherwise, for a real-world problem, it is always wise to keep the outliers.
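One way to act on this is to flag extreme values for manual review instead of dropping them. Here is a minimal sketch of that idea using the common 1.5 × IQR rule. The `flag_outliers` helper, the `k` parameter and the fare values are made up for illustration, and the IQR rule is just one possible choice of threshold, not something the question prescribes.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the IQR fences."""
    # Quartiles of the data; "inclusive" treats the values as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical ticket fares; the last one is extreme but could be genuine.
fares = [7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 52.0, 512.0]
flags = flag_outliers(fares)

# Every row is kept; the flag only marks which values to check by hand.
for fare, is_outlier in zip(fares, flags):
    print(fare, "review" if is_outlier else "ok")
```

Only values you can confirm as typos after checking would then be removed or corrected; the rest stay in the training data.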

Cheers!
