How to remove outliers properly?

I was wondering what the best practice is for removing outliers from data. Plotting a boxplot for each feature (column of the dataset) and removing data that fall outside the whiskers seems like a naive and problematic approach. For example, say you have many individuals with a 'gender' label and an 'income' label. Also assume that there are many more men in the dataset than women. Unfortunately, due to income disparity we may see that women receive a lower wage than men, so if we were to simply plot a boxplot on the income feature and remove outliers, we wouldn't be taking into account that some of those data points come from a different group (and furthermore, since there are more men than women, we would likely remove a lot of the women from the dataset).

It seems like a better approach would be to remove outliers on a group-by-group basis, i.e., perform outlier analysis on individuals that share the same identifiers of sorts. Is there a way to do this in Python?
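As a sketch of the group-by-group idea, here is one way it might look in pandas, using the boxplot-style 1.5×IQR rule applied within each group rather than globally. The data and column names are hypothetical, made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: more men than women, with women earning less on average
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": ["M"] * 80 + ["F"] * 20,
    "income": np.concatenate([
        rng.normal(60_000, 10_000, 80),
        rng.normal(45_000, 8_000, 20),
    ]),
})

def iqr_mask(s, k=1.5):
    # Boxplot-style rule: keep values within k*IQR of the quartiles
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

# transform applies the rule per gender group, so women are only
# compared against other women, not against the male-dominated total
kept = df[df.groupby("gender")["income"].transform(iqr_mask)]
```

A global IQR filter on `income` here could flag many women simply for being in the lower-paid group; the grouped version avoids that.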

I am still learning data science so I'm sure this has a term that I am not aware of. Any insight or links to good resources would be greatly appreciated.

Topic: preprocessing, outlier, reference-request, python, data-cleaning

Category: Data Science


No data point should be removed under any circumstances unless you are truly convinced that it was acquired in error. The so-called "outliers" convey a great amount of information about the system at its boundaries. If your data genuinely contains a point, how can you justify removing it? From an information-theory point of view, since outliers have such a small probability of occurring, they have a large informational content. Removing these points throws away all the information that you were so lucky to observe, whereas the data points that occur with high probability are virtually void of information.


Yes, the problem of imbalance is a genuine one during preprocessing. There are no hard-and-fast rules for removing outliers, only generic methodologies (percentile, boxplot, Z-score, etc.). Take gender and salary: if you pool the salaries of all employees and remove outliers, you effectively eliminate all highly paid employees, so your model learns mostly about middle/average-salaried employees (outlier handling). But if you keep them, they will influence the fit and the model will learn less about average-salaried employees. (Putting the data on a log scale can reduce that effect, though not eliminate it.)
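To make the Z-score method concrete, here is a small sketch on made-up salary data, applying the same rule on the raw scale and after a log transform (which compresses the right tail):

```python
import numpy as np

# Hypothetical incomes with one very large value
incomes = np.array([30_000, 35_000, 40_000, 42_000, 45_000, 500_000], dtype=float)

# Z-score rule on the raw scale: |z| above a cutoff flags an outlier
z = (incomes - incomes.mean()) / incomes.std()
flagged_raw = incomes[np.abs(z) > 2]

# The same rule after a log transform
log_inc = np.log(incomes)
z_log = (log_inc - log_inc.mean()) / log_inc.std()
flagged_log = incomes[np.abs(z_log) > 2]
```

The cutoff (2 here; 3 is also common) and the choice of scale are modelling decisions, not fixed rules, so it is worth inspecting what each variant actually flags on your data.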

The right solution generally depends on the objective and the kind of training we want to do. After preprocessing, imbalance in the data (such as gender) can be compensated for by oversampling or undersampling (we can generate more data points, so don't worry about that). But be sure before dropping any data that is available! Taking a few similar columns at a time and processing them together, instead of applying one common operation to all groups, generally works well.
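As a minimal sketch of oversampling with plain pandas (libraries like imbalanced-learn offer more sophisticated methods), assuming a hypothetical dataset with an 80/20 gender split:

```python
import pandas as pd

# Hypothetical imbalanced data: 80 men, 20 women
df = pd.DataFrame({"gender": ["M"] * 80 + ["F"] * 20})

# Resample every group (with replacement) up to the majority group's size
majority_n = df["gender"].value_counts().max()
balanced = pd.concat([
    g.sample(majority_n, replace=True, random_state=0)
    for _, g in df.groupby("gender")
])
```

Sampling with replacement duplicates minority rows rather than inventing new ones; undersampling (sampling the majority down to the minority size) is the mirror-image choice when you can afford to discard data.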

These might be helpful:

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
