How to remove outliers properly?
What is the best practice for removing outliers from data? Plotting a boxplot for each feature (column of the dataset) and removing the data points that fall outside the whiskers seems like a naive and problematic approach. For example, say you have many individuals with a 'gender' label and an 'income' label, and assume that there are many more men than women in the dataset. Because of income disparity, the women in the data may earn less than the men. If we simply draw one boxplot for the income feature and remove the outliers, we ignore the fact that some of those data points come from a different group. Furthermore, because men dominate the dataset, the whiskers would mostly reflect the men's incomes, so we would likely remove a lot of the women.
It seems like a better approach would be to remove outliers on a group-by-group basis, i.e., to perform the outlier analysis separately within each set of individuals who share the same group label (for example, the same gender). Is there a way to do this in Python?
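To make the idea concrete, here is a minimal sketch of what I have in mind, assuming pandas and the usual boxplot rule (points more than 1.5 × IQR beyond the quartiles count as outliers). The function name `remove_outliers_by_group`, the column names, and the toy numbers are all made up for illustration; I am not sure this is the right way to do it.

```python
import pandas as pd


def remove_outliers_by_group(df, group_col, value_col, k=1.5):
    """Keep rows whose value lies within the boxplot fences of their own group.

    Hypothetical helper: the fences are Q1 - k*IQR and Q3 + k*IQR,
    computed separately for each value of `group_col`.
    """
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    mask = (df[value_col] >= lower) & (df[value_col] <= upper)
    return df[mask]


# Toy data (invented): more men than women, one extreme income among the men
df = pd.DataFrame({
    "gender": ["M"] * 7 + ["F"] * 4,
    "income": [50, 52, 54, 56, 58, 60, 500, 30, 32, 34, 36],
})

cleaned = remove_outliers_by_group(df, "gender", "income")
print(cleaned)
```

Here each row is compared only against the quartiles of its own gender group, so a lower-income group is not judged against whiskers that are dominated by the larger group.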
I am still learning data science so I'm sure this has a term that I am not aware of. Any insight or links to good resources would be greatly appreciated.
Topic preprocessing outlier reference-request python data-cleaning
Category Data Science