Should outliers be removed only from the target variable or from any variable where they are found?

Question

letdatado

April 17, 2022 04:04

I usually check boxplots and histograms of the target (dependent) variable and, after careful consideration, treat or remove the outliers. But I do this only for the target variable. That is, if I decide on removal, I simply drop the entire row in which the target value is an outlier.

Suppose I also have outliers in some independent variables. What should I do there?

Should I ignore them?

Or should I take the same approach with the independent variables as I took with the target variable?

EDIT: Take the following example. Assume that we are predicting customer expenditure, target_expenditure_USD. The other variables are independent variables.

age	sex	last_purchase	target_expenditure_USD
34	M	12-02-2020	520,000
24	F	02-06-2019	2,234
43	F	10-08-2018	4,365
130	M	23-07-2020	1,424
45	F	12-01-1839	6,453

Thanks

Topic feature-scaling outlier statistics data-cleaning machine-learning

Category Data Science

user2974951 · Accepted Answer · November 9, 2021 08:20

Continuing from the comments.

You should inspect all variables for outliers, not just your dependent variable (y). If you find any, you should do something about them.
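
As a rough illustration of checking every variable rather than only the target, here is a minimal sketch that applies Tukey's 1.5 × IQR rule to each numeric column of the example table from the question. The rule, the helper name `iqr_outlier_flags`, and the derived `purchase_year` column are assumptions made for this sketch, not something the question or this answer prescribes. With only five rows the fences are very crude, so treat the flags as values to look at, not values to delete.

```python
from datetime import datetime
from statistics import quantiles

# Rows copied from the example table in the question.
rows = [
    {"age": 34, "sex": "M", "last_purchase": "12-02-2020", "target_expenditure_USD": 520_000},
    {"age": 24, "sex": "F", "last_purchase": "02-06-2019", "target_expenditure_USD": 2_234},
    {"age": 43, "sex": "F", "last_purchase": "10-08-2018", "target_expenditure_USD": 4_365},
    {"age": 130, "sex": "M", "last_purchase": "23-07-2020", "target_expenditure_USD": 1_424},
    {"age": 45, "sex": "F", "last_purchase": "12-01-1839", "target_expenditure_USD": 6_453},
]


def iqr_outlier_flags(values, k=1.5):
    """Return one bool per value: True if it lies outside the Tukey fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


# Assumption: dates are day-month-year (23-07-2020 only parses that way).
columns = {
    "age": [r["age"] for r in rows],
    "purchase_year": [
        datetime.strptime(r["last_purchase"], "%d-%m-%Y").year for r in rows
    ],
    "target_expenditure_USD": [r["target_expenditure_USD"] for r in rows],
}

# Map each column name to the row indices whose value is flagged.
flagged = {
    name: [i for i, is_out in enumerate(iqr_outlier_flags(vals)) if is_out]
    for name, vals in columns.items()
}
print(flagged)
```

Here the flags land on three different rows: one for the target and two for independent variables, which is why looking only at the target would miss the age of 130 and the purchase date in 1839.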

If you are certain that they are in fact erroneous measurements, then ideally you would drop the whole row. If, however, you cannot determine that (and it doesn't look like it), then you shouldn't just drop or change them. It would be better to keep your data as-is, perhaps mention the unusual values, and analyze the data with models that are robust to outliers.
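
To make "robust to outliers" concrete, here is a small sketch comparing ordinary least squares with the Theil-Sen estimator, which takes the median of all pairwise slopes. Theil-Sen is only one possible robust model and is chosen here for illustration. The data below is hypothetical (a clean line y = 2x + 1 with one extreme value) and is not taken from the question.

```python
from itertools import combinations
from statistics import mean, median


def ols_fit(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def theil_sen_fit(xs, ys):
    """Theil-Sen fit: median of pairwise slopes, then median residual offset."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median([y - slope * x for x, y in points])
    return slope, intercept


# Hypothetical data: y = 2x + 1, with the last value replaced by an outlier.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[9] = 500

ols_slope, ols_intercept = ols_fit(xs, ys)
ts_slope, ts_intercept = theil_sen_fit(xs, ys)

print(f"OLS:       slope={ols_slope:.2f}, intercept={ols_intercept:.2f}")
print(f"Theil-Sen: slope={ts_slope:.2f}, intercept={ts_intercept:.2f}")
```

The single extreme value pulls the least-squares slope far away from 2, while the median-based fit still recovers the underlying line without any rows being dropped.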
