Confusion on Outliers

I can't work out which rule to use for detecting outliers: when should I go with the std. dev., and when with the median?

My understanding of the std. dev. rule is: if a data point is more than 2 std. dev. away from the mean, we consider it an outlier. Similarly, for the median, we say that any data point that does not lie between Q1 and Q3 is an outlier.

So I am confused as to which one to choose.

Can you guys help me understand?

Topic outlier statistics machine-learning

Category Data Science


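It depends entirely on the context of the data being considered. For example, how much of the data lies within $2\sigma$ of the mean ($\mu$) depends on the distribution of the data, that is, on the value of $\mathbb{P}(-2\sigma < X - \mu < 2\sigma)$. For a normal distribution this is about 95%, whereas for an arbitrary distribution Chebyshev's inequality only guarantees that it is at least 75%.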
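To make the dependence on the distribution concrete, here is a small sketch that evaluates $\mathbb{P}(-k\sigma < X - \mu < k\sigma)$ in closed form for three textbook distributions, next to the Chebyshev lower bound. The function names are my own, and the three distributions are chosen only for illustration; your data may follow none of them.

```python
import math


def normal_within(k: float) -> float:
    """P(|X - mu| < k*sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2))


def exponential_within(k: float) -> float:
    """P(|X - mu| < k*sigma) for an exponential distribution."""
    # The answer does not depend on the rate, so use rate 1: mu = sigma = 1.
    # The interval (1 - k, 1 + k) is clipped at 0 because X >= 0.
    lower = max(0.0, 1.0 - k)
    upper = 1.0 + k
    return math.exp(-lower) - math.exp(-upper)


def uniform_within(k: float) -> float:
    """P(|X - mu| < k*sigma) for a continuous uniform distribution."""
    # On [0, 1]: mu = 0.5 and sigma = 1/sqrt(12); the half-width cannot exceed 0.5.
    half_width = min(0.5, k / math.sqrt(12))
    return 2 * half_width


def chebyshev_lower_bound(k: float) -> float:
    """Lower bound on P(|X - mu| < k*sigma) valid for any distribution with finite variance."""
    return 1 - 1 / k**2


for name, prob in [
    ("normal", normal_within(2)),
    ("exponential", exponential_within(2)),
    ("uniform", uniform_within(2)),
    ("Chebyshev bound", chebyshev_lower_bound(2)),
]:
    print(f"{name:16s} {prob:.4f}")
```

For the uniform case no point can ever be more than $2\sigma$ from the mean, so the "2 std. dev." rule would never flag anything there, while for a normal sample it flags roughly 1 point in 20 even when nothing is wrong.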

There are also many methods for outlier detection, and how well each one works depends on the context, so you cannot say which method to use without looking at the data. Run some experiments with these methods on your data, compare what they flag against samples you know to be real outliers, and then decide which method is appropriate for the data at hand.
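As a rough sketch of such an experiment, the snippet below runs the two rules from the question on one small made-up sample. It assumes the usual quartile-based convention (Tukey's fences, which flag points more than 1.5 IQR below Q1 or above Q3, rather than everything outside Q1 to Q3), and it uses the default "exclusive" quartile method of Python's `statistics.quantiles`; other quartile definitions can move the fences slightly. The function names, thresholds, and sample values are illustrative assumptions, not a recommendation.

```python
import statistics


def zscore_outliers(data, k=2.0):
    """Flag points more than k sample standard deviations from the mean."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [x for x in data if abs(x - mu) > k * sigma]


def iqr_outliers(data, k=1.5):
    """Flag points more than k * IQR below Q1 or above Q3 (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Made-up sample: most values sit between 10 and 13, plus 17 and 40.
sample = [10, 10, 11, 11, 11, 12, 12, 12, 12, 13, 13, 17, 40]

print("mean/std rule:", zscore_outliers(sample))  # [40]
print("IQR rule:     ", iqr_outliers(sample))     # [17, 40]
```

On this sample the value 40 inflates the standard deviation enough that 17 is not flagged by the mean-based rule, while the quartile-based rule, which is less sensitive to extreme values, flags both. Whether 17 should count as an outlier is exactly the kind of question only the context of the data can answer.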
