How do outliers and missing values impact these classifiers?

I am currently working with several classification models for my machine learning class, in particular logistic regression, KNN, Naive Bayes, SVM, and decision trees.

I know how to find and remove missing values and outliers. What I would like to know is which of the above models would perform badly if the outliers and missing values are not removed. In other words, if I decide to leave the outliers and missing values in the dataset, which models should I avoid, and how do I decide?



In general, any model whose decision criterion (i.e., how the model judges which class an input belongs to) depends on the scale of the inputs (i.e., how large or small the numbers are) will be affected by outliers.

For example, models based on exponential functions (like logistic regression) suffer from a problem called vanishing gradients: an input that is extremely large in the positive or negative direction saturates the sigmoid, so its derivative becomes essentially 0 when you run gradient descent on the model's loss function. When this happens, the model stops training properly. This is why logistic regression inputs should always be normalized/standardized appropriately.
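
To make the saturation concrete, here is a minimal NumPy sketch (the input values are made up, with 50 standing in for the product of a weight and an unscaled outlier feature) showing how the sigmoid's derivative collapses toward 0 as the input grows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# A moderate input vs. outlier-sized inputs (z plays the role of w.x
# for a single unscaled feature).
for z in [1.0, 5.0, 50.0]:
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.6f}  gradient = {sigmoid_grad(z):.2e}")

# At z = 50 the gradient is ~2e-22, so the corresponding weight barely
# moves during gradient descent.
```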

KNN can also be affected, because the model depends on a distance metric to judge which neighbors are nearest. That said, if your K is high, the decision boundaries aren't as sharp anyway, since a lone outlier gets outvoted by the other neighbors, so outliers become less of an issue for high K.
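
Here is a small scikit-learn sketch of that effect on hypothetical toy data: a single outlier decides the prediction when K = 1, but gets outvoted when K = 5:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tight clusters, plus one class-1 outlier sitting inside class 0's region.
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4],  # class 0
    [4.0, 4.0], [4.2, 4.1], [4.1, 4.3], [4.3, 4.2], [4.2, 4.4],  # class 1
    [0.5, 0.5],                                                   # class 1 outlier
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

query = np.array([[0.4, 0.4]])  # a point near the outlier, inside class 0's cluster
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k}: predicted class = {knn.predict(query)[0]}")

# k=1 predicts 1 (the outlier is the single nearest neighbor);
# k=5 predicts 0 (the outlier is outvoted by four genuine class-0 neighbors).
```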

Support Vector Machines are prone to errors caused by outliers, and the reason lies in how they work. An SVM builds a decision boundary by finding a separating hyperplane (possibly after mapping the data into a higher-dimensional space via a kernel) that maximizes the margin, i.e., the distance between the boundary and the closest points of each class. You can imagine why outliers are bad in this case: a huge outlier from one class will "pull" the boundary towards itself, and in the process, many data points may end up on the wrong side of the boundary.
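
The following sketch shows this pulling effect with scikit-learn's SVC on made-up data, using a very large C to approximate a hard margin (a smaller, soft-margin C would dampen the effect):

```python
import numpy as np
from sklearn.svm import SVC

# Two separable clusters (hypothetical toy data).
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# The same data plus one class-0 outlier sitting right next to class 1.
X_out = np.vstack([X, [[3.5, 3.5]]])
y_out = np.append(y, 0)

test_point = np.array([[3.0, 3.0]])  # a point near the original boundary

for name, Xi, yi in [("clean", X, y), ("with outlier", X_out, y_out)]:
    svm = SVC(kernel="linear", C=1e6).fit(Xi, yi)  # huge C ~ hard margin
    w, b = svm.coef_[0], svm.intercept_[0]
    pred = svm.predict(test_point)[0]
    print(f"{name:13s} boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0, "
          f"predicts class {pred} for (3, 3)")

# The clean boundary sits midway between the clusters (roughly x1 + x2 = 5),
# so (3, 3) falls on class 1's side. The single outlier drags the boundary
# to roughly x1 + x2 = 7.5, flipping the prediction for (3, 3) to class 0.
```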

Decision trees, on the other hand, are not particularly susceptible to outliers, because their partitioning criteria are based on class proportions rather than on notions of "distance" or "loss". An outlier simply follows the path whose split conditions it satisfies; it does not affect how the other data points are handled.
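
A quick sketch (hypothetical one-feature data, using scikit-learn) shows this: stretching one point out to an extreme value leaves the tree's split threshold untouched, because the split depends only on the sorted order of the values:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature, two classes: class 0 below ~5, class 1 above.
X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The same data, but the largest class-1 value is now an extreme outlier.
X_out = X.copy()
X_out[-1] = 1e6

for name, Xi in [("clean", X), ("with outlier", X_out)]:
    tree = DecisionTreeClassifier(max_depth=1).fit(Xi, y)
    print(f"{name:13s} split threshold: {tree.tree_.threshold[0]}")

# Both trees split at x <= 5.0: moving one point out to 1e6 changes nothing,
# since the best split between the classes is unaffected by its magnitude.
```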

I haven't worked as extensively with Naive Bayes, so I will refrain from discussing it.

As a general rule of thumb, any regression or classification model that involves a loss function or a distance metric will be sensitive to outliers, because an extreme-valued data point will "drown out" the contributions of the other, less extreme data points.
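
With a squared loss, that drowning-out is easy to quantify. In the hypothetical residuals below, a single extreme point accounts for almost all of the total loss:

```python
import numpy as np

# Residuals of a hypothetical fit: nine small errors and one outlier.
residuals = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, 20.0])

squared = residuals ** 2
print(f"outlier's share of the squared loss: {squared[-1] / squared.sum():.1%}")

# ~99.8%: under a squared loss, the one extreme point dominates the gradient,
# so the fit bends to appease it at the expense of the other nine points.
```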
