ML: Classification Model Comparison

I am given a dataset that I need to use for classification, and I want to compare the performance of different classification models. Let's assume I want to look at logistic regression (with different cut-off points) and KNN. Is there anything problematic if I proceed as follows:

  1. Split the data into training and validation data (and a test set for the performance evaluation of the winning model).
  2. Train a logistic regression model and a KNN classification model on the training set. For each cut-off point t between 0 and 1, I treat the logistic regression model as a separate classification model, so the one regression model leads to many classification models.
  3. For a range of t (let's say 0.01 to 0.99), I now compare the classification performance of all my classification models (logistic regression at those values of t, plus KNN) on the validation data. I choose the one with the best performance (based on a certain metric).

I was discussing this with somebody else who argued that t needs to be considered as a hyperparameter and that this parameter needs to be tuned separately. If this is true, why? And what's wrong with my argument above?

Tags: model-selection, logistic-regression, classification

Category: Data Science


I was discussing this with somebody else who argued that t needs to be considered as a hyperparameter and that this parameter needs to be tuned separately.

In your exercise you are actually doing exactly that: by scanning over t on the validation data you are already getting the best t. So I don't think you need anything extra.

What I see missing in your steps -
- No step to find the best K (number of nearest neighbours) for KNN
- No step to optimize the logistic regression parameters with regularization (in case it is needed)

On metrics -
(55 + 45)/100 and (45 + 55)/100 give the same accuracy, even though the correct predictions are distributed differently across the two classes.
You need to be sure what you want: good performance on one particular class, or a balanced metric across both.
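A minimal sketch of that point in Python (the confusion-matrix counts are hypothetical, chosen only for illustration): two models with the same accuracy but very different per-class recall.

```python
import numpy as np

# Hypothetical confusion matrices, layout [[TN, FP], [FN, TP]], 100 samples per class
cm_a = np.array([[90, 10],
                 [40, 60]])
cm_b = np.array([[60, 40],
                 [10, 90]])

for name, cm in [("A", cm_a), ("B", cm_b)]:
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / cm.sum()
    recall_pos = tp / (tp + fn)   # how well class 1 is recovered
    recall_neg = tn / (tn + fp)   # how well class 0 is recovered
    print(f"Model {name}: accuracy={accuracy:.2f}, "
          f"recall(class 1)={recall_pos:.2f}, recall(class 0)={recall_neg:.2f}")
```

Both models print an accuracy of 0.75, but model A recovers class 1 far worse than model B, so which one you prefer depends on which class matters to you.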


DON'T USE ACCURACY! USE PROPER SCORING RULES!

What you propose is related to the area under the receiver operating characteristic curve, ROC AUC. ROC curves plot sensitivity against 1 − specificity at all possible threshold cutoffs.

It sounds like you would pick the model that has the highest accuracy value, regardless of that threshold. If the best accuracy comes from logistic regression with a threshold of $0.6$, go with that model. If the best accuracy comes from KNN with a threshold of $0.07$, go with that model.

That sounds great, right? Picking the most accurate model?

THIS IS INCORRECT, tempting as it sounds. Here are a few blog posts on this topic by a professor at Vanderbilt University and an active member on Cross Validated (the statistics Stack).

https://www.fharrell.com/post/class-damage/

https://www.fharrell.com/post/classification/

(Frank Harrell even has a post about how ROC AUC is flawed for model comparisons.)

Accuracy is a flawed performance metric. Any performance metric based on a threshold has considerable flaws. Please refer to this excellent post on the topic.

Shamelessly, I will link a question I posted on a similar topic that was answered by the same person with the same gist. Here is yet another post of his on this topic.

(I plan to accept that answer but don't want to yet so others might post their thoughts.)

An easy proper scoring rule to get you started is the Brier score, which is basically squared loss. Take the predicted probability of being in class $1$, subtract the true class ($0$ or $1$), square that value, and sum those values over all predictions.

$$Brier(y,\hat{p}) = \sum_{i=1}^N \big(y_i-\hat{p}_i \big)^2$$

$y_i$ is the true class, $0$ or $1$, and $\hat{p}_i$ is the predicted probability (which will most likely be the predicted probability of being in class $1$). You can adjust Brier score if your software gives you the probability of being class $0$.
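As a minimal sketch of that computation in Python (the label and probability arrays are made-up illustration values; scikit-learn's `brier_score_loss` computes the averaged version):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1])                # true classes
p_hat = np.array([0.10, 0.40, 0.35, 0.80, 0.90])  # predicted P(class = 1)

brier_sum = np.sum((y_true - p_hat) ** 2)    # summed squared error, as in the formula above
brier_mean = np.mean((y_true - p_hat) ** 2)  # the conventional (averaged) Brier score

# If your software reports P(class = 0) instead, convert first: p_hat = 1 - p_class_0
print(brier_sum, brier_mean)
```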


In principle you can use your approach.

However, make sure you do not optimize on the test set in step 3. Select the best t using your validation set, then compare the resulting model against KNN, also on the validation set. Finally, only the winning model should be evaluated on the test set.
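A minimal sketch of that workflow, assuming scikit-learn, a synthetic stand-in dataset, and F1 on the validation set as a purely illustrative selection metric:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for your data

# 60/20/20 split into training, validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit both candidate models on the training data only
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Select the best cut-off t for logistic regression on the VALIDATION set
probs_val = logreg.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.01, 1.00, 0.01)
scores = [f1_score(y_val, (probs_val >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]

# Compare the tuned logistic regression against KNN, still on the validation set
logreg_val = max(scores)
knn_val = f1_score(y_val, knn.predict(X_val))

# Only the winning model is evaluated, once, on the held-out test set
if logreg_val >= knn_val:
    probs_test = logreg.predict_proba(X_test)[:, 1]
    test_score = f1_score(y_test, (probs_test >= best_t).astype(int))
else:
    test_score = f1_score(y_test, knn.predict(X_test))
print(best_t, test_score)
```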


I would follow the following procedure:

  1. Split the data into training and test datasets (and also a validation set if you do not want to do k-fold cross-validation).
  2. Train different models using k-fold cross-validation, which also finds the best hyperparameters. One of these hyperparameters could be the discrimination threshold (cut-off point) you asked about; a small sketch follows this list.
  3. Use the models to predict on the test dataset to evaluate their performance on unseen data. Now you can choose the best model.
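As a small sketch of steps 1 and 2, assuming scikit-learn and a synthetic stand-in dataset (the parameter grids and the scoring metric are illustrative choices): k-fold cross-validation on the training data picks K for KNN and the regularization strength for logistic regression. The cut-off point can be tuned in the same cross-validated way; newer scikit-learn versions even provide TunedThresholdClassifierCV for that.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for your data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training set to find the best K for KNN
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 11, 15]},
    cv=5,
    scoring="roc_auc",
).fit(X_train, y_train)

# Same idea for the regularization strength of logistic regression
logreg_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
).fit(X_train, y_train)

print(knn_search.best_params_, knn_search.best_score_)
print(logreg_search.best_params_, logreg_search.best_score_)
# Each search's best estimator is then evaluated once on X_test to pick the final model.
```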

General model selection is a little different, and you need to use a statistical test, as explained in this post.

With regard to the cut-off point, it should be noted that any parameter that is not estimated from the training dataset is considered a hyperparameter.

You can compare the performance of all of your models across different cut-off points, but that is not an efficient way. It is better to compare the models at their best performance: find the setting (i.e., the hyperparameters and cut-off point) at which each model performs best, and then compare the models at their best.
