How to plan a model analysis that avoids overfitting?

Coming from statistics, I'm just starting to learn machine learning. I've read a lot of ML tutorials, but have no formal training.

I'm working on a little project where my dataset has 6k rows and around 300 features.

As I've read in my tutorials, I split my dataset into a training sample (80%) and a testing sample (20%), and then train my algorithm on the training sample with cross-validation (5 folds).
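
In scikit-learn terms, roughly what I'm doing (a sketch; the file and column names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# my dataset: ~6k rows, ~300 mostly-binary features, binary outcome in "target"
df = pd.read_csv("my_data.csv")          # placeholder file name
X, y = df.drop(columns="target"), df["target"]

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 5-fold cross-validation of KNN on the training sample
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```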

When I re-ran my program twice (I've only tested KNN, which I now know is not really appropriate), I got really different results, with different sensitivity, specificity and precision.

I guess that if I keep re-running the program until the metrics are good, my algorithm will be overfitted, and I also guess it would be because of the re-sampling of the test/training samples, but please correct me if I'm wrong.

If I'm going to try a lot of algorithms to see what I can get, should I fix my samples somewhere? Is it even OK to try several algorithms? (It wouldn't always be in statistics.)

In case it matters, I'm working with python's scikit-learn module.

PS: my outcome is binary and my features are mostly binary, with a few categorical and a few numeric ones. I'm thinking about logistic regression, but which algorithm would be the best one?

Topic project-planning machine-learning

Category Data Science


I guess that if I keep re-running the program until the metrics are good, my algorithm will be overfitted

Re-running an algorithm does not contribute to over-fitting. Re-running and selecting the best model is no problem. However, when we want to compare two algorithms, we must average over a large enough number of re-runs. In practice, though, we usually want the best model, not the best algorithm.

I also guess it would be because of the re-sampling of the test/training samples, but please correct me if I'm wrong.

Re-sampling the training set only contributes to what is known as model variance, which denotes the fact that different training samples yield different models.

When I re-ran my program twice (I've only tested KNN, which I now know is not really appropriate), I got really different results

Your observation is natural. A general approach to decreasing the variance of KNN is to increase the parameter $K$: a higher $K$ means KNN looks at more points around a query point (see these plots).
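
For instance, a rough sketch (with a synthetic stand-in of roughly the same shape as your data) to see how much scores fluctuate across re-sampled splits for different values of K:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in with roughly the question's shape: 6k rows, 300 features
X, y = make_classification(n_samples=6000, n_features=300, random_state=0)

for k in (1, 5, 15, 50):
    scores = []
    for seed in range(10):  # repeat the 80/20 split to expose the variance
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        scores.append(KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te))
    print(f"K={k}: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```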

If I'm going to try a lot of algorithms to see what I can get, should I fix my samples somewhere?

Random sampling is OK.
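
That said, if you want successive runs to be comparable, you can pin the split with random_state (a minimal sketch; X, y are your features and outcome):

```python
from sklearn.model_selection import train_test_split

# Fixing random_state makes the 80/20 split reproducible across re-runs,
# so differences between runs come from the algorithms, not from the split;
# stratify=y keeps the class balance of the binary outcome in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```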

In Python, sklearn.model_selection.cross_validate builds and validates K models for a given algorithm and returns K results; assuming we feed 80% of the data to K-fold CV, it splits that 80% into an 80(K-1)/K% training set and an 80/K% validation set, K times.
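
A minimal usage sketch (X_train, y_train being the 80% training sample from above):

```python
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# 5-fold CV: each fold trains on 80% * 4/5 = 64% of all data
# and validates on 80% * 1/5 = 16%
cv_results = cross_validate(
    KNeighborsClassifier(n_neighbors=3),
    X_train, y_train,
    cv=5,
    scoring="accuracy",
)
print(cv_results["test_score"])  # K = 5 validation accuracies, one per fold
```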

In summary: first split the data 80%-20%, run cross-validation on the 80% for each algorithm (algorithms could be 1-NN, 2-NN, SVM, etc.), then select the model with the best [validation] accuracy (set return_estimator=True to get the K fitted models per algorithm from K-fold CV, so for 3 algorithms we are selecting among 3K models), and finally test the best model on the held-out 20% to get the test accuracy; cross-validation has no meaning at this last step since the best model is already built. The final result is the test accuracy.
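
A sketch of that whole procedure (the algorithm list and the synthetic stand-in for your data are just examples):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# synthetic stand-in for the question's data (6k rows, 300 features)
X, y = make_classification(n_samples=6000, n_features=300, random_state=0)

# 1) hold out 20% for the final test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# 2) 5-fold CV on the 80% for each algorithm, keeping the fitted models
algorithms = {
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "2-NN": KNeighborsClassifier(n_neighbors=2),
    "SVM": SVC(),
}
best_model, best_val_acc = None, -np.inf
for name, algo in algorithms.items():
    res = cross_validate(algo, X_train, y_train, cv=5,
                         scoring="accuracy", return_estimator=True)
    print(name, "validation accuracies:", np.round(res["test_score"], 3))
    # 3) select the single best of the 3 * K candidate models
    for est, val_acc in zip(res["estimator"], res["test_score"]):
        if val_acc > best_val_acc:
            best_model, best_val_acc = est, val_acc

# 4) test accuracy of the selected model on the held-out 20% (no re-fitting)
print("test accuracy:", best_model.score(X_test, y_test))
```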

Also, take a look at this answer on train, validation, test sets.

Is it even OK to try several algorithms? (It wouldn't always be in statistics.)

Yes. Always try various algorithms.

A side note on parameter finding

We can apply the above procedure to (1-NN, 2-NN, 3-NN), and if the winning model is consistently selected from the 3-NN algorithm, we can limit further experiments to (3-NN, SVM) instead of all four (1-NN, 2-NN, 3-NN, SVM). Otherwise, if the winner does not consistently come from 3-NN, we should experiment with all four.
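
A sketch of that consistency check (reusing the synthetic X, y from the sketch above; the number of repeated splits is arbitrary):

```python
from collections import Counter
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# count which n_neighbors wins across several repeated 80/20 splits
wins = Counter()
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    best_k, best_acc = None, -1.0
    for k in (1, 2, 3):
        res = cross_validate(KNeighborsClassifier(n_neighbors=k),
                             X_tr, y_tr, cv=5, scoring="accuracy")
        acc = res["test_score"].mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    wins[best_k] += 1

print(wins)  # if 3-NN wins (almost) every time, keep only 3-NN vs SVM afterwards
```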
