How to plan a model analysis that avoids overfitting?
Coming from a statistics background, I'm just starting to learn machine learning. I've read a lot of ML tutorials, but I have no formal training.
I'm working on a small project where my dataset has 6k rows and around 300 features.
Following what I've read in tutorials, I split my dataset into a training sample (80%) and a test sample (20%), and then train my algorithm on the training sample with 5-fold cross-validation.
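Here is a minimal sketch of what I'm currently doing (the data line is just a stand-in with roughly the same shape as my dataset; in reality I load my own 6k-row table):

```python
from sklearn.datasets import make_classification  # stand-in for my real data
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data roughly the shape of my dataset (6k rows, ~300 features, binary outcome)
X, y = make_classification(n_samples=6000, n_features=300, random_state=0)

# 80/20 split, then 5-fold cross-validation on the training sample only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring="recall")
print("sensitivity (recall) per fold:", scores)
```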
When I re-ran my program twice (I've only tested KNN, which I now know is not really appropriate here), I got very different results, with different sensitivity, specificity, and precision each time.
My guess is that if I keep re-running the program until the metrics look good, my model will be overfitted, and that the variation comes from resampling the train/test split on each run, but please correct me if I'm wrong.
If I'm going to try a lot of algorithms to see what I can get, should I fix my samples somewhere? Is it even OK to try several algorithms? (It would not always be in statistics.)
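What I have in mind is something like the following (same placeholder data as above; the particular algorithms and the AUC scoring are just examples, not what I'm committed to):

```python
from sklearn.datasets import make_classification  # same stand-in as above
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=6000, n_features=300, random_state=0)

# Fix the split and the CV folds so every algorithm is evaluated on exactly the same data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```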
In case it matters, I'm working with Python's scikit-learn library.
PS: my outcome is binary and my features are mostly binary, with a few categorical and a few numeric ones. I'm thinking about logistic regression, but which algorithm would be the best one?
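In case it helps clarify what I mean, this is roughly how I would wire logistic regression up with my mixed feature types (the column names are made up; my real data is a pandas DataFrame where most columns are already 0/1):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical column names; in my real DataFrame most columns are 0/1 already
categorical_cols = ["cat_feature_1", "cat_feature_2"]
numeric_cols = ["num_feature_1", "num_feature_2"]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ],
    remainder="passthrough",  # leave the binary 0/1 columns unchanged
)

clf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
# clf can then go straight into cross_val_score with the fixed folds above
```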
Topic project-planning machine-learning
Category Data Science