Cross-validation average score
I am using repeated k-fold cross-validation (`RepeatedKFold(n_splits=10, n_repeats=10, random_state=999)` from sklearn) to obtain reliable scores for a linear regression on my dataset.
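For reference, a minimal sketch of the setup described above (here `X` and `y` are generated placeholders standing in for my actual ~3000-observation dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder data standing in for the real dataset
X, y = make_regression(n_samples=3000, n_features=10, noise=5.0, random_state=0)

cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=999)
scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv)
print(scores)  # 100 r-squared values, one per fold per repeat
```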
The dataset contains some outliers that should stay, since similar cases will appear in future observations. When a model trained on the other folds has to predict such observations, I get negative scores (at least, that is my interpretation).
Question: what should I do with one (or a few) bad score(s) out of many? How should I report them, and how useful would that be?
Using 10 splits and 10 repeats on a dataset of ~3000 observations, I get 100 r-squared scores, almost all in a good range (0.97 to 0.99). There is only one score ruining the game, and it is so bad (-11535) that I cannot even get a meaningful average!
```
[ 9.87345591e-01  9.73912516e-01 ... -1.15353090e+04 ...  9.72986827e-01]
```
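To illustrate why the plain average breaks down, here is a quick check with hypothetical numbers shaped like the array above (99 good scores plus the one catastrophic score):

```python
import numpy as np

# 99 scores around 0.98 plus the single catastrophic one
scores = np.array([0.98] * 99 + [-1.15353090e+04])
print(scores.mean())      # about -114.4: the one outlier dominates the mean
print(np.median(scores))  # 0.98: the median is unaffected by it
```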
What should I do in this case? How should I report it, and/or how can I cure it?
Topic score linear-regression cross-validation python
Category Data Science