Why does cross-validation yield a lower score?
My training accuracy was higher than my test accuracy, so I suspected my model was over-fitting and tried cross-validation. The score degraded further. Does this mean my input data needs further sanitising, or needs to be of better quality?
Please share your thoughts on what could be going wrong here.
My data distribution: Code snippets...
My function get_score:

def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train.values.ravel())
    pred0 = model.predict(X_test)
    return accuracy_score(y_test, pred0)
Logic:

print('*TRAIN* Accuracy Score = ' + str(accuracy_score(y_train, m.predict(X_train))))  # LinearSVC() used
print('*TEST* Accuracy Score = ' + str(accuracy_score(y_test, pred)))  # LinearSVC() used
print("... Cross Validation begins...")
y0 = pd.DataFrame(y)
y0.reset_index(drop=True, inplace=True)
print(X.shape)
print(y0.shape)
kf = KFold(n_splits=10)
e = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y0.iloc[train_index], y0.iloc[test_index]
    print(train_index, test_index)
    e.append(get_score(LinearSVC(random_state=777), X_train, X_test, y_train, y_test))
print("Finally :: " + str(np.mean(e)))
Output:
*TRAIN* Accuracy Score = 0.9451327433628318
*TEST* Accuracy Score = 0.6597345132743363
... Cross Validation begins...
(9040, 6458)
(9040, 1)
[ 904 905 906 ... 9037 9038 9039] [ 0 1 2 ... 901 902 903]
[ 0 1 2 ... 9037 9038 9039] [ 904 905 906 ... 1805 1806 1807]
[ 0 1 2 ... 9037 9038 9039] [1808 1809 1810 ... 2709 2710 2711]
[ 0 1 2 ... 9037 9038 9039] [2712 2713 2714 ... 3613 3614 3615]
[ 0 1 2 ... 9037 9038 9039] [3616 3617 3618 ... 4517 4518 4519]
[ 0 1 2 ... 9037 9038 9039] [4520 4521 4522 ... 5421 5422 5423]
[ 0 1 2 ... 9037 9038 9039] [5424 5425 5426 ... 6325 6326 6327]
[ 0 1 2 ... 9037 9038 9039] [6328 6329 6330 ... 7229 7230 7231]
[ 0 1 2 ... 9037 9038 9039] [7232 7233 7234 ... 8133 8134 8135]
[ 0 1 2 ... 8133 8134 8135] [8136 8137 8138 ... 9037 9038 9039]
Finally :: 0.32499999999999996
Edit -1- Adding values of "e"
[0.08075221238938053, 0.413716814159292, 0.05752212389380531, 0.15376106194690264, 0.14712389380530974, 0.4668141592920354, 0.6946902654867256, 0.7112831858407079, 0.33738938053097345, 0.18694690265486727]
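The wide spread of those per-fold scores suggests my rows may be grouped by class, so an unshuffled KFold tests on labels the model barely (or never) saw in training. A minimal, self-contained sketch on synthetic data (sizes and parameters are illustrative assumptions, not my real dataset) reproduces the effect:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: illustrative only, not my real X/y).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=10, random_state=0)
order = np.argsort(y)          # sort rows so each class sits in one block
X, y = X[order], y[order]

clf = LinearSVC(random_state=777)
# Unshuffled 10-fold: each test fold is one class block, absent from training.
plain = cross_val_score(clf, X, y, cv=KFold(n_splits=10))
# Shuffled 10-fold: every class appears in every training split.
mixed = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True,
                                            random_state=0))
print("unshuffled:", plain.mean())   # near 0: test classes unseen in training
print("shuffled:  ", mixed.mean())
```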
Edit -2- Adding shuffle=True parameter to KFold()
Result:
[ 0 1 2 ... 9037 9038 9039] [ 4 5 10 ... 9007 9011 9024]
[ 0 1 2 ... 9037 9038 9039] [ 21 43 44 ... 9018 9035 9036]
[ 0 2 3 ... 9037 9038 9039] [ 1 20 60 ... 9023 9031 9034]
[ 0 1 2 ... 9036 9037 9038] [ 6 25 27 ... 9010 9025 9039]
[ 0 1 2 ... 9037 9038 9039] [ 15 16 28 ... 9029 9030 9033]
[ 0 1 2 ... 9037 9038 9039] [ 3 12 40 ... 9015 9017 9028]
[ 0 1 2 ... 9037 9038 9039] [ 7 8 23 ... 9013 9014 9027]
[ 0 1 3 ... 9035 9036 9039] [ 2 18 19 ... 9019 9037 9038]
[ 0 1 2 ... 9037 9038 9039] [ 24 37 39 ... 9012 9016 9026]
[ 1 2 3 ... 9037 9038 9039] [ 0 9 14 ... 9020 9021 9032]
[0.6504424778761062, 0.6736725663716814, 0.6969026548672567, 0.6692477876106194, 0.6769911504424779, 0.6382743362831859, 0.6692477876106194, 0.6648230088495575, 0.6648230088495575, 0.6814159292035398]
Finally :: 0.6685840707964601
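Edit -3- As a more compact alternative (a sketch on synthetic data, since my actual X/y are not shown here), the manual loop plus get_score collapses into cross_val_score, and StratifiedKFold both shuffles and keeps class proportions equal in every fold, which is usually preferable for multiclass targets:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Illustrative data (assumption: a stand-in for the real X, y).
X, y = make_classification(n_samples=1000, n_informative=8, n_classes=10,
                           random_state=0)

# StratifiedKFold shuffles AND preserves label proportions per fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=777)
scores = cross_val_score(LinearSVC(random_state=777), X, y, cv=cv)
print(len(scores), scores.mean())
```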
Tags: prediction, multiclass-classification, cross-validation
Category: Data Science