How to use the eval set in CatBoost appropriately?

Let's say you have a dataset and you split it into 80% training and 20% testing. Naturally, you want to find the optimal hyperparameters for your model, so you plan to run cross-validation on the training set and search the parameter space.

CatBoost has something called the eval set, which is used to help avoid overfitting, but I have a fundamental question about how to use it appropriately.

Say you do 10-fold CV. Now we have 10 iterations in which 90% of the TRAINING dataset is used to fit a model that predicts on the remaining 10%.

In other words, is it fair, in every CV iteration, to use the 10% that is not being fit on as the eval set to avoid overfitting, and then still predict on that same 10% and report the result? Or have we cheated by making the held-out 10% of the training data the eval set, because we stopped training early as a function of that very set?

The same concern applies to the test set. After finding the optimal hyperparameters, can I use the test set as the eval set while training the final model? Or is this again cheating?

See the following pseudocode if the description is unclear:

from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

for trial in range(num_cv_trials):
    clf = CatBoostClassifier()
    # the held-out 10% of the training data doubles as the eval set
    clf.fit(cv_iteration_training_features, cv_iteration_training_labels,
            eval_set=(cv_iteration_testing_features, cv_iteration_testing_labels))

    preds = clf.predict(cv_iteration_testing_features)
    fold_f1 = f1_score(cv_iteration_testing_labels, preds)
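For context, here is a more complete, self-contained sketch of the two scenarios I am asking about. The synthetic data, scikit-learn's StratifiedKFold, and all parameter values here are illustrative assumptions, not my real pipeline:

import numpy as np
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# illustrative data; 80/20 split, test set held out from all tuning
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# scenario 1: each CV fold's held-out 10% serves both as eval set
# (driving early stopping) and as the set we score and report
fold_scores = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_train, y_train):
    clf = CatBoostClassifier(iterations=500, verbose=False)
    clf.fit(X_train[train_idx], y_train[train_idx],
            eval_set=(X_train[val_idx], y_train[val_idx]),
            early_stopping_rounds=50)
    preds = clf.predict(X_train[val_idx])
    fold_scores.append(f1_score(y_train[val_idx], preds))
print("mean CV F1:", np.mean(fold_scores))

# scenario 2: after tuning, fit the final model on the full training set,
# using the untouched 20% test set as the eval set
final_clf = CatBoostClassifier(iterations=500, verbose=False)
final_clf.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=50)
print("test F1:", f1_score(y_test, final_clf.predict(X_test)))

In both places the set being scored is the same set that drove early stopping, which is exactly the part I am unsure about.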

Tags: catboost, theory, supervised-learning, machine-learning
