Why is my validation score so much higher using TargetEncoder?

I'm experimenting with an XGBoost model, encoding the categorical variables with the target encoder from the category_encoders library. The code below shows how I split the dataset and fit the target encoder.

    import category_encoders as ce
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the data as a final test set
    X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2,
                                                        random_state=70)

    # Fit the target encoder on the training split only
    ce_enc = ce.TargetEncoder()
    X_train[encode_name_lst] = ce_enc.fit_transform(X_train[encode_name_lst], y_train)
    X_test[encode_name_lst] = ce_enc.transform(X_test[encode_name_lst])

Now when I train on this dataset with cross-validation, I see very good scores on the validation folds (an AUC of ~0.92), but the test set only reaches an AUC of ~0.6. I guess this is because the validation folds are drawn from the training set that the target encoder was already fitted on?

Is there a way to fit the TargetEncoder so that I get comparable results on the test set as well?

Topic target-encoding xgboost

Category Data Science
