Why is my validation score so much higher using TargetEncoder?
I'm experimenting with an XGBoost model, encoding the categorical variables with TargetEncoder from the category_encoders library. The code below shows how I split the dataset and fit the target encoder:
import category_encoders as ce
from sklearn.model_selection import train_test_split

# hold out a test set, then fit the encoder on the training portion only
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2,
                                                    random_state=70)
ce_enc = ce.TargetEncoder()
X_train[encode_name_lst] = ce_enc.fit_transform(X_train[encode_name_lst], y_train)
X_test[encode_name_lst] = ce_enc.transform(X_test[encode_name_lst])
Now when I train on this dataset with cross-validation I see very good scores on the validation folds (AUC ≈ 0.92), but the test set only reaches an AUC of ≈ 0.6. I guess this is because the validation folds are drawn from the same training set the target encoder was already fitted on?
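For what it's worth, this gap can be reproduced with pure noise. The sketch below (my own minimal example, not your code: it uses a hand-rolled per-category target mean standing in for TargetEncoder, and LogisticRegression standing in for XGBoost, so it runs without extra libraries) encodes a high-cardinality categorical feature using the whole training set's targets and then cross-validates — the CV folds look predictive even though the feature is random, while the untouched test set does not:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 2000
# "cat" is pure noise: 500 categories, target is an independent coin flip
df = pd.DataFrame({"cat": rng.integers(0, 500, n)})
y = pd.Series(rng.integers(0, 2, n))

X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2, random_state=70)

# leaky step: per-category target mean computed on the *entire* training set
# (this is what a target encoder does, before smoothing), so every row's own
# label leaks into its encoded value
means = y_tr.groupby(X_tr["cat"]).mean()
X_tr_enc = X_tr["cat"].map(means).to_frame()
X_te_enc = X_te["cat"].map(means).fillna(y_tr.mean()).to_frame()

# cross-validation on the pre-encoded training set looks great...
cv_auc = cross_val_score(LogisticRegression(), X_tr_enc, y_tr,
                         cv=5, scoring="roc_auc").mean()

# ...but the held-out test set stays near chance level
clf = LogisticRegression().fit(X_tr_enc, y_tr)
test_auc = roc_auc_score(y_te, clf.predict_proba(X_te_enc)[:, 1])
print(f"CV AUC: {cv_auc:.3f}, test AUC: {test_auc:.3f}")
```

With only a few samples per category, each row's encoded value is dominated by its own label, so the validation folds "see" their own targets through the encoding.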
Is there a way to fit the TargetEncoder better so I can get these results on the test set as well?
Topic target-encoding xgboost
Category Data Science