Mean encoding With KFold regularization

I just learned that regularizing the mean encoding reduce the leakage hence generalize better than mean encoding without it but I made 2 submissions with XGB in predict future sales competition on Kaggle with the naive mean encoding method and got RMSE = 1.152 and with 5 folds validation and got RMSE = 1.154 which was a surprise for my. Can any one explain why this may happen ? also after making the kfolds regularization every item_id has multiple mean encoding one for each fold so should I leave the train set in this format or take their mean ? also when applying this to the test should I take the mean and apply it or what ?

I'm encoding the item_id with the target which is the number of sold items :

Naive method :

all_data['item_target_enc'] = all_data.groupby('item_id') ['item_cnt_month'].transform('mean')

Regularization :

kf = KFold(n_splits = 5 , shuffle = False)

for train_ind, val_ind in kf.split(all_data) :

train_data , val_data = all_data.iloc[train_ind], all_data.iloc[val_ind]

all_data.loc[all_data.index[val_ind] , 'item_target_enc'] = val_data['item_id'].map(train_data.groupby('item_id')['item_cnt_month'].mean())


all_data['item_target_enc'].fillna(all_data.item_cnt_month.mean(), inplace=True) # fill na that happened from values in validation not in the train with the global mean

Here is the encoding result with regularization for each item_id i got an array which was confusing for my to take the mean or not :

Topic feature-engineering xgboost encoding python machine-learning

Category Data Science


Try this.

item_target_enc= df.groupby(['item_id'])['item_cnt_month'].mean().to_dict() 
  
df['item_id'] =  df['item_id'].map(item_target_enc) 

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.