How do tree-based methods deal with missing feature columns?

Question

Maths12

November 2, 2020, 14:37

All,

I have trained a model using XGBoost. Some of the features are one-hot encoded, e.g. currency, which is either GBP or USD. When I output the feature importances, GBP and USD were in 7th and 8th place respectively.

Now I would like to use the model to predict whether someone is a defaulter or not on Australian data. However, the currency for these records is AUD, so when I apply my feature engineering, one-hot encoding will create an AUD column.

Since my model doesn't have AUD as a feature, how does it handle features it has not seen before? I am not clear on this.

Topic dummy-variables one-hot-encoding xgboost decision-trees

Category Data Science

Noah Weber · Accepted Answer · November 2, 2020, 14:37

You can handle unknown categories with the handle_unknown parameter of sklearn.preprocessing.OneHotEncoder.

Best practice for any type of encoding:

Fit the encoder on the training data only, and when encoding test data, reuse that same fitted encoder rather than fitting a new one.

E.g. sklearn.preprocessing.OneHotEncoder works this way, and it has a parameter called handle_unknown:

handle_unknown{‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

The best option here is to set this parameter to 'ignore', so that the unknown category is ignored instead of raising an error, until you eventually retrain your model on data that includes the new category.

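from sklearn.preprocessing import OneHotEncoder

# Unknown categories seen at transform time are encoded as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')

# Fit the encoder on the training data only
train = ohe.fit_transform(train)

# Reuse the same fitted encoder on the test data
test = ohe.transform(test)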
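To make the behaviour concrete for the currency case in the question, here is a minimal sketch. The toy arrays and the variable names (train, new_data, encoded) are made up for illustration and are not from the question; only OneHotEncoder and handle_unknown come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical column holding the currency.
train = np.array([["GBP"], ["USD"], ["GBP"]])
new_data = np.array([["USD"], ["AUD"]])  # AUD never appeared in training

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)  # learn the categories from the training data only

# The default output is a sparse matrix; convert to dense for inspection.
encoded = ohe.transform(new_data).toarray()

print(ohe.categories_[0])  # categories learned during fit
print(encoded)             # one row per record, one column per learned category
```

The encoded output keeps exactly the two columns learned during fit (GBP and USD), so its shape matches what the model was trained on. The AUD row comes out as all zeros, which means the model sees that record as "neither GBP nor USD" rather than receiving a new AUD column.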
