Binary Classification - One-Hot Encoding Preventing Me from Using the Test Set

I have a preprocessing pipeline that includes imputing missing values and one-hot encoding the categorical variables.

When I try to use my model on the test set, it complains that the number of columns differs from what it expects. This is due to the one-hot encoding.

One option I considered was passing the full dataset into the pipeline and then separating it into train and test sets. However, this causes data leakage, as the missing-value imputation captures information from the test set.

Please let me know how to prevent this.

Thanks,

Topic: one-shot-learning, encoding, machine-learning

Category: Data Science


You can use the handle_unknown parameter of sklearn's OneHotEncoder when encoding the training data.

sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')

When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.

Note: I assumed you are using scikit-learn.

Source: sklearn.preprocessing
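
For example, here is a minimal sketch of how this could look in a full preprocessing pipeline (the column names "city" and "age", the imputation strategies, and the toy data are assumptions for illustration): fit the pipeline on the training data only and reuse it to transform the test set.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy train/test frames with a missing value and an unseen category "c"
df_train = pd.DataFrame({"city": ["a", "b", None], "age": [25, 30, 22]})
df_test = pd.DataFrame({"city": ["a", "c", "b"], "age": [40, None, 35]})

categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("cat", categorical, ["city"]),
    ("num", SimpleImputer(strategy="mean"), ["age"]),
])

X_train = preprocess.fit_transform(df_train)  # fit on the training data only
X_test = preprocess.transform(df_test)        # same columns; "c" becomes all zeros

print(X_train.shape, X_test.shape)            # (3, 3) (3, 3)

Because the encoder is fitted only on the training data, the test set is transformed into exactly the same columns, and no information from the test set leaks into the imputation or the encoding.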


You need to apply the one-hot encoding before you split your data. Otherwise you will run into problems if a categorical attribute has values that are not present in both the train and test data.

This is a bit of a guess since I do not know what your data looks like, but it might be what happened in your case. Here is a simple example. Suppose you have the following data sets obtained from your split before one-hot encoding:

Train data:
     attribute_1
1        a
2        b

Test data:
     attribute_1
1        a
2        b
3        c

If you apply one-hot-encoding to these data sets separately you will end up with the following:

Train data:
     attribute_1_a     attribute_1_b
1        1                   0
2        0                   1

Test data:
     attribute_1_a     attribute_1_b     attribute_1_c
1        1                   0                 0
2        0                   1                 0
3        0                   0                 1

As you can see, the columns of your train and test data no longer match. This can be solved by one-hot encoding before splitting into train and test data.

And for the one-hot-encoding I do not see any problems with data leakage.
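
As a minimal sketch (the column name "attribute_1", the target "label", and the toy data are assumptions for illustration), encoding with pandas.get_dummies before splitting guarantees that both sets share the same columns:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "attribute_1": ["a", "b", "a", "c", "b"],
    "label": [0, 1, 0, 1, 1],
})

# one-hot encode on the full dataset, then split
encoded = pd.get_dummies(df, columns=["attribute_1"])
X = encoded.drop(columns="label")
y = encoded["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print(list(X_train.columns) == list(X_test.columns))  # True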

EDIT (based on your comment)

Alternatively, e.g. if you have missing data which you want to impute before one-hot encoding, you can split the data first and then "manually" make sure that both datasets have the same attributes.

For example like this:

import pandas as pd

# create example dataframes
df_train = pd.DataFrame({
    "attribute_1_a": [1, 0],
    "attribute_1_b": [0, 1]
})

df_test = pd.DataFrame({
    "attribute_1_a": [1, 0, 0],
    "attribute_1_b": [0, 1, 0], 
    "attribute_1_c": [0, 0, 1]
})

# add missing columns to test dataset with all values being 0
for i in df_train.columns:
    if i not in df_test.columns: df_test[i] = 0

# add missing columns to train dataset with all values being 0
for i in df_test.columns:
    if i not in df_train.columns: df_train[i] = 0

# use the same column order for the test set as for train
df_test = df_test.reindex(df_train.columns, axis=1)

Now the dataframes will look like this and have the same attributes:

In: df_train

Out: 
   attribute_1_a  attribute_1_b  attribute_1_c
0              1              0              0
1              0              1              0

In: df_test

Out: 
   attribute_1_a  attribute_1_b  attribute_1_c
0              1              0              0
1              0              1              0
2              0              0              1

However, check your datasets after this manipulation to make sure it went through properly and you do not have any inconsistencies!
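
For example, a couple of quick checks along those lines (a sketch, continuing the df_train / df_test example above):

# both datasets should have identical columns in the same order,
# and the indicator columns should contain only 0s and 1s
assert list(df_train.columns) == list(df_test.columns), "column mismatch"
assert df_train.isin([0, 1]).all().all(), "unexpected values in train"
assert df_test.isin([0, 1]).all().all(), "unexpected values in test"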
