Label encoding with predefined classes

I have trained a Random Forest model and now want to use it to make predictions on data from a particular day. One column is categorical and, over the whole training period, takes a known set of values (say a, b, c, d, e).

On a particular day, only some of those values appear (say b and d). To one-hot encode the column I am using LabelEncoder followed by OneHotEncoder. But if I fit the label encoder on that day's column, it labels only 'b' and 'd' (say as 1 and 2), and the one-hot vector has length 2.

What I need is this: if the training data labelled (a, b, c, d, e) as (1, 2, 3, 4, 5), then (b, d) should be labelled (2, 4) and the one-hot vector should have length 5.

What I am doing is saving the label encoder fitted during training and using it to label-encode the column for that day. But I am not getting the expected results. Am I doing this the right way?

I pass the length of onehot_train as n_values when encoding the one-day data. I am using the scikit-learn LabelEncoder and OneHotEncoder.

That is my main question. A second one: suppose a category appears that was not seen during training. How should I handle it? Should I treat all new categories as a single 'unknown' category and give them the same one-hot encoding, or is there a better method?

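from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def get_onehot(arr):
    # Fit the label encoder and the one-hot encoder on the training column.
    label_enc = LabelEncoder()
    onehot_enc = OneHotEncoder(sparse=False)
    int_enc = label_enc.fit_transform(arr)
    int_enc = int_enc.reshape(len(int_enc),1)
    onehot_train = onehot_enc.fit_transform(int_enc)
    return onehot_train,label_enc

def get_onehot_per_day(arr_perday, label_enc_train, length_onehot_train):
    # Note: n_values only exists in older scikit-learn releases (removed in 0.22).
    # A new OneHotEncoder is created and fitted here, separately from training.
    onehot_enc = OneHotEncoder(sparse=False,n_values=length_onehot_train)
    int_enc = label_enc_train.transform(arr_perday)
    int_enc = int_enc.reshape(len(int_enc),1)
    onehot_per_day = onehot_enc.fit_transform(int_enc)
    return onehot_per_day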

Topic labels feature-construction machine-learning

Category Data Science


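It is unclear how the functions are being called, but each one creates its own OneHotEncoder and calls fit_transform, so the per-day data is encoded by a second, freshly fitted encoder. There should be only one OneHotEncoder instance: fit it once on the training data, then call only transform on later data.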
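As a minimal sketch of the single-encoder idea, assuming scikit-learn 0.20 or newer (where OneHotEncoder accepts string categories directly, so LabelEncoder is not needed for the feature column): the arrays train_col and day_col are made-up illustration data, not the asker's. Setting handle_unknown="ignore" is one possible way to deal with unseen categories; it encodes them as an all-zero row rather than as a dedicated 'unknown' column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the training column covers every category,
# the one-day column only contains some of them.
train_col = np.array(["a", "b", "c", "d", "e"]).reshape(-1, 1)
day_col = np.array(["b", "d"]).reshape(-1, 1)

# Fit ONE encoder on the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_col)

# Reuse the fitted encoder: transform, never fit again.
# The default output is a sparse matrix, so convert it for inspection.
day_onehot = encoder.transform(day_col).toarray()
print(day_onehot.shape)  # (2, 5): width comes from the training categories

# A category never seen in training becomes a row of zeros.
unseen = encoder.transform(np.array([["z"]])).toarray()
print(unseen)
```

To reuse the encoder on a later day in a separate process, the fitted object can be saved and loaded with a tool such as joblib or pickle.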

It appears you are using scikit-learn. Scikit-learn has Pipelines that handle this type of issue automatically. If you switch from custom functions to a Pipeline, the encoding fitted on the training data is applied unchanged to both the training and test data sets.
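A rough sketch of what the Pipeline approach could look like, assuming scikit-learn 0.20 or newer and pandas: the column names "cat" and "num", the toy data, and the choice of RandomForestClassifier (rather than a regressor) are all assumptions made for illustration, since the question does not show the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data covering all five categories.
train = pd.DataFrame({
    "cat": ["a", "b", "c", "d", "e", "a", "b", "c"],
    "num": [1.0, 2.0, 3.0, 4.0, 5.0, 1.5, 2.5, 3.5],
})
y_train = [0, 1, 0, 1, 0, 0, 1, 0]

# Hypothetical one-day data: a subset of categories plus an unseen one ("z").
day = pd.DataFrame({"cat": ["b", "d", "z"], "num": [2.0, 4.0, 6.0]})

# One-hot encode the categorical column, pass the numeric column through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat"])],
    remainder="passthrough",
)

# The pipeline fits the encoder and the forest together on training data.
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(train, y_train)

# predict() only transforms with the already fitted encoder.
day_pred = model.predict(day)

# The encoded one-day data keeps the training width: 5 one-hot columns + 1 numeric.
encoded_day = model.named_steps["preprocess"].transform(day)
print(encoded_day.shape)
```

Because the encoder lives inside the fitted pipeline, saving the pipeline object saves the category mapping along with the model, so there is no separate encoder to keep in sync.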
