Label encoding with predefined classes
I have trained a model (Random Forest) and now I would like to use it to make predictions on data from a particular day. I have a categorical column that takes several values (say a, b, c, d, e) over the training period.
On a particular day, only some of those values are present (say b, d). To one-hot encode them, I am using LabelEncoder followed by OneHotEncoder. But if I fit the label encoder on that day's column, it labels only 'b' and 'd' (say 1, 2), and the one-hot vector has length 2.
What I need is this: if the training data labelled (a, b, c, d, e) as (1, 2, 3, 4, 5), then (b, d) should be labelled as (2, 4) and the one-hot vector should have length 5.
What I am doing is saving the label encoder fitted during training and using it to label-encode the column for that day. But I am not getting proper results. Am I doing it the right way? I have passed the length of onehot_train as n_values for the one-day data. I have used sklearn's LabelEncoder and OneHotEncoder.
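To make the intended behaviour concrete, here is a minimal sketch of reusing a label encoder fitted on the training column so that the per-day one-hot vectors keep the training width. The arrays `train_col` and `day_col` are made-up illustration data, and indexing an identity matrix is just one assumed way to build the one-hot rows, not necessarily what the code in this question does. Note that LabelEncoder assigns 0-based codes, so (b, d) come out as (1, 3) rather than (2, 4).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training column containing all five categories.
train_col = np.array(["a", "b", "c", "d", "e", "b", "a"])
# Hypothetical single-day column containing only two of them.
day_col = np.array(["b", "d", "b"])

# Fit once on the training data; classes_ holds the sorted training categories.
label_enc = LabelEncoder().fit(train_col)

# Only transform (never re-fit) the per-day data, so codes match training.
day_int = label_enc.transform(day_col)

# Pick rows of an identity matrix; the width is fixed by the training classes.
n_classes = len(label_enc.classes_)
day_onehot = np.eye(n_classes, dtype=int)[day_int]

print(day_int)           # integer codes for the day's values
print(day_onehot.shape)  # (number of rows, number of training classes)
```

The key point in this sketch is that `fit` is called only on the training data; calling `fit_transform` again on the per-day column would relearn the classes from that day alone.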
That is my main question. A second one: suppose I encounter a new category that I did not see during training. How should I proceed? Should I treat all new categories as a single 'unknown' category and give them the same one-hot encoding, or is there a better method?
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # Note: sparse= and n_values= are arguments of older scikit-learn releases.
    def get_onehot(arr):
        # Fit both encoders on the training column.
        label_enc = LabelEncoder()
        onehot_enc = OneHotEncoder(sparse=False)
        int_enc = label_enc.fit_transform(arr)
        int_enc = int_enc.reshape(len(int_enc), 1)
        onehot = onehot_enc.fit_transform(int_enc)
        return onehot, label_enc

    def get_onehot_per_day(arr_perday, label_enc_train, length_onehot_train):
        # Reuse the training label encoder; n_values fixes the one-hot width.
        onehot_enc = OneHotEncoder(sparse=False, n_values=length_onehot_train)
        int_enc = label_enc_train.transform(arr_perday)
        int_enc = int_enc.reshape(len(int_enc), 1)
        onehot = onehot_enc.fit_transform(int_enc)
        return onehot
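For the unseen-category question, one possible option (a sketch under assumptions, not necessarily the best choice for this model) is to let OneHotEncoder work on the raw values directly with `handle_unknown="ignore"`, which encodes any category not seen during fitting as an all-zero row. The data below is invented for illustration, and `.toarray()` is used only to get a dense array from the encoder's default sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column (2-D, one feature) with five categories.
train_col = np.array(["a", "b", "c", "d", "e"]).reshape(-1, 1)

# Fit once on training data; unseen categories will be ignored at transform time.
onehot_enc = OneHotEncoder(handle_unknown="ignore")
onehot_enc.fit(train_col)

# Hypothetical per-day column; 'z' never appeared during training.
day_col = np.array(["b", "z", "d"]).reshape(-1, 1)
day_onehot = onehot_enc.transform(day_col).toarray()

print(onehot_enc.categories_[0])  # categories learned from training
print(day_onehot)                 # row for 'z' contains only zeros
```

With this approach every unknown value shares the same all-zero encoding, which behaves like a single implicit 'unknown' bucket without adding an extra column.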
Topic labels feature-construction machine-learning
Category Data Science