Changing order of LabelEncoder() result

Assume I have a multi-class classification task. The labels are:

  • Class 1
  • Class 2
  • Class 3

After LabelEncoder(), the labels are transformed into 0-1-2.

My questions are:

  • Do the labels have to start from 0?
  • Do the labels have to be sequential?
  • What happens if I replace every label 0 with 3 before training, so that my labels are 1-2-3 instead of 0-1-2?
  • If the labels were numeric such as 10-100-1000, will I still have to use LabelEncoder() to encode them into 0-1-2?



Do the labels have to start from 0?

  • No, it doesn't matter where they start, as long as each class has a distinct value.
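To make this concrete, here is a minimal sketch, assuming a scikit-learn classifier (a DecisionTreeClassifier on made-up toy data, chosen only for illustration). It fits the same model once on raw labels 10-100-1000 and once on their LabelEncoder() output 0-1-2. The data and variable names are my own assumptions, not something from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumed for illustration): one feature, three well-separated classes.
X = np.array([[0.0], [0.2], [5.0], [5.2], [10.0], [10.2]])
y_raw = np.array([10, 10, 100, 100, 1000, 1000])

# LabelEncoder sorts the distinct labels and maps them to 0..n_classes-1.
y_encoded = LabelEncoder().fit_transform(y_raw)

# Fit the same model on the raw labels and on the encoded labels.
clf_raw = DecisionTreeClassifier(random_state=0).fit(X, y_raw)
clf_encoded = DecisionTreeClassifier(random_state=0).fit(X, y_encoded)

# Each classifier predicts in the label space it was trained on.
pred_raw = clf_raw.predict(X)
pred_encoded = clf_encoded.predict(X)

print(clf_raw.classes_)   # the distinct raw labels the model learned
print(pred_raw, pred_encoded)
```

Both models separate the classes in the same way; only the names of the classes differ. This sketch only covers scikit-learn, and other libraries may be stricter about the label range.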

Do the labels have to be sequential?

  • It depends on the feature. If a feature's values express an order of magnitude, such as small < big < vast, then the order matters; these are called ordinal features. If the values represent, for example, countries, there is no inherent order, so one should probably use OneHotEncoder so that the categories are equidistant in feature space.
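The two cases above could be sketched like this, assuming scikit-learn's OrdinalEncoder for the ordered feature and OneHotEncoder for the unordered one. The country names and variable names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: pass the order explicitly, otherwise the categories
# would be sorted alphabetically (big < small < vast).
sizes = np.array([["small"], ["vast"], ["big"]])
ordinal = OrdinalEncoder(categories=[["small", "big", "vast"]])
size_codes = ordinal.fit_transform(sizes)

# Nominal feature: one column per category, exactly one 1 per row,
# so every pair of distinct categories is the same distance apart.
countries = np.array([["France"], ["Japan"], ["Brazil"]])
onehot = OneHotEncoder()
country_matrix = onehot.fit_transform(countries).toarray()

print(size_codes.ravel())
print(country_matrix)
```

Passing the category order by hand is a deliberate choice here: the encoder cannot know that "small" comes before "big" unless it is told.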

What happens if I replace every label 0 with 3 before training, so that my labels are 1-2-3 instead of 0-1-2?

  • Besides the previous point, one should consider the type of model that will be used. Tree-based models such as RandomForest work very well with categorical data, and the numerical value assigned to a category can be arbitrary. This is not the case for linear models.
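A rough way to see the difference, as a sketch on made-up data: encode one categorical feature in two ways (0-1-2, and the same codes with every 0 replaced by 3), then compare a tree and a linear model on each. The target values, the choice of DecisionTreeRegressor and LinearRegression, and the helper training_r2 are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Three categories whose target values are 1, 2 and 3, repeated four times.
target = np.array([1.0, 2.0, 3.0] * 4)

# Coding A: categories encoded as 0, 1, 2.
codes_a = np.array([0, 1, 2] * 4).reshape(-1, 1)
# Coding B: the same feature with every 0 replaced by 3.
codes_b = np.where(codes_a == 0, 3, codes_a)

def training_r2(model, X):
    """Fit on (X, target) and return R^2 on the same data (hypothetical helper)."""
    return model.fit(X, target).score(X, target)

tree_a = training_r2(DecisionTreeRegressor(random_state=0), codes_a)
tree_b = training_r2(DecisionTreeRegressor(random_state=0), codes_b)
linear_a = training_r2(LinearRegression(), codes_a)
linear_b = training_r2(LinearRegression(), codes_b)

print(tree_a, tree_b)      # the tree is unaffected by the recoding
print(linear_a, linear_b)  # the linear fit degrades under coding B
```

The tree only needs thresholds that separate the codes, so any set of distinct values works. The linear model treats the codes as magnitudes, so moving one category from 0 to 3 changes what it can fit.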

In closing, if you want to convert a categorical feature to numerical values, you should consider two things: the feature's values (are they ordinal?) and the type of model.

P.S. There are many ways to convert categorical features to numerical ones that can improve model performance, such as target encoding techniques, which have been shown to improve tree-based classifiers as well, but perhaps this is a conversation for another time :)
