Difference between OrdinalEncoder and LabelEncoder

I was going through the official documentation of scikit-learn after reading a book on ML and came across the following:

The documentation describes sklearn.preprocessing.OrdinalEncoder(), whereas the book described sklearn.preprocessing.LabelEncoder(). When I checked their functionality, it looked the same to me. Can someone please tell me the difference between the two?

Topic preprocessing encoding scikit-learn python machine-learning

Category Data Science


As for differences in the OrdinalEncoder and LabelEncoder implementations, the accepted answer mentions the shape of the data:

  • OrdinalEncoder is for 2D data with the shape (n_samples, n_features)
  • LabelEncoder is for 1D data with the shape (n_samples,)

Maybe that's why the top-voted answer suggests OrdinalEncoder is for the "features" (often a 2D array), whereas LabelEncoder is for the "target variable" (often a 1D array).

That's also why an OrdinalEncoder raises an error if you try to fit it on 1D data: OrdinalEncoder().fit(['a','b'])

ValueError: Expected 2D array, got 1D array instead:
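
If you only have 1D data, the usual fix is to reshape it into a single-feature 2D column before fitting. A minimal sketch (assuming NumPy):

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    X = np.array(['a', 'b'])   # 1D, shape (2,)
    X_2d = X.reshape(-1, 1)    # 2D, shape (2, 1): one column = one feature

    OrdinalEncoder().fit_transform(X_2d)
    # >>> array([[0.],
    #            [1.]])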

Another difference between the encoders is the name of their learned parameter:

  • LabelEncoder learns classes_
  • OrdinalEncoder learns categories_

Notice the differences when fitting LabelEncoder vs OrdinalEncoder, and the differences in the values of the learned parameters.

  • LabelEncoder.fit(...) accepts a 1D array; LabelEncoder.classes_ is 1D
  • OrdinalEncoder.fit(...) accepts a 2D array; OrdinalEncoder.categories_ is 2D.

    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

    LabelEncoder().fit(['a','b']).classes_
    # >>> array(['a', 'b'], dtype='<U1')

    OrdinalEncoder().fit([['a'], ['b']]).categories_
    # >>> [array(['a', 'b'], dtype=object)]

This is consistent with the idea that

LabelEncoder should be used to encode target values, i.e. y, and not the input X.

Other encoders that work in 2D, including OneHotEncoder, also use the attribute categories_.
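
For example (a quick sketch to illustrate):

    from sklearn.preprocessing import OneHotEncoder

    OneHotEncoder().fit([['a'], ['b']]).categories_
    # >>> [array(['a', 'b'], dtype=object)]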

More info here about the dtype <U1 (little-endian, Unicode, length 1; i.e. a string with length 1)

EDIT

In the comments on my answer, Piotr disagrees; he points out the difference between ordinal encoding and label encoding more generally (as opposed to differences in their sklearn implementations). Piotr's right about the general definitions/usages:

  • Ordinal encoding should be used for ordinal variables (where order matters, like cold, warm, hot);
  • Label encoding should be used for non-ordinal (aka nominal) variables (where order doesn't matter, like blonde vs brunette)

This is a good point, but this question asks about the sklearn classes/implementation. If you want ordinal encoding like Piotr describes (i.e. where order is preserved), you must specify the order yourself: neither OrdinalEncoder nor LabelEncoder can infer it! (See the OrdinalEncoder constructor parameter called categories.)

As for implementation, LabelEncoder and OrdinalEncoder seem to behave consistently in the integers they choose: both assign integers based on alphabetical order. For example:

OrdinalEncoder().fit_transform([['cold'],['warm'],['hot']]).reshape((1,3))
# >>> array([[0., 2., 1.]])

LabelEncoder().fit_transform(['cold','warm','hot'])
# >>> array([0, 2, 1], dtype=int64)

Notice how both encoders assigned integers in alphabetical order 'c'<'h'<'w'.

But this part is important: notice how neither encoder recovered the "real" order (i.e. the order that reflects temperature, where 'cold'<'warm'<'hot'; 0<1<2). If the encoders used the "real" order, the value 'warm' would have been assigned the integer 1 (instead of the integer 2).

In the blog post referenced by Piotr, the author does not even use OrdinalEncoder(). To achieve ordinal encoding, the author does it manually: each temperature is mapped to a "real"-order integer, using a dictionary like {'cold':0, 'warm':1, 'hot':2}:

Refer to this code using Pandas, where first we need to assign the real order of the variable through a dictionary... Though it's very straightforward, it requires coding to tell the ordinal values and the actual mapping from text to integer, as per the order.
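
A minimal sketch of that manual approach (the DataFrame and column name here are hypothetical, not taken from the post):

    import pandas as pd

    df = pd.DataFrame({'temperature': ['hot', 'cold', 'warm', 'cold']})

    # The "real" order is supplied by hand, not inferred:
    order = {'cold': 0, 'warm': 1, 'hot': 2}
    df['temperature_encoded'] = df['temperature'].map(order)

    df['temperature_encoded'].tolist()
    # >>> [2, 0, 1, 0]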

In other words, if you're wondering whether to use OrdinalEncoder, please note OrdinalEncoder may not actually provide "ordinal encoding" the way you expect!

EDIT @Magnus Persson pointed out that the OrdinalEncoder class accepts an argument called categories, which you can use to determine/assign the resulting order.

(OrdinalEncoder(categories=[['cold','warm','hot']])
    .fit_transform([['hot'],['warm'],['warm'],['cold']])
    .reshape((1,-1))[0])

# Output is:
# >>> array([2., 1., 1., 0.])

EDIT @lbcommer pointed out that there is a Python library category_encoders, which has an OrdinalEncoder class. Note how even that class constructor has a mapping argument so you can choose the resulting order:

the value of 'mapping' should be a dictionary of 'original_label' to 'encoded_label'... example mapping: {'col': 'col1', 'mapping': {None: 0, 'a': 1, 'b': 2}}, {'col': 'col2', 'mapping': {None: 0, 'x': 1, 'y': 2}}
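
A minimal sketch of how that mapping argument can be used (assuming the category_encoders package is installed; the DataFrame and column name temp are hypothetical):

    import pandas as pd
    import category_encoders as ce

    df = pd.DataFrame({'temp': ['hot', 'cold', 'warm']})

    # The desired order is passed in explicitly via `mapping`:
    encoder = ce.OrdinalEncoder(mapping=[
        {'col': 'temp', 'mapping': {'cold': 0, 'warm': 1, 'hot': 2}}
    ])
    encoder.fit_transform(df)['temp'].tolist()
    # >>> [2, 0, 1]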


You use ordinal encoding to preserve the order of categorical data, e.g. cold, warm, hot; low, medium, high. You use label encoding or one-hot encoding for categorical data where there's no order, e.g. dog, cat, whale. Check this post on Medium; it explains these concepts well.
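
For instance, a quick sketch of one-hot encoding such nominal data (assuming scikit-learn >= 1.2 for the sparse_output parameter):

    from sklearn.preprocessing import OneHotEncoder

    OneHotEncoder(sparse_output=False).fit_transform([['dog'], ['cat'], ['whale']])
    # >>> array([[0., 1., 0.],
    #            [1., 0., 0.],
    #            [0., 0., 1.]])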


Afaik, both have the same functionality. A slight difference is the idea behind them: OrdinalEncoder is for converting features, while LabelEncoder is for converting the target variable.

That's why OrdinalEncoder can fit data that has the shape (n_samples, n_features), while LabelEncoder can only fit data that has the shape (n_samples,). (In the past, one used LabelEncoder in a loop over columns to handle what has now become the job of OrdinalEncoder; see the sketch below.)
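
A minimal sketch of that older per-column pattern (assuming NumPy; the data is made up for illustration):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    X = np.array([['a', 'x'],
                  ['b', 'y'],
                  ['a', 'y']])

    # Old pattern: one LabelEncoder per column, applied in a loop
    encoded = np.column_stack([
        LabelEncoder().fit_transform(X[:, j]) for j in range(X.shape[1])
    ])
    encoded
    # >>> array([[0, 0],
    #            [1, 1],
    #            [0, 1]])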
