Is normalization needed for target-encoded variables?

Basically the title.

If I encode people's addresses (the cities they live in) with a target encoder, do I still need to normalize that column? Of course, the capital has more citizens, and bigger cities do too, so the column's distribution looks roughly exponential. In that case, is normalization (via a log transform, for example) still needed, or are target-encoded variables enough? Why?

Thank you!

Tags: target-encoding, normalization, data-cleaning

Category: Data Science


(I can't leave comments yet so I'll have to write an answer instead.) It depends on what kind of model you're using. And more specifically...

If you're using something like logistic regression, or a neural network with non-linear activations (ReLU, tanh, sigmoid, and especially softmax), you absolutely need to do some type of normalization/standardization on all the features you decide to incorporate, because the loss functions and their gradients behave badly when features sit on very different scales.
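As a minimal sketch of what that looks like in practice (the city names and targets below are made up, and this is a plain mean-target encoding with no smoothing), you'd target-encode the column and then standardize the result before handing it to a gradient-based model:

```python
from statistics import mean, stdev

# Toy data: (city, binary target). All values are invented for illustration.
rows = [
    ("Capital", 1), ("Capital", 1), ("Capital", 0), ("Capital", 1),
    ("BigCity", 1), ("BigCity", 0),
    ("SmallTown", 0),
]

# Target encoding: replace each city with the mean target observed for it.
sums, counts = {}, {}
for city, y in rows:
    sums[city] = sums.get(city, 0) + y
    counts[city] = counts.get(city, 0) + 1
encoding = {c: sums[c] / counts[c] for c in sums}
encoded = [encoding[city] for city, _ in rows]

# Standardize (zero mean, unit variance) before feeding a gradient-based
# model such as logistic regression or a neural network.
mu, sigma = mean(encoded), stdev(encoded)
standardized = [(x - mu) / sigma for x in encoded]
print(standardized)
```

Note that with a binary target the encoded values already land in [0, 1], but standardizing still centers them and puts them on the same footing as your other standardized features.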

Another class of models where you need to normalize/standardize your data is clustering, because clustering relies on a distance metric as its partitioning criterion. If you are incorporating your target-encoded categorical variables into a clustering algorithm, you had better normalize them.
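To see why the metric is the problem, here's a small sketch (feature names and scaling bounds are assumed, not from any real dataset): an unscaled large-range feature completely dominates the Euclidean distance, drowning out the target-encoded column.

```python
import math

# Two points: a target-encoded city feature (values ~0-1) and a raw
# income feature (values in the tens of thousands). Made-up values.
a = {"city_encoded": 0.10, "income": 40_000}
b = {"city_encoded": 0.90, "income": 41_000}

def euclidean(p, q):
    return math.sqrt(sum((p[k] - q[k]) ** 2 for k in p))

# Unscaled: the income gap (1000) swamps the encoded-city gap (0.8),
# so the distance is essentially just the income difference.
print(euclidean(a, b))  # ~1000.0

# After min-max scaling each feature to [0, 1] (bounds assumed known),
# both features contribute comparably to the distance.
def scale(p, lo, hi):
    return {k: (p[k] - lo[k]) / (hi[k] - lo[k]) for k in p}

lo = {"city_encoded": 0.0, "income": 30_000}
hi = {"city_encoded": 1.0, "income": 50_000}
print(euclidean(scale(a, lo, hi), scale(b, lo, hi)))  # ~0.80
```

Any distance-based method (k-means, hierarchical clustering, k-NN) has the same sensitivity.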

On the flip side, if you're using something like a random forest or decision trees (this also applies to their gradient-boosted analogues), it's not necessary to normalize your data, because the partitioning criterion in those models is almost always independent of the scale of the data: a split only asks whether a feature falls above or below a threshold.
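A quick way to convince yourself of this (a toy decision stump with made-up, heavily skewed values, not any library's actual split routine): because a split only compares a feature to a threshold, any strictly monotone transform like log produces exactly the same partition.

```python
import math

# Skewed feature (think city sizes) and binary labels. Invented data.
x = [10, 100, 1_000, 10_000, 1_000_000]
y = [0, 0, 1, 1, 1]

def best_split_partition(xs, ys):
    """Return the (left, right) label lists of the best Gini-style split."""
    best, best_parts = None, None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for cut in range(1, len(xs)):
        left = [ys[order[i]] for i in range(cut)]
        right = [ys[order[i]] for i in range(cut, len(xs))]
        def gini(labels):
            p = sum(labels) / len(labels)
            return p * (1 - p)
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best:
            best, best_parts = score, (left, right)
    return best_parts

raw = best_split_partition(x, y)
logged = best_split_partition([math.log(v) for v in x], y)
print(raw == logged)  # True: the log transform changes nothing
```

The sort order, and therefore every candidate split, is identical before and after the transform, which is exactly why trees don't care about scale or skew.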

So the answer is: it depends on the model, and specifically on how the model turns its inputs into an output (a gradient-descended loss function, a distance metric, discrete threshold conditions that send samples down branches of a tree, etc.).
