Input Standardization for Deep Learning - Proper Scaling
Typically, the input to a neural network (NN) is transformed to have zero mean and a standard deviation of 1.
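For reference, here is a minimal sketch of the usual standardization in NumPy (`X` is just a placeholder for the training data):

```python
import numpy as np

# Placeholder training data: 1000 samples, 20 features
X = np.random.rand(1000, 20)

# Per-feature mean and standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: zero mean, unit standard deviation per feature
X_standardized = (X - mean) / std
```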
I wonder why the target standard deviation should be 1. What about other scales, such as 10 or 100? Wouldn't it make sense to provide the NN with input over a wider range, so that it can separate different clusters more easily and handle the loss function for each cluster in a simpler, more robust way? Has anyone here tried different scales and can share their experience? (See the sketch below for what I mean.)
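Concretely, the alternative I have in mind is something like the following, where `scale` would be set to 10 or 100 instead of 1:

```python
import numpy as np

X = np.random.rand(1000, 20)  # placeholder training data

scale = 10.0  # target standard deviation: 1, 10, 100, ...

# Standardize first, then multiply by the target scale,
# giving zero mean and std == scale for every feature
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0) * scale
```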
If the answer depends on the activation function: in my case I use ReLU.
Thanks a lot!
Topic: feature-scaling, deep-learning
Category: Data Science