Should one log transform discrete numerical variables?

Question

spectre

November 17, 2021, 15:28

I am working on a Linear Regression problem, and one of the assumptions of a Linear Regression model is that the features should be Normally Distributed. Hence, to convert my non-linear features to linear, I am applying several transformations such as log, Box-Cox and square root. I have both discrete and continuous numerical variables (an example of each, along with its histogram and QQ plot, is given):

CONTINUOUS VARIABLE HISTOGRAM AND QQ PLOT

DISCRETE VARIABLE HISTOGRAM AND QQ PLOT

From the QQ plot of the continuous variable, we can see that some points do not lie on the red line, so it needs some kind of transformation. I might try different transformations to see which one gives a Normal Distribution and so makes the points fall on the red line.

But what about the discrete variable? From the qq plot of the discrete variable, all the points are forming a horizontal line so will transforming them make them fall on the red line? Should I proceed the same as I do in the case of a continuous variable, or is there some other method?

Topic transformation feature-engineering linear-regression python

Category Data Science

rapaio · Accepted Answer · November 17, 2021, 15:28

First of all, in standard linear regression there is no assumption of normality for the features. More than that, standard linear regression, also known as fixed effects linear regression, is a model where the input variables are given, so they are not random variables. Under that model only the target variable is a random variable: $$y = X\beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0,\sigma^2)$$ There are quite a few assumptions for a linear model, like independent and homoskedastic noise, additive features and so on. Sometimes you need to transform features to meet those assumptions. I enumerate some cases I encountered:

Your model is linear (yes, I know, it sounds redundant, but it is not). The point is that the input variables should combine additively to form hyperplanes (in 2 dimensions this is a line), otherwise your model does not work. Imagine you collect observations about gravitational attraction, which follows the formula $F_G = -\frac{G m_1 m_2}{r^2}$ (the inverse square law: the force is proportional to the product of the masses and inversely proportional to the square of the distance between them). You want to regress $F_G$ as a linear model on the input variables $m_1$, $m_2$, $r$ ($G$ is a constant). Obviously, modeling that as $F_G = \beta_0 + \beta_1 m_1 + \beta_2 m_2 + \beta_3 r + \epsilon$ will not work at all, it will be wild. But if you take logarithms of all the variables involved (using the magnitude of the force), your data becomes linearly additive: $\log|F_G| = \log G + \log m_1 + \log m_2 - 2\log r$. Most of the time you do not know the laws which govern your data, but with careful inspection of the relations between your input variables you can often get them into good shape for a linear model. For that, look at how several inputs jointly relate and contribute to the target, not at each feature independently.
homoskedastic noise - this means constant variance of the error, or in plain English the variance of the error should not depend on the input variables. Imagine that your output variable is a volume, something like $v^d$, where $v$ is a side length and $d$ is the dimensionality of the space. An error of 1 centimeter in the side produces a much smaller error in volume for a cube of side 10 than for a cube of side 100. Basically, the observed variance will increase significantly for large values of the target variable, which violates the assumption. Again, most of the time you might not know the "laws" behind your variables, but graphical inspection can help a lot to isolate this kind of deviation so that adjustments can be made. In the example I gave, a logarithmic transformation (or Box-Cox, a power transform, etc.) could again help you a lot.
discrete variables - here you have different options. You could have an ordinal variable, such as a temperature recorded as low, medium or high. Such a variable can be encoded as a numerical column, but pay attention to the value you assign to each level: those values directly affect the coefficients, so they should make sense. If your discrete variable encodes mutually exclusive categories, like eye color (brown, blue, green, whatever), you had better encode it as binary variables, using one fewer than the number of categories; otherwise, with an intercept in the model, the columns are perfectly collinear and the matrix $X^TX$ cannot be inverted. The discussion so far covers discrete variables given as factors (possibly as text), but it applies to numeric encodings too. If the color is encoded as 1, 2, 3, .., instead of strings, you should still turn it into binary variables when it is a nominal factor. If you have an ordinal variable, you can probably leave it as it is, unless you have a clue about more suitable values. I will try to give another example. Suppose you have numerically encoded the magnitude of an earthquake as 1 for (1-3 Richter), 2 for (4-6 Richter), 3 for (7-9 Richter) and 4 for a catastrophic one. You could keep those values, or you could use the fact that the Richter scale is logarithmic: each whole step up corresponds to roughly a tenfold increase in measured amplitude (I do not remember the details precisely, but you get the idea). In that case you could substitute $10^i$ for $i$ to get a better alignment with the linear model.
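
Two of the cases above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not a recipe: the data is simulated, the mass and distance ranges, the noise level, and the helper `ols` are all invented for the example, and the force is taken as a positive magnitude so that its logarithm exists. The first part fits the gravitational law on the raw scale and on the log scale; the second part builds indicator columns for a nominal variable and the $10^i$ recoding for the ordinal earthquake bands.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return beta, 1.0 - resid.var() / y.var()

# Case 1: a multiplicative law becomes additive after taking logs.
# Simulated magnitudes |F_G| with multiplicative noise (assumed setup).
n = 500
G = 6.674e-11
m1, m2, r = (rng.uniform(1.0, 10.0, n) for _ in range(3))
F = G * m1 * m2 / r**2 * np.exp(rng.normal(0.0, 0.05, n))

X = np.column_stack([m1, m2, r])
beta_raw, r2_raw = ols(X, F)                  # additive model on raw scale
beta_log, r2_log = ols(np.log(X), np.log(F))  # additive model on log scale

print("R^2 raw:", round(r2_raw, 3), " R^2 log:", round(r2_log, 3))
print("log-scale slopes (law says 1, 1, -2):", np.round(beta_log[1:], 2))

# Case 3a: nominal variable -> indicator columns, one level dropped.
colors = np.array(["brown", "blue", "green", "blue", "brown", "green"])
levels = np.unique(colors)                    # sorted category labels
full = (colors[:, None] == levels[None, :]).astype(float)
dropped = full[:, 1:]                         # first level is the reference

ones = np.ones((len(colors), 1))
# With an intercept, the full set of indicators is collinear:
rank_full = np.linalg.matrix_rank(np.hstack([ones, full]))        # 4 columns
rank_dropped = np.linalg.matrix_rank(np.hstack([ones, dropped]))  # 3 columns
print("rank with all indicators:", rank_full, "of 4 columns")
print("rank with one dropped:  ", rank_dropped, "of 3 columns")

# Case 3b: ordinal bands 1..4 recoded as 10**i, as suggested above.
band = np.array([1, 2, 3, 4])
band_feature = 10.0 ** band
```

On the log scale the fitted slopes should sit close to the exponents of the law, while the raw-scale fit has no such interpretation. Whether $10^i$ is the right recoding for a real ordinal variable depends on what the bands actually mean, so treat it as a starting point to check against the data.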

Some things cannot be repaired. For example, if your data comes from a time series where observations clearly and strongly depend on past observations, that kind of problem cannot be easily solved, and you should probably take another approach anyway, since fixed effects linear models are not recommended for such cases.

In conclusion, you could study the assumptions of linear regression in more depth, try to understand how to check (or at least inspect) whether those assumptions are met, and see whether any violations can be reasonably corrected. This should be your target: to bring your data in line with the linear model assumptions, if this is the model you want to use. All transformations of the data should be governed by this idea.

And of course, please remember what you have done to transform the data, so that you can apply the same transformations to future inputs at prediction time. You may also need to invert the target transformation if you want results on the original scale of the target.
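
As a minimal sketch of that last point, assuming a log-transformed target and simulated data (the coefficients 0.5 and 0.8, the noise level, and the function names `fit_log_target` and `predict_original_scale` are made up for illustration), the model is fitted on $\log y$ and predictions are mapped back with the inverse transformation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: the target is positive and log-linear in x (assumed setup).
x = rng.uniform(0.0, 5.0, 200)
y = np.exp(0.5 + 0.8 * x + rng.normal(0.0, 0.1, 200))

def fit_log_target(x, y):
    """Fit log(y) = b0 + b1 * x by least squares; returns [b0, b1]."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return beta

def predict_original_scale(beta, x_new):
    """Predict on the log scale, then invert the target transform."""
    x_new = np.asarray(x_new, dtype=float)
    log_pred = beta[0] + beta[1] * x_new
    return np.exp(log_pred)  # inverse of the log applied during fitting

beta = fit_log_target(x, y)
pred = predict_original_scale(beta, [1.0, 2.0])
print("fitted coefficients:", np.round(beta, 2))
print("predictions on the original scale:", np.round(pred, 2))
```

One caveat with this simple inversion: exponentiating a prediction made on the log scale estimates the conditional median of the target rather than its mean when the errors are normal on the log scale, so a correction may be needed if the mean is what you want. Any transformation applied to the features during training has to be applied in the same way to new inputs before calling the prediction function.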

[later edit]: I added some ideas on discrete variables.
