When is the sum of models the model of the sum?

The response variable in a regression problem, $Y$, is modeled using a data matrix $X$.

In notation, this means:

$Y \sim X$

However, $Y$ can be separated out into different components that can be modeled independently.

$$Y = Y_1 + Y_2 + Y_3$$

Under what conditions would $M$, the overall prediction, have better or worse performance than $M_1 + M_2 + M_3$, a sum of individual models?

To provide more background, the model used is a GBM. I was surprised to find that training a model for a specific $Y_i$ gave about the same performance as using the overall model $M$ to predict that $Y_i$. The $Y_i$'s are highly correlated. In hindsight this is not surprising: a model trained on a target that is highly correlated with $Y_i$ will produce predictions that are also correlated with $Y_i$.
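For concreteness, here is a minimal sketch of that comparison on synthetic data (not the original experiment; the data-generating process, noise scales, and hyperparameters below are all made-up assumptions). It fits one GBM on the summed target $Y$ and one on a single component $Y_1$, then reports how each model's predictions correlate with $Y_1$ on held-out data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))

# A shared signal makes the components highly correlated, as in the question.
shared = X[:, 0] ** 2 + X[:, 1]
Y1 = shared + 0.3 * X[:, 2] + rng.normal(scale=0.1, size=n)
Y2 = shared + 0.3 * X[:, 3] + rng.normal(scale=0.1, size=n)
Y3 = shared + 0.3 * X[:, 4] + rng.normal(scale=0.1, size=n)
Y = Y1 + Y2 + Y3

X_tr, X_te, Y_tr, Y_te, Y1_tr, Y1_te = train_test_split(X, Y, Y1, random_state=0)

M = GradientBoostingRegressor(random_state=0).fit(X_tr, Y_tr)    # overall model on Y
M1 = GradientBoostingRegressor(random_state=0).fit(X_tr, Y1_tr)  # component model on Y1

# Compare how well each model's predictions track Y1 on held-out data.
for name, model in [("M1 (fit on Y1)", M1), ("M  (fit on Y) ", M)]:
    corr = np.corrcoef(model.predict(X_te), Y1_te)[0, 1]
    print(f"{name}: correlation with Y1 = {corr:.3f}")
```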

As an analogy, take the case of a linear model with independent response variables.

The overall model is

$Y = X\beta$

It is trivial to see that the sum of the models is the model of the sum.

$Y = X\beta = X\beta_1 + X\beta_2 + X\beta_3 = X(\beta_1 + \beta_2 + \beta_3)$

If the $Y$'s are independent then the $\beta$'s will be as well. This implies that each of the model coefficients will be unchanged. Take for example a two-dimensional case (where $X$ has two columns).

For $i \neq j$, $Y_i = X(\beta_i + \beta_j) = X\beta_i + 0$, since the contribution of $\beta_j$ to $Y_i$ is zero.
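Since the OLS estimate $\hat\beta = (X^\top X)^{-1} X^\top Y$ is linear in the response, the coefficient-addition step can be verified numerically. Below is a small self-contained check on made-up data (the design and true coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # two-column design, as in the example above

# Made-up components, each driven mostly by a different column of X.
Y1 = X @ np.array([1.0, 0.0]) + rng.normal(scale=0.1, size=100)
Y2 = X @ np.array([0.0, 2.0]) + rng.normal(scale=0.1, size=100)
Y3 = X @ np.array([0.5, -1.0]) + rng.normal(scale=0.1, size=100)

beta1, *_ = np.linalg.lstsq(X, Y1, rcond=None)
beta2, *_ = np.linalg.lstsq(X, Y2, rcond=None)
beta3, *_ = np.linalg.lstsq(X, Y3, rcond=None)
beta, *_ = np.linalg.lstsq(X, Y1 + Y2 + Y3, rcond=None)

# For OLS the sum of the fitted models is exactly the fitted model of the sum.
print(np.allclose(beta, beta1 + beta2 + beta3))  # True
```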

The two models are not equivalent in general, though they may happen to give similar results. There are multiple issues here. What you actually have in $Y$ is something like: $$Y = \tilde{Y} + \epsilon = \tilde{Y_1} + \epsilon_1 + \tilde{Y_2} + \epsilon_2 + \tilde{Y_3} + \epsilon_3$$ where $\tilde{Y}$ and the $\tilde{Y_i}$ are the true functions you want to approximate and $\epsilon$ and the $\epsilon_i$ are errors/noise.

One issue is how you split the output $Y$ into components. Suppose you have knowledge about the components of $Y$ and can split it into simpler functions that are easier to fit independently. In practice, however, you have a sample and each observation carries error, so you would also have to split the error components. That is not easy, since most of the time the assumption is only that the errors are independent, if not identically distributed as well.

The second issue is the variance of the models. In the analogy with linear models you stated $$Y = X\beta = X\beta_1 + X\beta_2 + X\beta_3 = X(\beta_1 + \beta_2 + \beta_3)$$ This is actually true only in expectation: $$E[Y] = X\beta = X\beta_1 + X\beta_2 + X\beta_3 = X(\beta_1 + \beta_2 + \beta_3)$$ But your predictions also have variances, which can pull you away from the true functions. If we assume the noise of each component is mutually independent, then the variances accumulate, giving an inflated variance for the summed model: $$\operatorname{Var}(\epsilon) \leq \operatorname{Var}(\epsilon_1) + \operatorname{Var}(\epsilon_2) + \operatorname{Var}(\epsilon_3)$$ If the component errors are not independent, then you also have to add the covariances to the picture, and you can easily get into trouble.
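To make the variance point concrete, a quick simulation (purely illustrative; the noise scales are arbitrary) shows that the variances of mutually independent error components add up when the components are summed:

```python
import numpy as np

rng = np.random.default_rng(2)
e1 = rng.normal(scale=1.0, size=100_000)
e2 = rng.normal(scale=1.5, size=100_000)
e3 = rng.normal(scale=2.0, size=100_000)

# Under independence the variance of the sum equals the sum of the variances,
# so errors from separately fitted components accumulate in the summed prediction.
print("Var(e1) + Var(e2) + Var(e3):", e1.var() + e2.var() + e3.var())
print("Var(e1 + e2 + e3):          ", (e1 + e2 + e3).var())
```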

In conclusion: if you have three separate sets of observations, one for each $Y_i$, then it is usually better to fit them separately rather than fit the sum. But if you only have the summed data, it is usually much harder to split it, and the results are likely to be less stable, even though in expectation you have the same target.
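As a self-contained illustration of that trade-off (synthetic data and arbitrary settings, so treat the numbers as an experiment to rerun rather than a general result), the sketch below fits one GBM per separately observed component and sums the predictions, versus a single GBM fit directly on the summed target:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 2000, 5
X = rng.normal(size=(n, p))

# Made-up components with independent noise.
components = [
    X[:, 0] ** 2 + rng.normal(scale=0.5, size=n),
    np.sin(X[:, 1]) + rng.normal(scale=0.5, size=n),
    0.5 * X[:, 2] + rng.normal(scale=0.5, size=n),
]
Y = sum(components)

idx_tr, idx_te = train_test_split(np.arange(n), random_state=0)

# One model fit directly on the summed target.
M = GradientBoostingRegressor(random_state=0).fit(X[idx_tr], Y[idx_tr])

# One model per component, with predictions summed afterwards.
Ms = [GradientBoostingRegressor(random_state=0).fit(X[idx_tr], Yi[idx_tr]) for Yi in components]
summed_pred = sum(m.predict(X[idx_te]) for m in Ms)

print("MSE, single model on Y:      ", mean_squared_error(Y[idx_te], M.predict(X[idx_te])))
print("MSE, sum of component models:", mean_squared_error(Y[idx_te], summed_pred))
```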
