Stacking: How best to treat the base learners' predictions?

With stacking, several (diverse) base learners are used to predict the dependent variable on a hold-out set, $\hat{y}_{b,m} = \beta_{b,m} X$, where $m = 1, \ldots, n$ indexes the base learner models. In a second step, these predictions are used as explanatory variable(s) in a meta learner, $y = \beta_1 X + \beta_2 \hat{y}_b + u$.
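For concreteness, here is a minimal sketch of that setup in Python/scikit-learn; the dataset, the three base learners, and the 5-fold scheme are illustrative assumptions, not part of the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Placeholder regression problem
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

base_learners = [
    Ridge(alpha=1.0),
    KNeighborsRegressor(n_neighbors=5),
    RandomForestRegressor(n_estimators=200, random_state=0),
]

# Out-of-fold predictions y_hat_{b,m}: each column is one base learner's
# prediction for observations it did not see during training.
oof_preds = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for model in base_learners
])
print(oof_preds.shape)  # (500, 3) -> one column per base learner
```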

I wonder how to best treat $\hat{y}_{b,m}$ in practice. There are basically two options:

  • Use each base learner's prediction $\hat{y}_{b,m}$ as a separate feature (column) in the meta learner model.
  • Take the row-wise mean over the different base learners' predictions, $\frac{1}{n}\sum_{m=1}^{n} \hat{y}_{b,m}$, and use it as a single feature (column) in the meta learner. (Both options are sketched in the code below.)
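
Continuing the sketch above (reusing `oof_preds` and `y`), the two options differ only in how the prediction matrix is handed to the meta learner; the plain `LinearRegression` meta learner here is just a placeholder.

```python
from sklearn.linear_model import LinearRegression

# Option 1: each base learner's prediction is its own column.
Z_separate = oof_preds                             # shape (n_samples, n_base_learners)

# Option 2: a single column containing the row-wise mean of the predictions.
Z_mean = oof_preds.mean(axis=1, keepdims=True)     # shape (n_samples, 1)

meta_separate = LinearRegression().fit(Z_separate, y)
meta_mean = LinearRegression().fit(Z_mean, y)
```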

My intuition is that both approaches can work, depending on the choice of the meta learner. For example, when the meta learner uses shrinkage (e.g. Ridge), this may help to shrink the coefficients of the less useful $\hat{y}_{b,m}$ when all of the base learners' predictions are treated as separate features (although correlation between the predictions might be an issue in linear models). A similar logic might apply to meta learners such as boosted trees, where correlation is less of a problem. Using each prediction as a separate feature may also provide more information (variation in the data) that the meta learner can exploit.
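For the shrinkage case, a hedged sketch: scikit-learn's `StackingRegressor` implements option 1 (one out-of-fold meta-feature per base learner), and with a Ridge `final_estimator` the meta learner can shrink the weights of less useful base learners. The particular estimators and `alpha` values are again just assumptions, and `X`, `y` are as in the first sketch.

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # shrinkage over the base learner predictions
    cv=5,                              # meta features are built out-of-fold
)
stack.fit(X, y)
print(stack.final_estimator_.coef_)    # one coefficient per base learner
```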

Nevertheless, averaging seems to be used quite often (if I'm not mistaken) to create a single feature from the different base learners' predictions. I can't really pin down which is the best approach here.

Are there any insights - theory-based or from practical experience - that help to decide which approach is best?

Topic: stacking, machine-learning-model

Category: Data Science
