What is the form of data used for prediction with generalized stacking ensemble?
I am very confused about how the training data is split and which data the level-0 predictions are made on when using generalized stacking. This question is similar to mine, but the answer is not sufficiently clear:
How predictions of level 1 models become training set of a new model in stacked generalization.
My understanding is that the training set is split, the base models are trained on one split, and predictions are made on the other. These predictions then become the features of a new dataset: one column for each model's predictions, plus a column containing the ground truth for those predictions.
- Split training data into train/test.
- Train base models on training split.
- Make predictions on the test split (according to the linked answer, use k-fold CV for this).
- Create a feature for each model, filled with that model's predictions.
- Create a column holding the ground truth for those predictions.
- Create a new model and train it on the prediction features, with the ground-truth column as the target (see the sketch after this list).
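To make sure I have this right, here is a minimal sketch of the procedure as I understand it, in scikit-learn. The dataset and model choices are arbitrary placeholders, and I am assuming `cross_val_predict` is what the linked answer means by generating out-of-fold predictions with k-fold CV:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

# Level-0, out-of-fold predictions: every row is predicted by a model
# that was not trained on it. One column per base model.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The "ground truth" column is just y, used as the meta-model's target.
meta_model = LogisticRegression()
meta_model.fit(meta_features, y)

# Refit the base models on all of the training data so they can be
# applied to new data at prediction time.
for model in base_models:
    model.fit(X, y)
```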
Question 1: Are these the only features used to train the "meta" model? In other words, are none of the actual features of the original data included? The linked answer says it is common to include the original data, but I have not read about it elsewhere.
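If the linked answer is right about including the original data, I assume that would just mean concatenating it onto the prediction columns (scikit-learn's `StackingClassifier` exposes this as `passthrough=True`). In terms of the sketch above:

```python
# Variant for Question 1: keep the original features alongside the
# base-model predictions when training the meta-model.
meta_features_plus = np.hstack([meta_features, X])
```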
Question 2: If the above algorithm is correct, what is the form of the data when making predictions? It seems it would also have to have predictions as the independent variables. If so, that means running all new incoming data through all of the base models again, right?
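In other words, my guess for prediction time (continuing the same placeholder sketch) is:

```python
# New, unseen rows must first pass through every base model; their
# predictions become the meta-model's input features.
X_new = np.random.RandomState(1).normal(size=(5, X.shape[1]))  # placeholder

new_meta_features = np.column_stack([
    model.predict_proba(X_new)[:, 1] for model in base_models
])
final_predictions = meta_model.predict(new_meta_features)
```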
Question 3: I keep seeing an "out-of-fold" requirement for the first-level predictions. It seems that a simple train/test split as described above would fulfill this. However, would you not want a third split on which to test the combined model's generalization? Or is this type of ensemble bulletproof enough not to worry about that?
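What I mean by a third split is something like the following, done before any of the stacking above:

```python
from sklearn.model_selection import train_test_split

# Hold out data that neither level ever sees, purely to estimate the
# combined ensemble's generalization error.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...run the whole stacking procedure on (X_train, y_train) only,
# then score meta_model on the holdout via the base models, as in Question 2.
```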