How to optimize hyperparameters in a stacked model?

I was wondering whether somebody could explain how to optimize the hyperparameters of the base learners and the meta-algorithm when stacking. In many tutorials they seem to be plucked out of thin air!

Thanks,

Jack

I believe the most common approach involves a small amount of data leakage during training that is usually ignored. The "correct" approach requires setting aside even more training data, but many people have found empirically that giving up that extra training data often performs worse.

  1. Split your data into training and testing.
  2. Split the training set into k folds.
  3. Tune your base models with k-fold cross-validation. For each base model, save the out-of-fold predictions made by the best (hyperparameter-optimized) version of that model on each of the k folds.

Example: suppose we use 3-fold cross-validation and have two base models. For each base model, we find the hyperparameters that optimize some loss function on average across the three folds. We then save the predictions made by this optimal base model over those same three folds (i.e. exactly the predictions we already made when scoring the model on each fold).

In essence, you are transforming your original training set piece by piece with your base models and then stitching those pieces back together into a new training set of out-of-fold predictions.
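Here is a minimal sketch of steps 1 to 3 in scikit-learn, assuming a binary classification task with 3-fold cross-validation and two base models as in the example above; the specific models, parameter grids, dataset and scoring choice are illustrative assumptions, not part of the recipe itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_predict,
                                     train_test_split)

X, y = make_classification(n_samples=1000, random_state=0)

# Step 1: split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: fix the k folds (here k = 3) that every base model will use.
folds = KFold(n_splits=3, shuffle=True, random_state=0)

# Two illustrative base models with small hyperparameter grids.
base_models = {
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [100, 300], "max_depth": [None, 5]}),
    "lr": (LogisticRegression(max_iter=1000),
           {"C": [0.1, 1.0, 10.0]}),
}

best_base = {}   # tuned base estimators, reused later when refitting
oof_preds = {}   # out-of-fold predictions that become the meta-features
for name, (model, grid) in base_models.items():
    # Step 3a: pick the hyperparameters that do best on average over the folds.
    search = GridSearchCV(model, grid, cv=folds, scoring="neg_log_loss")
    search.fit(X_train, y_train)
    best_base[name] = search.best_estimator_

    # Step 3b: save the tuned model's predictions for each fold; every row is
    # predicted by a model that did not train on that row.
    oof_preds[name] = cross_val_predict(
        search.best_estimator_, X_train, y_train,
        cv=folds, method="predict_proba")[:, 1]

# The "reconstructed" training set: one meta-feature column per base model.
meta_train = np.column_stack([oof_preds[name] for name in base_models])
```

Passing the same `KFold` object to both the grid search and `cross_val_predict` keeps the splits identical, so the saved predictions are exactly the out-of-fold predictions used when scoring each base model.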

  4. Using your reconstructed training set of predictions from folds 1, 2 and 3, use k-fold cross-validation again to train your combiner (i.e. find optimal hyperparameters again, but this time for the combiner). You can reuse the same splits as in step 3, but the choice does not matter. This is where the data leakage lies: no matter how you split the data here, your validation set will contain features that were built directly from observations in your training set. For example, suppose folds 1 and 2 make up the training set and fold 3 is used for validation. Because the predictions in folds 1 and 2 were made by base models trained on fold 3 observations, your validation scores will be optimistic and you risk overfitting here. In Kaggle competitions this is often ignored, probably because stacking tends to be used on large datasets, and on large datasets minor leaks matter less (in general).

  5. Fit your base models on the entire training set, using the hyperparameters found in step 3. Predict the training set to generate the meta-features for your combiner, and also predict the test set created in step 1.

  6. Fit your combiner on the meta-features generated in the previous step, using the hyperparameters found in step 4.

  7. Using the combiner from step 6, predict the test-set meta-features you generated with your base models in step 5. Compute your loss on these predictions and report it as your model's performance (see the sketch after this list).
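Continuing the sketch above for steps 4 through 7, reusing `X_train`, `X_test`, `y_train`, `y_test`, `folds`, `best_base` and `meta_train` from the earlier block; the choice of a logistic regression combiner and log loss as the reported metric is again an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV

# Step 4: tune the combiner on the out-of-fold meta-features. Any split used
# here contains columns built from the other folds' observations, which is
# the small leak discussed above.
combiner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=folds, scoring="neg_log_loss")
combiner_search.fit(meta_train, y_train)

# Step 5: refit each base model on the full training set with the
# hyperparameters found in step 3, then build meta-features for both the
# training set and the held-out test set.
train_cols, test_cols = [], []
for name, estimator in best_base.items():
    estimator.fit(X_train, y_train)
    train_cols.append(estimator.predict_proba(X_train)[:, 1])
    test_cols.append(estimator.predict_proba(X_test)[:, 1])
meta_train_full = np.column_stack(train_cols)
meta_test = np.column_stack(test_cols)

# Step 6: fit the combiner on the step-5 training meta-features, using the
# hyperparameters found in step 4.
combiner = LogisticRegression(max_iter=1000, **combiner_search.best_params_)
combiner.fit(meta_train_full, y_train)

# Step 7: predict the test-set meta-features and report the held-out loss.
test_loss = log_loss(y_test, combiner.predict_proba(meta_test)[:, 1])
print(f"held-out log loss: {test_loss:.4f}")
```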
