Why is each successive tree in GBM fit on the negative gradient of the loss function?

Page 359 of The Elements of Statistical Learning (2nd edition) says the following.

Can someone explain the intuition and simplify it in layman's terms?

Questions

  1. What is the reasoning/intuition and math behind fitting each successive tree in GBM on the negative gradient of the loss function?
  2. Is it done to make GBM generalize better to an unseen test dataset? If so, how does fitting on the negative gradient achieve this generalization on test data?

Topic loss-function gbm gradient-descent optimization machine-learning

Category Data Science


The negative gradient is used because the loss function is being minimized. Moving "downhill" along the negative gradient is the direction that reduces the expected error most quickly.
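To make this concrete, here is a sketch of the gradient-descent-in-function-space view (ESL-style notation; $F_{m-1}$ is the current ensemble's prediction and $\rho_m$ a step size chosen by line search):

```latex
% The new tree approximates the negative gradient of the loss with respect
% to the current predictions, and the ensemble takes a step in that direction.
g_{im} = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) - \rho_m \, g_m(x).

% For squared-error loss L(y, F) = \tfrac{1}{2}(y - F)^2, the negative
% gradient is simply the residual of the current ensemble:
-g_{im} = y_i - F_{m-1}(x_i).
```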

The text is trying to explain the logic of boosting. Instead of fitting a single model to the training data all at once, the model is built in stages: at each stage a new tree is fit to the negative gradient of the loss evaluated at the current ensemble's predictions (for squared-error loss these are just the residuals). In this way each new tree learns from the errors the ensemble is still making.
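A minimal sketch in Python of this staged fitting (assuming scikit-learn for the base trees; the hyperparameter names and data are illustrative, not from the book). With squared-error loss, fitting each tree on the negative gradient amounts to fitting it on the current residuals:

```python
# Gradient boosting sketch with squared-error loss: the negative gradient of
# 1/2 * (y - F)^2 with respect to F is the residual y - F.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 50, 0.1
F = np.full_like(y, y.mean())          # F_0: constant initial prediction
trees = []

for m in range(n_trees):
    residuals = y - F                  # negative gradient at the current fit
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)             # fit the next tree to those residuals
    F += learning_rate * tree.predict(X)   # small step "downhill"
    trees.append(tree)

def predict(X_new):
    """Prediction of the boosted ensemble on new data."""
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - F) ** 2))
```

Each iteration reduces the training loss a little, which is exactly the "moving downhill" described above.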
