Why is each successive tree in GBM fit on the negative gradient of the loss function?
Page 359 of The Elements of Statistical Learning (2nd edition) describes this. Can someone explain the intuition and put it in layman's terms?
Questions
- What is the reasoning/intuition and math behind fitting each successive tree in GBM on the negative gradient of the loss function? (A minimal sketch of my current understanding is below.)
- Is it done to make GBM generalize better to an unseen test dataset? If so, how does fitting on the negative gradient achieve this generalization on test data?
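For concreteness, here is a minimal sketch of what I understand the algorithm to do, assuming squared-error loss, a synthetic one-dimensional dataset, and scikit-learn's DecisionTreeRegressor as the base learner (these choices are mine for illustration, not the book's exact setup). For this loss the negative gradient is just the ordinary residual y - f(x), and each new tree is fit on those residuals:

```python
# Minimal gradient tree boosting sketch under squared-error loss.
# Synthetic data and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees = 50
learning_rate = 0.1

# Start from a constant prediction (the mean minimizes squared error).
f = np.full_like(y, y.mean())
trees = []

for m in range(n_trees):
    # Negative gradient of 0.5 * (y - f)^2 with respect to f is (y - f),
    # i.e. the ordinary residuals -- the "pseudo-residuals" that each
    # successive tree is fit on.
    pseudo_residuals = y - f

    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, pseudo_residuals)
    trees.append(tree)

    # Take a small step in the direction suggested by the new tree.
    f += learning_rate * tree.predict(X)

print("training MSE after boosting:", np.mean((y - f) ** 2))
```

My question is why this negative-gradient step is the right thing to fit each tree on in general (for losses other than squared error), and whether that choice has anything to do with generalization.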
Topic loss-function gbm gradient-descent optimization machine-learning
Category Data Science