Gradient boosting algorithms and filling categorical variables

Question


Vasilii Naumushkin

October 2, 2021, 18:39

I am working with the house prices dataset (link on Kaggle) and I have a dilemma. Some categorical variables have a clear majority class. In the MSZoning and SaleType columns, the RL type accounts for 91% of the values in MSZoning and WD accounts for 87% of the values in SaleType. Before I apply any encoding or labeling, I need to decide whether to fill missing values with None or with the mode. In other words, if we are fairly certain that one category appears about 90% of the time, will the mode be the more effective fill for missing values, or will this decision lead to poorer tree construction and more errors?

The question also extends to boosting algorithms and to decision-tree-based algorithms in general.
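To make the two options concrete, here is a minimal sketch with pandas. The toy values below are made up for illustration and are not taken from the Kaggle data; the helper column names `zoning_mode` and `zoning_none` are hypothetical. It only shows what each fill does to the column before encoding, not which one gives better model accuracy.

```python
import pandas as pd

# Toy stand-in for a column such as MSZoning (illustrative values only).
df = pd.DataFrame({"MSZoning": ["RL", "RL", "RL", None, "RM", "RL", None, "RL"]})

# Option 1: fill missing values with the mode (most frequent category).
# Series.mode() ignores missing values by default.
mode_value = df["MSZoning"].mode()[0]
df["zoning_mode"] = df["MSZoning"].fillna(mode_value)

# Option 2: keep missingness as its own explicit category.
df["zoning_none"] = df["MSZoning"].fillna("None")

# One-hot encode both variants to see what the trees would receive.
encoded_mode = pd.get_dummies(df["zoning_mode"])
encoded_none = pd.get_dummies(df["zoning_none"])

print(encoded_mode.columns.tolist())
print(encoded_none.columns.tolist())
```

With the mode fill, rows that were missing become indistinguishable from true RL rows. With the "None" fill, the encoded data keeps an extra column that a tree can split on if missingness itself carries signal.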

Topic: boosting, missing-data, xgboost

Category: Data Science
