Should I generalize categorical features if the algorithm handles over-fitting well?
I'm referring to the Kaggle feature creation exercise. The data frame contains a column (MSSubClass) with these unique values:
'One_Story_1946_and_Newer_All_Styles',
'Two_Story_1946_and_Newer',
'One_Story_PUD_1946_and_Newer',
'One_and_Half_Story_Finished_All_Ages',
'Split_Foyer',
'Two_Story_PUD_1946_and_Newer',
'Split_or_Multilevel',
'One_Story_1945_and_Older',
'Duplex_All_Styles_and_Ages',
'Two_Family_conversion_All_Styles_and_Ages',
'One_and_Half_Story_Unfinished_All_Ages',
'Two_Story_1945_and_Older',
'Two_and_Half_Story_All_Ages',
'One_Story_with_Finished_Attic_All_Ages',
'PUD_Multilevel_Split_Level_Foyer',
'One_and_Half_Story_PUD_All_Ages'
and the exercise generalizes these values into the following:
'One', 'Two', 'Split', 'Duplex', 'PUD'
(by splitting each value on underscores and keeping the first word).
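For reference, here is a minimal sketch of that transformation in pandas, assuming the data sits in a DataFrame named df (the df name and the inline sample rows are my own illustration, not from the exercise):

```python
import pandas as pd

# Illustrative subset of the MSSubClass values listed above.
df = pd.DataFrame({
    "MSSubClass": [
        "One_Story_1946_and_Newer_All_Styles",
        "Two_Story_PUD_1946_and_Newer",
        "Split_Foyer",
        "Duplex_All_Styles_and_Ages",
        "PUD_Multilevel_Split_Level_Foyer",
    ]
})

# Split on the first underscore and keep the leading word
# as the coarser category.
df["MSClass"] = df["MSSubClass"].str.split("_", n=1).str[0]

print(df["MSClass"].unique())  # ['One' 'Two' 'Split' 'Duplex' 'PUD']
```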
Is this kind of generalization needed if I only use random forests as my algorithm to make predictions?
It seems this kind of generalization loses some amount of information from the data. Also, random forests are good at handling over-fitting.
Topic: generalization, feature-engineering, random-forest
Category: Data Science