Should I generalize categorical features if the algorithm handles over-fitting well?
I'm referring to the Kaggle feature creation exercise. The data frame contains a column (MSSubClass) with these unique values:
'One_Story_1946_and_Newer_All_Styles',
'Two_Story_1946_and_Newer',
'One_Story_PUD_1946_and_Newer',
'One_and_Half_Story_Finished_All_Ages',
'Split_Foyer',
'Two_Story_PUD_1946_and_Newer',
'Split_or_Multilevel',
'One_Story_1945_and_Older',
'Duplex_All_Styles_and_Ages',
'Two_Family_conversion_All_Styles_and_Ages',
'One_and_Half_Story_Unfinished_All_Ages',
'Two_Story_1945_and_Older',
'Two_and_Half_Story_All_Ages',
'One_Story_with_Finished_Attic_All_Ages',
'PUD_Multilevel_Split_Level_Foyer',
'One_and_Half_Story_PUD_All_Ages'
and the exercise generalizes these values into the following:
'One', 'Two', 'Split', 'Duplex', 'PUD'
(by splitting each value on underscores and keeping the first word).
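For reference, here is a minimal sketch of that transformation in pandas, assuming the data sits in a DataFrame named df (the df name and the inline sample rows are my own illustration, not from the exercise):

```python
import pandas as pd

# Illustrative subset of the MSSubClass values listed above.
df = pd.DataFrame({
    "MSSubClass": [
        "One_Story_1946_and_Newer_All_Styles",
        "Two_Story_PUD_1946_and_Newer",
        "Split_Foyer",
        "Duplex_All_Styles_and_Ages",
        "PUD_Multilevel_Split_Level_Foyer",
    ]
})

# Split on the first underscore and keep the leading word
# as the coarser category.
df["MSClass"] = df["MSSubClass"].str.split("_", n=1).str[0]

print(df["MSClass"].unique())  # ['One' 'Two' 'Split' 'Duplex' 'PUD']
```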
Is this kind of generalization needed if I only use random forests as my algorithm to make predictions?
It seems this kind of generalization loses some amount of information from the data. Also, random forests are good at handling over-fitting.
Topic: generalization, feature-engineering, random-forest
Category: Data Science