Categorical variables: create a risk class or include in the model?
I think this is a very basic question, so apologies for the wordy format. I am trying to get my head around it.
I am thinking about predicting earthquake damage to property in the US using a GLM. My predictors are, say: State (categorical), Owner wealth bracket (discrete), Bedrooms in property (discrete), and Earthquake resistance number (discrete). My response variable is Claim amount in a given year (continuous).
I may decide at the beginning to split my dataset by State and then fit a separate model on each subset using the remaining predictors, so that I get an individual model per state. Alternatively, I could split my dataset in an even more fine-grained way, by State and Owner wealth bracket, and then fit a model on the remaining predictors within each subset. Maybe I only want to start modelling after drilling down that far.
When I go to predict the Claim amount for a new property, I either assign it to its State only and report the expected Claim amount from that state's model, or I assign it to its State and Owner wealth bracket and report the expected value from that subset's model.
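To make the two options concrete, here is a minimal sketch in plain Python. The records are made-up toy data, and the helper `fit_stratum_means` is a hypothetical name of my own. As a stand-in for the GLM, each subset's "model" is just its mean claim, which is roughly what an intercept-only model fitted within that subset would predict; a real fit would also use the remaining predictors.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (state, wealth_bracket, claim_amount)
records = [
    ("CA", 1, 1200.0),
    ("CA", 1, 800.0),
    ("CA", 2, 3000.0),
    ("WA", 1, 500.0),
    ("WA", 2, 700.0),
    ("WA", 2, 900.0),
]


def fit_stratum_means(rows, key):
    """Group rows by key(row) and return the mean claim amount per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: mean(v) for k, v in groups.items()}


# Option 1: one "model" per State.
by_state = fit_stratum_means(records, key=lambda r: r[0])

# Option 2: one "model" per (State, Owner wealth bracket) pair.
by_state_wealth = fit_stratum_means(records, key=lambda r: (r[0], r[1]))

# Predicting for a new property: look up the subset it falls into.
new_property = ("CA", 2)
pred_coarse = by_state[new_property[0]]      # uses all CA records
pred_fine = by_state_wealth[new_property]    # uses only CA, bracket 2 records

print(pred_coarse, pred_fine)
```

In this toy data the finer prediction for the new property rests on a single record, while the coarser one averages three.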
What are the relative merits of each approach? So far, my main thought is that the more fine-grained classification reduces the amount of data available to fit each model. Is one way simply better?
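A quick sketch of the sample-size concern, again in plain Python. The figures (1,000 properties, 50 states, 5 wealth brackets, spread perfectly evenly) are assumptions chosen purely for illustration; real data would be far less balanced, so some subsets would be smaller still.

```python
from collections import Counter

# Assumed, purely illustrative sizes.
N_ROWS, N_STATES, N_BRACKETS = 1000, 50, 5

# Build a perfectly balanced synthetic dataset of (state, bracket) labels.
rows = [(i % N_STATES, (i // N_STATES) % N_BRACKETS) for i in range(N_ROWS)]

# Rows available to each per-State model.
per_state = Counter(state for state, _ in rows)

# Rows available to each per-(State, bracket) model.
per_cell = Counter(rows)

print("models by State:", len(per_state), "rows each:", min(per_state.values()))
print("models by State and bracket:", len(per_cell), "rows each:", min(per_cell.values()))
```

Even in this balanced case, adding the second split multiplies the number of models by the number of brackets and divides the rows behind each model by the same factor.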
Topic predictor-importance machine-learning-model categorical-data
Category Data Science