Using feature importance to detect latent variables and grouping
Is it possible to use feature importance from Random Forests (e.g. based on gini impurity) or other models to determine which features I can use to group the rows of my dataset homogeneously?
For example, let's say I have a dataset with N rows and p columns (one of the columns is used as the label in my training task). I train the model and get a ranking of the importance of my features. Only 5 features are more important than a random feature "artificially" added to my dataset (this is used to check whether some features add no "predictive power" to my model). Can I assume that using these 5 features as dimensions to group my rows will produce groups that are homogeneous w.r.t. my output?
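The random-probe idea above can be sketched as follows. This is a minimal, self-contained example (not my real data): a synthetic dataset where the label depends on the first two features, a pure-noise "probe" column appended at the end, and a Random Forest whose Gini-based importances are compared against the probe's importance.

```python
# Sketch of the "random probe" feature-selection idea, assuming scikit-learn.
# The data here is synthetic, purely for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                    # 5 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # label depends only on features 0 and 1
X = np.column_stack([X, rng.normal(size=n)])   # append a pure-noise probe column

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
probe_importance = rf.feature_importances_[-1]

# keep only features whose importance beats the random probe
keep = [i for i in range(5) if rf.feature_importances_[i] > probe_importance]
print(keep)  # should contain 0 and 1; noise features may appear by chance
```

Note that impurity-based importances are noisy, so noise features can occasionally edge past the probe by chance; a more robust variant of this idea is what the Boruta algorithm (or permutation importance) formalizes.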
Apologies for the confusing description of my problem. Rather than a strict answer I'd like to be pointed in the right direction to do some research into this. For now this is solely based on my intuition and I could be totally off track.
Is there any area of ML/Stats/etc. where this is done?
For example, Latent Class Analysis seems something similar to this, am I wrong?
Example:
I want to predict the probability of a person drinking more than 0.5 liters of beer in a month. I have information about each individual in the population such as: age, sex, geographical area (US state), height, weight, etc.
My output is
- 0 = avg(liters of beer per month) < 0.5
- 1 = avg(liters of beer per month) >= 0.5
I train a model (let's say a Random Forest) and the feature importance ranking says that "age", "weight" and "height" are useful in predicting the consumption of beer, while the other features do not add much to the predictive performance (or at least are ranked as less important).
My assumption is that if I group the population using the three "important" features, these groups will share some similar characteristics, hence there's a latent variable which groups these individuals together.
Groups:
1) Age: [21, 25], weight(Kg): [70, 75], height(cm): [170, 175]
2) Age: [21, 25], weight(Kg): [70, 75], height(cm): [175, 180]
3) etc.
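The grouping above can be sketched like this: bin each "important" feature into fixed-width intervals (5 years / 5 kg / 5 cm, as in the groups listed), use the tuple of bin indices as a group key, and then measure how homogeneous the binary output is within each group. All data here is synthetic and the label rule is made up for illustration.

```python
# Sketch: group rows by binned "important" features and check within-group
# homogeneity of the output. Data and label rule are entirely synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
age = rng.integers(18, 60, n)
weight = rng.normal(75, 10, n)
height = rng.normal(172, 8, n)
# toy label: older and heavier people "drink more" (hypothetical rule)
y = ((age > 30) & (weight > 75)).astype(int)

# group key = (age bin, weight bin, height bin), fixed 5-unit bins
key = np.stack([age // 5,
                (weight // 5).astype(int),
                (height // 5).astype(int)], axis=1)

groups = {}
for k, label in zip(map(tuple, key), y):
    groups.setdefault(k, []).append(label)

# within-group purity = fraction of the majority class in each group;
# values near 1 mean the groups are homogeneous w.r.t. the output
purities = [max(np.mean(v), 1 - np.mean(v)) for v in groups.values()]
print(round(float(np.mean(purities)), 3))
```

If the mean purity is high, the binned features do separate the population into output-homogeneous groups, which is essentially what you are after; decision-tree leaves already do this implicitly, and Latent Class Analysis does something related with an explicit probabilistic latent variable.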
Topic predictor-importance random-forest feature-selection clustering
Category Data Science