Using feature importance to detect latent variables and grouping

Is it possible to use feature importance from Random Forests (e.g. based on gini impurity) or other models to determine which features I can use to group the rows of my dataset homogeneously?

For example, let's say I have a dataset with N rows and p columns (one of the columns is used as the label in my training task). I train the model and get a ranking of feature importance. Only 5 features are more important than a random feature "artificially" added to my dataset (this is used to check whether some features add no "predictive power" to my model). Can I assume that using these 5 features as dimensions to group my rows gives me "homogeneous groups w.r.t. my output"?
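A minimal sketch of the "random probe" idea I'm describing, using scikit-learn on synthetic data (the column names and sizes are just placeholders):

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for the real dataset (N rows, p feature columns).
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                               n_redundant=0, random_state=0)
    X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

    # Add a purely random column as a noise baseline ("random probe").
    rng = np.random.default_rng(0)
    X["random_probe"] = rng.normal(size=len(X))

    model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)

    # Keep only the features ranked above the random probe.
    selected = importances[importances > importances["random_probe"]]
    print(selected.sort_values(ascending=False))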

Apologies for the confusing description of my problem. Rather than a strict answer, I'd like to be pointed in the right direction to do some research on this. For now this is based solely on my intuition, and I could be totally off track.

Is there any area of ML/Stats/etc. where this is done?

For example, Latent Class Analysis seems to be something similar to this; am I wrong?

Example:

I want to predict the probability that a person drinks more than 0.5 liters of beer in a month. I have information about each individual in the population, such as: age, sex, geographical area (US state), height, weight, etc.

My output is

  • 0 = avg(liters of beer per month) < 0.5
  • 1 = avg(liters of beer per month) >= 0.5

I train a model (let's say a random forest) and the feature importance ranking says that "age", "weight" and "height" are useful in predicting beer consumption, while the other features do not add much to the predictive performance (or at least are ranked as less important).

My assumption is that if I group the population using the three "important" features, these groups will share similar characteristics, hence there is a latent variable that groups these individuals together (see the sketch after the example groups below).

Groups:

1) Age: [21, 25], weight(Kg): [70, 75], height(cm): [170, 175]

2) Age: [21, 25], weight(Kg): [70, 75], height(cm): [175, 180]

3) etc.
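A small sketch of the kind of grouping I have in mind, using pandas on made-up data (the bin edges are only for illustration):

    import pandas as pd

    # Made-up individuals; "drinks" is the 0/1 target defined above.
    df = pd.DataFrame({
        "age":    [22, 24, 23, 31, 29, 45],
        "weight": [72, 74, 71, 80, 78, 90],
        "height": [171, 176, 173, 182, 168, 175],
        "drinks": [1, 1, 1, 0, 0, 0],
    })

    # Bin each important feature into intervals like the groups listed above.
    bins = pd.DataFrame({
        "age_bin":    pd.cut(df["age"],    bins=[20, 25, 30, 50]),
        "weight_bin": pd.cut(df["weight"], bins=[60, 70, 75, 100]),
        "height_bin": pd.cut(df["height"], bins=[160, 170, 175, 180, 190]),
    })

    # If the groups are homogeneous w.r.t. the output, the mean of "drinks"
    # within each group should be close to 0 or 1.
    summary = df.groupby([bins["age_bin"], bins["weight_bin"], bins["height_bin"]],
                         observed=True)["drinks"].agg(["mean", "size"])
    print(summary)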

Topic predictor-importance random-forest feature-selection clustering

Category Data Science


You might be able to segment your items into groups that share a beer-consumption trait by focusing on those variables. It's a good idea, especially if you started with a lot of columns. Of course, expect more meaningful groups if your model is simple rather than complex: a small decision tree rather than a random forest, for example.
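For instance, a minimal sketch with scikit-learn on synthetic stand-in data: the leaves of a shallow tree are directly readable as group definitions.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Synthetic stand-in data; a shallow tree keeps the groups readable.
    X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                               random_state=0)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Each root-to-leaf path is a human-readable group definition.
    print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))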

Why don't you do some descriptive statistics on these columns, first splitting your data by the target variable? Visualizing the different patterns can be quite inspiring.
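A quick sketch with pandas (made-up data and column names, matching your beer example):

    import matplotlib.pyplot as plt
    import pandas as pd

    # Made-up data with the important columns and the 0/1 target.
    df = pd.DataFrame({
        "age":    [22, 24, 23, 31, 29, 45],
        "weight": [72, 74, 71, 80, 78, 90],
        "height": [171, 176, 173, 182, 168, 175],
        "drinks": [1, 1, 1, 0, 0, 0],
    })

    # Summary statistics of each important column, split by the target value.
    print(df.groupby("drinks")[["age", "weight", "height"]].describe().T)

    # Quick visual comparison: one boxplot per column, split by target.
    df.boxplot(column=["age", "weight", "height"], by="drinks")
    plt.show()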

Also, test your intuition: cluster your data based solely on these columns and check whether each cluster is significantly biased towards one of the target values.
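Something along these lines, for example (made-up data; k-means is just one possible choice of clustering algorithm):

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Made-up data; cluster on the "important" columns only.
    df = pd.DataFrame({
        "age":    [22, 24, 23, 21, 31, 29, 45, 50],
        "weight": [72, 74, 71, 73, 80, 78, 90, 85],
        "height": [171, 176, 173, 172, 182, 168, 175, 180],
        "drinks": [1, 1, 1, 1, 0, 0, 0, 0],
    })

    features = StandardScaler().fit_transform(df[["age", "weight", "height"]])
    df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # Share of positives per cluster: values far from the overall rate
    # suggest the clusters carry information about the target.
    print(df.groupby("cluster")["drinks"].agg(["mean", "size"]))
    print("overall rate:", df["drinks"].mean())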

Something you can also try is using the notion of proximity between data points that you get from a random forest model in your clustering algorithm.
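One way to sketch this with scikit-learn (synthetic data; the proximity is computed here from the leaf indices, and the metric="precomputed" argument assumes scikit-learn >= 1.2):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data.
    X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                               random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Leaf index of every sample in every tree: shape (n_samples, n_trees).
    leaves = forest.apply(X)

    # Proximity = fraction of trees in which two samples land in the same leaf.
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

    # 1 - proximity works as a precomputed distance for clustering.
    clusters = AgglomerativeClustering(
        n_clusters=2, metric="precomputed", linkage="average"
    ).fit_predict(1.0 - proximity)
    print(np.bincount(clusters))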
