Clustering based on features of varied importance

Question

Gilad Felsen

May 21, 2022 05:06

Suppose I have a dataset with the features {HairColor, EyeColor, EducationLevel, Income}. I would like to cluster it into smaller datasets whose members can be expected to behave similarly. The difficulty is that EducationLevel and Income are clearly much more important than HairColor and EyeColor, but I do not know how to quantify that importance for the purpose of clustering.

In the example below, I would want it to be clear that Row 1 is more similar to Row 3 than to Row 2.

ID	EyeColor	HairColor	EducationLevel	Income
1	1	1	1	1
2	1	1	2	2
3	2	2	1	1

Topic clustering

Category Data Science

Nikos M. · Accepted Answer · April 19, 2021 16:08

One approach is dimensionality sampling: drop some features and inspect the dataset that results.

If some objective importance metric (e.g. correlation or PCA) identifies the two features as more important, you can use it directly.

Otherwise, you can drop features iteratively and test the resulting dataset each time.

Dropping features does not necessarily mean that useful information is lost; the clustering may even fit better if the dropped features are simply noise.
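As a minimal sketch of the drop-and-test loop, the snippet below tries every subset of the four features on the three rows from the question and records which row ends up nearest to Row 1. Treating the integer codes as numeric values and using "nearest row under Euclidean distance" as the test are assumptions made for illustration; on a real dataset the test would more likely be a clustering quality measure or a downstream check.

```python
import math
from itertools import combinations

features = ("EyeColor", "HairColor", "EducationLevel", "Income")

# Rows from the question, in the feature order above.
rows = {
    1: (1, 1, 1, 1),
    2: (1, 1, 2, 2),
    3: (2, 2, 1, 1),
}

def distance(a, b, keep):
    """Euclidean distance using only the feature indices in `keep`."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in keep))

def nearest_to_row1(keep):
    """Return 2 or 3 for the row closest to Row 1, or None on a tie."""
    d2 = distance(rows[1], rows[2], keep)
    d3 = distance(rows[1], rows[3], keep)
    if math.isclose(d2, d3):
        return None
    return 2 if d2 < d3 else 3

# Try every non-empty subset of features and record the outcome.
results = {}
for size in range(1, len(features) + 1):
    for keep in combinations(range(len(features)), size):
        names = tuple(features[i] for i in keep)
        results[names] = nearest_to_row1(keep)

for names, nearest in results.items():
    print(names, "->", nearest)
```

With all four features kept, Rows 2 and 3 are equally far from Row 1, which is the problem described in the question. Keeping only EducationLevel and Income makes Row 3 the nearest.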

Another related approach is to merge features in a way that preserves a hierarchy of importance.

For example, create a combined feature from two features $x_1$ and $x_2$ as $x_{12} = x_1^n + x_2$ with an exponent $n > 1$, so that (for values greater than 1) changes in $x_1$ affect the combined feature more than changes in $x_2$.
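A small sketch of the combined feature $x_{12} = x_1^n + x_2$ may help show the effect. The function name `combine`, the choice $n = 2$, and the sample values are assumptions for illustration only.

```python
def combine(x1, x2, n=2):
    """Combined feature x12 = x1**n + x2, where x1 is the more important feature."""
    return x1 ** n + x2

base = combine(2, 2)      # 2**2 + 2
bump_x1 = combine(3, 2)   # raise the important feature by one step
bump_x2 = combine(2, 3)   # raise the less important feature by one step

print("effect of x1 step:", bump_x1 - base)
print("effect of x2 step:", bump_x2 - base)
```

Here a one-step change in $x_1$ moves the combined value by 5, while a one-step change in $x_2$ moves it by 1. Note that merging can map different pairs to the same value (with $n = 2$, both $(1, 4)$ and $(2, 1)$ give 5), so some information is collapsed.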

Jayaram Iyer · Accepted Answer · April 19, 2021 13:55

If EducationLevel and Income are more important than the other features, you can multiply those features by a factor greater than 1. In a distance-based clustering algorithm, differences in those features then count for more than differences in the rest.
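To illustrate the weighting idea on the three rows from the question, the sketch below compares plain and weighted Euclidean distances. The weight of 3 for EducationLevel and Income is an arbitrary, hypothetical choice, and treating the integer codes as numeric values is also an assumption.

```python
import math

# Rows from the question: (EyeColor, HairColor, EducationLevel, Income)
rows = {
    1: (1, 1, 1, 1),
    2: (1, 1, 2, 2),
    3: (2, 2, 1, 1),
}

def weighted_distance(a, b, w):
    """Euclidean distance after multiplying each feature by its weight."""
    return math.sqrt(sum((wi * (ai - bi)) ** 2 for ai, bi, wi in zip(a, b, w)))

unweighted = (1.0, 1.0, 1.0, 1.0)
# Hypothetical weights: EducationLevel and Income count three times as much.
weights = (1.0, 1.0, 3.0, 3.0)

d12_plain = weighted_distance(rows[1], rows[2], unweighted)
d13_plain = weighted_distance(rows[1], rows[3], unweighted)
d12_weighted = weighted_distance(rows[1], rows[2], weights)
d13_weighted = weighted_distance(rows[1], rows[3], weights)

print("plain:   ", d12_plain, d13_plain)
print("weighted:", d12_weighted, d13_weighted)
```

Without weights the two distances are identical, so Row 1 is no closer to Row 3 than to Row 2. With the weights applied, Row 3 becomes clearly nearer to Row 1. The same scaled columns could be passed to any distance-based clustering algorithm.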

For larger datasets, where the differences in importance between features are not so obvious, you will need to rely on your judgment, guided by the end objective of the clustering. If you have a target variable in mind, choose or boost the features that are highly correlated with it.
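One possible way to turn correlation with a target into feature weights is sketched below. The data and the target values are made up purely for illustration, the helper `pearson` is written by hand, and using the absolute Pearson correlation as a weight is only one option among several (it is a crude measure for categorical codes).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up toy data: two features and a hypothetical target variable.
data = {
    "HairColor": [1, 2, 1, 2, 1, 2],
    "Income":    [1, 1, 2, 2, 3, 3],
}
target = [10, 11, 20, 19, 30, 31]

# Use the absolute correlation with the target as each feature's weight.
weights = {name: abs(pearson(column, target)) for name, column in data.items()}

for name, weight in weights.items():
    print(name, round(weight, 3))
```

In this toy data Income tracks the target closely and receives a weight near 1, while HairColor receives a weight near 0. Each column could then be multiplied by its weight before clustering.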
