Clustering based on features of varied importance

Suppose I have a dataset with the features {HairColor, EyeColor, EducationLevel, Income}. I would like to cluster it into smaller groups whose members can be expected to behave similarly. The difficulty is that EducationLevel and Income are clearly much more important than HairColor and EyeColor, but I do not know how to express that importance for the purposes of clustering.

In the example below, I would want Row 1 to come out as more similar to Row 3 than to Row 2.

ID  EyeColor  HairColor  EducationLevel  Income
1   1         1          1               1
2   1         1          2               2
3   2         2          1               1


One approach is dimensionality sampling (a form of feature selection): drop some features and examine the clustering that results from the reduced dataset.

If there is some objective importance metric (e.g. correlation or PCA) that ranks the two features as more important, you can use it directly to decide which features to keep.

Otherwise, you can iteratively drop features and test the resulting dataset.

Dropping features does not necessarily mean losing useful information; the clustering may even fit better if the dropped features are simply noise.
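As a rough sketch of the drop-and-test idea, the snippet below tries every subset of the four features from the question and keeps the subsets under which Row 1 ends up closer to Row 3 than to Row 2. The test criterion (`satisfies_expectation`) and the use of plain Euclidean distance on the integer codes are assumptions made for illustration; on real data you would more likely score each subset with a clustering quality measure.

```python
from itertools import combinations
import math

features = ("EyeColor", "HairColor", "EducationLevel", "Income")

# The three rows from the question, in the same column order as `features`.
rows = {
    1: (1, 1, 1, 1),
    2: (1, 1, 2, 2),
    3: (2, 2, 1, 1),
}

def distance(a, b, keep):
    """Euclidean distance using only the feature indices in `keep`."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in keep))

def satisfies_expectation(keep):
    """Hypothetical test: Row 1 must be closer to Row 3 than to Row 2."""
    return distance(rows[1], rows[3], keep) < distance(rows[1], rows[2], keep)

# Try every non-empty subset of features and record the ones that pass.
passing = []
for size in range(1, len(features) + 1):
    for keep in combinations(range(len(features)), size):
        if satisfies_expectation(keep):
            passing.append(tuple(features[i] for i in keep))

for subset in passing:
    print(subset)
```

Exhaustive search like this is only practical for a handful of features; with many features a greedy forward or backward search would be a more realistic choice.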

Another related approach is to merge features in a way that maintains some hierarchy of importance.

For example, create a combined feature from two features $x_1$ and $x_2$ as $x_{12} = x_1^n + x_2$ with $n > 1$, so that (for values of $x_1$ of at least 1) a change in $x_1$ moves the combined feature more than the same change in $x_2$.
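A small sketch of the combined feature, assuming $n = 2$ and integer-coded values starting at 1 (both choices are illustrative, and `combine` is a hypothetical helper name):

```python
def combine(x1, x2, n=2):
    """Merge two features as x1**n + x2, so that x1 carries more weight."""
    return x1 ** n + x2

base = combine(1, 1)      # 1**2 + 1
bump_x1 = combine(2, 1)   # raise the important feature by one unit
bump_x2 = combine(1, 2)   # raise the less important feature by one unit

print("effect of +1 on x1:", bump_x1 - base)
print("effect of +1 on x2:", bump_x2 - base)
```

One caveat with this kind of merge: different input pairs can map to the same combined value (with $n = 2$, the pairs (2, 1) and (1, 4) both give 5), so the combined feature does not always preserve the distinction between the original features.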


If education level and income are more important than the other features, you can multiply those features by a factor greater than 1. A distance-based clustering algorithm will then give more weight to differences in those features than to differences in the rest.
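To illustrate with the three rows from the question, here is a minimal sketch of scaling features before computing Euclidean distances. The factor 3.0 is an arbitrary value picked for the example, and treating the colour codes as numbers is a simplification; in practice the features would normally be put on a comparable scale (or properly encoded) before any weights are applied.

```python
import math

# Rows from the question: (EyeColor, HairColor, EducationLevel, Income)
rows = {
    1: (1, 1, 1, 1),
    2: (1, 1, 2, 2),
    3: (2, 2, 1, 1),
}

# Hypothetical weights: boost EducationLevel and Income by an arbitrary factor.
weights = (1.0, 1.0, 3.0, 3.0)
unweighted = (1.0, 1.0, 1.0, 1.0)

def weighted_distance(a, b, w):
    """Euclidean distance after multiplying each feature by its weight."""
    return math.sqrt(sum((wi * (ai - bi)) ** 2 for ai, bi, wi in zip(a, b, w)))

# Without weights, Row 1 is equally far from Row 2 and Row 3.
d12_plain = weighted_distance(rows[1], rows[2], unweighted)
d13_plain = weighted_distance(rows[1], rows[3], unweighted)

# With weights, Row 1 is closer to Row 3 than to Row 2.
d12 = weighted_distance(rows[1], rows[2], weights)
d13 = weighted_distance(rows[1], rows[3], weights)

print("unweighted:", d12_plain, d13_plain)
print("weighted:  ", d12, d13)
```

The same effect can be obtained by multiplying the columns by the weights once and then passing the scaled data to any clustering routine that uses Euclidean distance.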

For larger datasets, where the differences in importance between features are not so obvious, you will need to rely on your judgement, guided by the end objective of the clustering. If you have a target variable in mind, choose or boost the features that are highly correlated with the target.
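One possible way to turn "highly correlated with the target" into weights is to use the absolute Pearson correlation of each feature with the target. The sketch below does this on made-up toy data (the values in `data` and `target` are invented purely for illustration), and Pearson correlation on integer-coded categories is a crude measure, so treat it as a starting point rather than a recipe.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no information about the target
    return cov / (spread_x * spread_y)

# Made-up toy data, for illustration only.
data = {
    "HairColor": [1, 2, 1, 2, 1, 2],
    "Income":    [1, 1, 2, 2, 3, 3],
}
target = [10, 11, 20, 19, 30, 31]

# Use |correlation with the target| as the weight of each feature.
weights = {name: abs(pearson(column, target)) for name, column in data.items()}

for name, weight in weights.items():
    print(name, round(weight, 3))
```

Features with a weight near zero are candidates for dropping, while the others can be multiplied by their weight before clustering.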
