How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?

I have been exploring clustering algorithms (K-Means, K-Medoids, Ward Agglomerative, Gaussian Mixture Modeling, BIRCH, DBSCAN, OPTICS, Common Nearest-Neighbour Clustering) with multidimensional data. I believe that the clusters in my data occur along different subsets of the features rather than along all features, and that this hurts the performance of the clustering algorithms.

To illustrate, below is Python code for a simulated dataset:

## Simulate a dataset.

import numpy as np, matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(20220509)

# Simulate three clusters along 1 dimension.

X_1_1 = np.random.normal(size = (1000, 1)) * 0.10 + 1
X_1_2 = np.random.normal(size = (2000, 1)) * 0.10 + 2
X_1_3 = np.random.normal(size = (3000, 1)) * 0.10 + 3

# Simulate three clusters along 2 dimensions.

X_2_1 = np.random.normal(size = (1000, 2)) * 0.10 + [4, 5]
X_2_2 = np.random.normal(size = (2000, 2)) * 0.10 + [6, 7]
X_2_3 = np.random.normal(size = (3000, 2)) * 0.10 + [8, 9]

# Combine into a single dataset.

X_1 = np.concatenate((X_1_1, X_1_2, X_1_3), axis = 0)
X_2 = np.concatenate((X_2_1, X_2_2, X_2_3), axis = 0)

X = np.concatenate((X_1, X_2), axis = 1)

print(X.shape)

Visualize the clusters along dimension 1:

plt.scatter(X[:, 0], X[:, 0])

Visualize the clusters along dimensions 2 and 3:

plt.scatter(X[:, 1], X[:, 2])

K-Means with All 3 Dimensions

K = KMeans(n_clusters = 6, algorithm = 'full', random_state = 20220509).fit_predict(X) + 1

Visualize the K-Means clusters along dimension 1:

plt.scatter(X[:, 0], X[:, 0], c = K)

Visualize the K-Means clusters along dimensions 2 and 3:

plt.scatter(X[:, 1], X[:, 2], c = K)

The K-Means clusters developed with all 3 dimensions are incorrect.

K-Means with Dimension 1 Alone

K_1 = KMeans(n_clusters = 3, algorithm = 'full', random_state = 20220509).fit_predict(X[:, 0].reshape(-1, 1)) + 1

Visualize the K-Means clusters along dimension 1:

plt.scatter(X[:, 0], X[:, 0], c = K_1)

The K-Means clusters developed with dimension 1 alone are correct.

K-Means with Dimensions 2 and 3 Alone

K_2 = KMeans(n_clusters = 3, algorithm = 'full', random_state = 20220509).fit_predict(X[:, [1, 2]]) + 1

Visualize the K-Means clusters along dimensions 2 and 3:

plt.scatter(X[:, 1], X[:, 2], c = K_2)

The K-Means clusters developed with dimensions 2 and 3 alone are correct.

Clustering Between Dimensions

Although I did not intend for dimension 1 to form clusters with dimensions 2 or 3, clusters between those dimensions appear to emerge. Perhaps this is part of why the K-Means algorithm struggles when developed with all 3 dimensions.

Visualize the clusters between dimension 1 and 2:

plt.scatter(X[:, 0], X[:, 1])

Visualize the clusters between dimension 1 and 3:

plt.scatter(X[:, 0], X[:, 2])
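
Visualize all 3 dimensions together (matplotlib's built-in 3D projection makes the combined structure easier to see):

# Visualize all 3 dimensions together.

fig = plt.figure()
ax = fig.add_subplot(projection = '3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Dimension 3')
plt.show()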

Questions

  1. Am I making a conceptual error somewhere? If so, please describe or point me to a resource. If not:

  2. If I did not intend for dimension 1 to form clusters with dimensions 2 or 3, why do clusters between those dimensions emerge? Will this occur with higher-dimensional clusters? Is this why the K-Means algorithm struggles when developed with all 3 dimensions?

  3. How can I select the different subsets of the features where different clusters occur (3 clusters along dimension 1 alone, and 3 clusters along dimensions 2 and 3 alone, in the example above)? My hope is that developing clusters separately with the right subsets of features will be more robust than developing clusters with all features.

Thank you very much!

UPDATE:

Thank you for the very helpful answers for feature selection and cluster metrics. I have asked a more specific question: Why Do a Set of 3 Clusters Across 1 Dimension and a Set of 3 Clusters Across 2 Dimensions Form 9 Apparent Clusters in 3 Dimensions?



The essential thing you are doing wrong is the number of clusters you selected.

You chose 6 clusters, but you defined 3 in your dataset.

So by simply changing this:

K = KMeans(n_clusters = 6, algorithm = 'full', random_state = 20220509).fit_predict(X) + 1

To:

K = KMeans(n_clusters = 3, algorithm = 'full', random_state = 20220509).fit_predict(X) + 1

You get:

plt.scatter(X[:, 1], X[:, 2], c = K)


and this:

plt.scatter(X[:, 0], X[:, 0], c = K)


How can I select the different subsets of the features where different clusters occur (3 clusters along dimension 1 alone, and 3 clusters along dimensions 2 and 3 alone, in the example above)? My hope is that developing clusters separately with the right subsets of features will be more robust than developing clusters with all features.

You can try a kind of random feature removal: randomly shuffle one feature, fit your clusters, and then calculate a separation measure such as the silhouette to see which feature gives the largest decrease (the larger the decrease, the more important that feature is for forming the clusters). Repeat this step n times for every feature.

This would be analogous to permutation importance.
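
A minimal sketch of that idea (the choice of KMeans, 6 clusters, and 5 shuffles per feature is only for illustration):

## Permutation-style feature relevance for clustering (sketch).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def permutation_relevance(X, n_clusters = 6, n_repeats = 5, seed = 20220509):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters = n_clusters, random_state = seed).fit_predict(X)
    baseline = silhouette_score(X, labels, sample_size = 2000, random_state = seed)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])      # break the structure of feature j
            labels_perm = KMeans(n_clusters = n_clusters, random_state = seed).fit_predict(X_perm)
            score_perm = silhouette_score(X_perm, labels_perm, sample_size = 2000, random_state = seed)
            drops[j] += (baseline - score_perm) / n_repeats   # mean decrease in silhouette
    return baseline, drops

# The larger the decrease for a feature, the more that feature contributes to the cluster structure.
baseline, drops = permutation_relevance(X)
print(baseline, drops)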


The field of feature selection for clustering studies this topic.

One specific algorithm for feature selection for clustering is Spectral Feature Selection (SPEC), which estimates feature relevance from the spectrum of the similarity matrix: a feature that is consistent with the graph structure takes similar values on instances that are near each other in the graph. Such features should be more relevant, since they behave similarly within each group of similar samples, i.e. within each cluster.

"Feature Selection for Clustering: A Review" by Alelyani et al. goes into greater detail. There is an also an Feature Selection for Clustering Python package.


  1. Firstly, embrace the fact that clustering is unsupervised. Supervised learning algorithms can do feature selection because they know what they are looking for: features that are relevant to the target. This is not the case for clustering; it just tries to make sense of everything you feed in.

    Say I have a dataset of 100 people with 3 attributes: age, weight, and wealth. In my mind, I may expect clusters of 30 "older, heavier gentlemen" and 70 "young, healthy boys" (using only age and weight); but the clustering algorithm returns 1 "old, heavy, and extremely wealthy person" versus 99 "everyone else" (using all 3 features). Is the algorithm wrong? No! We are just looking at the same thing from different aspects.

    The key message is that there is no absolute right or wrong in unsupervised learning.

  2. If I did not intend for dimension 1 to form clusters with dimensions 2 or 3, why do clusters between those dimensions emerge?

    Something we did not intend can still exist. Try drawing a 3D plot of all 3 dimensions together, colored by cluster; the clustering algorithm may be seeing structure in 3D that is invisible in any single 2D view.

  3. There are metrics that evaluate the "goodness" of clusters, e.g. the Silhouette Score, the Rand Index, etc. So maybe we can formulate this as an optimization problem: optimize such a metric with respect to the feature selection. I am not very familiar with this, but it sounds like framing the task as a supervised problem; a brute-force sketch is below.
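
For point 3, with only three features the search can even be brute force: score every feature subset with the silhouette (the choice of KMeans and of 3 clusters per subset is only an assumption for this sketch):

## Brute-force search over feature subsets, scored by silhouette (sketch).

from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for r in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), r):
        X_sub = X[:, list(subset)]
        labels = KMeans(n_clusters = 3, random_state = 20220509).fit_predict(X_sub)
        score = silhouette_score(X_sub, labels, sample_size = 2000, random_state = 20220509)
        print(subset, round(score, 3))    # higher silhouette => better-separated clusters on that subset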
