Why Do a Set of 3 Clusters Across 1 Dimension and a Set of 3 Clusters Across 2 Dimensions Form 9 Apparent Clusters in 3 Dimensions?

I am sorry if this is a well-known phenomenon, but I can't quite wrap my head around it. I have a related question, "How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?", which has good answers on feature selection and cluster metrics, but I think this phenomenon deserves special attention.

I have simulated 3 clusters along 1 dimension, and then simulated 3 clusters along 2 dimensions, and then combined them into a dataset with all 3 dimensions. My hope was that clustering algorithms would identify the 3 clusters along dimension 1 and the 3 clusters along dimensions 2 and 3, for a total of 6 clusters. The clustering algorithms do not identify these 6 clusters.

When I visualize the simulated data in 3 dimensions, there are 9 apparent clusters instead of the 6 that I simulated. Can someone explain why two sets of independent, lower-dimensional clusters form apparent clusters in a higher-dimensional space? I am concerned about the impact of this phenomenon when developing cluster models with real data, if independent clusters along subsets of dimensions form apparent but presumably misleading clusters in higher dimensions.

UPDATE: lpounng has described how actual clusters can result in apparent clusters. I am adding a bounty in the hopes that someone can describe this problem more canonically and perhaps describe a solution. Consider another example. I have simulated 2 clusters: persons with high blood sugar and high blood pressure, and persons with normal blood sugar and normal blood pressure. I have simulated 3 other unrelated clusters: persons with no injuries, a medium number of injuries, and a high number of injuries.

There are 5 actual clusters and 6 apparent clusters. KMeans finds the 6 apparent clusters correctly. The problem is that the KMeans clusters misleadingly imply that blood sugar, blood pressure, and injury cluster together. Is there a solution to this problem? Brian Spiering recommended the https://github.com/danilkolikov/fsfc library but I can't get the algorithms to distinguish the actual clusters from the apparent clusters.


import numpy as np

b_hh = np.random.normal(size = (2000, 2)) + [10, 150] # High blood sugar and high blood pressure cluster.
b_ll = np.random.normal(size = (4000, 2)) + [ 2, 100] # Normal blood sugar and normal blood pressure cluster.

b = np.concatenate((b_hh, b_ll), axis = 0)


i_h = np.random.normal(size = ( 100, 1)) + 30 # High injury cluster.
i_m = np.random.normal(size = ( 900, 1)) + 15 # Medium injury cluster.
i_l = np.random.normal(size = (5000, 1)) +  0 # No injury cluster.

i = np.concatenate((i_h, i_m, i_l), axis = 0)
np.random.shuffle(i) # Shuffle rows so that injuries are unrelated to blood sugar and blood pressure.


X = np.concatenate((b, i), axis = 1)



import numpy as np, matplotlib.pyplot as plt, plotly.graph_objects as go, plotly.io as pio
pio.renderers.default = 'browser'

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

Function to plot in 3 dimensions with plotly:

def c_3D(algorithm, data, o = 0.25, x_name = 'X Axis', y_name = 'Y Axis', z_name = 'Z Axis'):
    m = algorithm  # Array of cluster labels (or colours), one per row of data.

    traces = []

    for i in np.unique(m):
        trace = go.Scatter3d(
                x = data[m == i, 0], y = data[m == i, 1], z = data[m == i, 2], 
                name = 'Cluster ' + str(i), 
                mode = 'markers', marker = dict(size = 5, opacity = o, color = i))
        traces.append(trace)  # Collect one trace per cluster so it is actually plotted.

    layout = go.Layout(autosize = False, width = 1000, height = 1000, margin = dict(l = 0, r = 0, b = 0, t = 0),
             scene = dict(xaxis_title = x_name, yaxis_title = y_name, zaxis_title = z_name))

    fig = go.Figure(data = traces, layout = layout)
    fig.show()  # Render the figure with the default renderer set above.


Simulate data:


# Simulate 3 clusters along 1 dimension.

X_1, Y_1 = make_blobs(n_samples = 5000, n_features = 1, centers = 3, cluster_std = 0.3)

# Simulate 3 clusters along 2 dimensions.

X_2, Y_2 = make_blobs(n_samples = 5000, n_features = 2, centers = 3, cluster_std = 0.3)

# Combine dimensions.

X = np.concatenate((X_1, X_2), axis = 1)


Visualize the 3 clusters along dimension 1:

plt.scatter(X[:, 0], X[:, 0])

Visualize the 3 clusters along dimensions 2 and 3:

plt.scatter(X[:, 1], X[:, 2])

Visualize the clusters in 3 dimensions:

def SetColor(c):
    if c == 0: return 'black'

c_3D(np.array(list(map(SetColor, np.zeros(X.shape[0])))), X)


It appears that apparent clusters can form geometrically in a higher-dimensional space whenever it combines dimensions that contain clusters in lower-dimensional spaces. These apparent clusters may not reflect the actual clustering processes.

I have been able to get the results I expect with the idea that dimensions with actual clusters should correlate with each other. I apply clustering algorithms to those subsets of the dimensions that correlate with each other.

Simulate blood sugar, blood pressure, and injury clusters:


import numpy as np

b_hh = np.random.normal(size = (2000, 2)) + [10, 150] # High blood sugar and high blood pressure cluster.
b_ll = np.random.normal(size = (4000, 2)) + [ 2, 100] # Normal blood sugar and normal blood pressure cluster.

b = np.concatenate((b_hh, b_ll), axis = 0)


i_h = np.random.normal(size = ( 100, 1)) + 30 # High injury cluster.
i_m = np.random.normal(size = ( 900, 1)) + 15 # Medium injury cluster.
i_l = np.random.normal(size = (5000, 1)) +  0 # No injury cluster.

i = np.concatenate((i_h, i_m, i_l), axis = 0)
np.random.shuffle(i) # Shuffle rows so that injuries are unrelated to blood sugar and blood pressure.


X = np.concatenate((b, i), axis = 1)

Compute correlation coefficients between dimensions:

from scipy.stats import pearsonr, spearmanr

print(pearsonr(X[:, 0], X[:, 1]))
print(pearsonr(X[:, 0], X[:, 2]))
print(pearsonr(X[:, 1], X[:, 2]))

print(spearmanr(X[:, 0], X[:, 1]))
print(spearmanr(X[:, 0], X[:, 2]))
print(spearmanr(X[:, 1], X[:, 2]))

I imagine this solution may not work for data where linear correlations do not make sense, such as data that favour density-based clustering algorithms.
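The approach described above (cluster each group of mutually correlated dimensions on its own) could be sketched roughly as follows. This is a minimal illustration, not a general method: the column grouping is hard-coded after inspecting the correlation matrix, and the fixed seed, the shuffling of the injury column, the `n_init` value, and the cluster counts are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)  # Fixed seed, an assumption for reproducibility.

# Re-create the simulated data (same centres and sizes as above).
b = np.concatenate((rng.normal(size=(2000, 2)) + [10, 150],
                    rng.normal(size=(4000, 2)) + [2, 100]), axis=0)
i = np.concatenate((rng.normal(size=(100, 1)) + 30,
                    rng.normal(size=(900, 1)) + 15,
                    rng.normal(size=(5000, 1)) + 0), axis=0)
rng.shuffle(i)  # Assumption: injuries are unrelated to the blood measurements.
X = np.concatenate((b, i), axis=1)

# Pairwise Pearson correlations between the 3 columns.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Columns 0 and 1 correlate strongly with each other but not with column 2,
# so each subset of columns is clustered separately.
blood_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [0, 1]])
injury_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, [2]])

print(np.bincount(blood_labels))   # Sizes of the 2 blood clusters.
print(np.bincount(injury_labels))  # Sizes of the 3 injury clusters.
```

Each row then carries two independent labels (one per subset) instead of a single label from the 6 apparent clusters.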

I have simulated 3 clusters along 1 dimension, and then simulated 3 clusters along 2 dimensions, and then combined them into a dataset... When I visualize the simulated data in 3 dimensions, there are 9 apparent clusters... why two sets of independent, lower-dimensional clusters form apparent clusters in a higher-dimensional space?

Short answer: because 3x3=9, not 6. You created 9 clusters.

To illustrate, let's simplify it: simulate two 1-d arrays, each with 3 clusters.

import numpy as np
from sklearn.datasets import make_blobs

# Simulate 3 clusters along 1 dimension, twice.

X_1, Y_1 = make_blobs(n_samples = 5000, n_features = 1, centers = 3, cluster_std = 0.3)
X_2, Y_2 = make_blobs(n_samples = 5000, n_features = 1, centers = 3, cluster_std = 0.3)

"Combine" (cross) them.

# Combine dimensions.
X = np.concatenate((X_1, X_2), axis = 1)

Plot on 2D:

import pandas as pd
df = pd.DataFrame(X)
df.plot.scatter(x=0, y=1)

So a clustering algorithm should correctly detect 9 clusters. If it detected only 6, that would be wrong (although this may depend on the context).


In response to OP's additional comment:

  1. I was hoping to understand why the 6 clusters become 9 apparent clusters in 3D space. For example, am I making a conceptual error in how I simulated the 6 clusters? Is this phenomenon likely to happen with real data?

The 9 clusters emerge NOT when you create the 6 clusters on separate dimensions, but when you cross those dimensions together. For example, say I have 2 features, "age" and "wealth", each with 3 clusters:

Age: Young, Middle-age, Old

Wealth: Poor, Well-off, Rich

Now if I cross the 2 features (dimensions) together, how many clusters are there? The answer is 9 (3 x 3):

(Young, Poor), (Young, Well-off), (Young, Rich)

(Middle-age, Poor), (Middle-age, Well-off), (Middle-age, Rich)

(Old, Poor), (Old, Well-off), (Old, Rich)
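The enumeration above is a Cartesian product of the two sets of groups. A small sketch using Python's standard `itertools.product` (the variable names are only illustrative):

```python
from itertools import product

# Group labels for each feature, taken from the example above.
age_groups = ["Young", "Middle-age", "Old"]
wealth_groups = ["Poor", "Well-off", "Rich"]

# Every pairing of one age group with one wealth group is a distinct
# region of the combined (age, wealth) space.
combined = list(product(age_groups, wealth_groups))

for pair in combined:
    print(pair)

print(len(combined))  # 9, i.e. 3 * 3
```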

  1. How will I recognize it? What about 4D+ space?

You don't need any extra effort to recognize it: it is there and always will be, in 4D+ space as well. This is a law of nature that holds for as long as the logic of this universe holds.

  1. Can the apparent clusters be reassembled into the actual clusters? Perhaps by combining centroids that have a common value along a certain axis?

Of course you can, in general by dropping some dimensions and/or applying a transformation or projection. For example, if I merge "Rich" and "Well-off" into 1 group, there would be 2x3=6 clusters left.

We call these techniques dimension reduction and feature selection.
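As a rough illustration of the idea raised in the question (combining centroids that share a value along one axis), the sketch below groups centroids whose coordinates along a chosen axis lie within a tolerance of each other. The centroid values, the noise level, the helper `group_along_axis`, and the tolerance `tol` are all hypothetical choices for this example; it is one possible way to do this, not an established algorithm.

```python
import numpy as np
from itertools import product

# Hypothetical centroids, such as a clustering algorithm might return for
# 3 x 3 apparent clusters, with a little estimation noise added.
rng = np.random.default_rng(0)
true_x = [0.0, 5.0, 10.0]
true_y = [-4.0, 0.0, 4.0]
centroids = np.array(list(product(true_x, true_y))) + rng.normal(scale=0.05, size=(9, 2))

def group_along_axis(centroids, axis, tol=0.5):
    """Give the same label to centroids whose values along `axis` differ by at most `tol`."""
    values = centroids[:, axis]
    order = np.argsort(values)
    # A new group starts wherever the gap between consecutive sorted values exceeds tol.
    new_group = np.diff(values[order]) > tol
    labels_sorted = np.concatenate(([0], np.cumsum(new_group)))
    labels = np.empty(len(values), dtype=int)
    labels[order] = labels_sorted
    return labels

x_groups = group_along_axis(centroids, axis=0)  # 3 groups along the first axis
y_groups = group_along_axis(centroids, axis=1)  # 3 groups along the second axis
print(x_groups)
print(y_groups)
```

The 9 apparent centroids collapse to 3 groups along each axis, which recovers the 3 + 3 underlying clusters in this toy setting.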

  1. Will it always be the case that A clusters along a subset of dimensions and B clusters along another subset of dimensions will become A*B clusters in a higher-dimensional space?

This is something for you to think about, as an exercise.
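One way to explore the exercise numerically is to count how many distinct combinations of group memberships actually occur. The sketch below is a toy setup under stated assumptions (random integer group labels, a fixed seed, and a hypothetical helper `count_combinations`): it compares a second grouping assigned independently of the first with one that is fully determined by the first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

a = rng.integers(0, 3, size=n)        # Group membership on the first subset of dimensions.
b_indep = rng.integers(0, 3, size=n)  # Second grouping, assigned independently of a.
b_dep = a.copy()                      # Second grouping, fully determined by a.

def count_combinations(u, v):
    """Number of distinct (u, v) label pairs that occur in the data."""
    return len(np.unique(np.column_stack((u, v)), axis=0))

print(count_combinations(a, b_indep))  # 9: every combination is populated.
print(count_combinations(a, b_dep))    # 3: only matching pairs occur.
```

In this toy case the A*B count appears only when the two groupings are independent and every combination is populated.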

On the other hand, a relevant keyword: Curse of dimensionality

