Why does changing the number of clusters change the plot in k-means?

This might be a dumb question, but I can't find the answer to it. I don't have a perfect mathematical understanding of k-means, so apologies if it is.

I'm just wondering why I see a different plot when I change the number of clusters in a k-means plot.

Here's the code that I'm using:

library(fpc)   # plotcluster() comes from the fpc package

set.seed(1)
k <- kmeans(data, centers = x)   # x = number of clusters
plotcluster(data, k$cluster)

I vary x to see what the plot looks like. Below are the results for x = 3 and x = 4. Apologies for the poor formatting.

I'm wondering why both plots look different if I'm only varying the number of clusters. Is it because the principal components that are being shown, dc1 and dc2, are different as you change x, so as to maximize the variance displayed?

Another quick question: can you determine the number of clusters by how "neat" the clustering looks in the plot? I know there are various methods of doing it; I'm just wondering if the plot is indicative of good/bad clustering as well.

Any help is appreciated!


I believe you are asking why the locations of the data points in the k-means visualization change when the input data is the same in both cases and only the value of k changes.

The clustering result effectively lives in a k-dimensional space, with k being the number of clusters: each data point gets assigned to one of k centroids, and its relationship to those centroids describes it.

However, a k-dimensional structure can't be graphed on a rectangle when $k \gt 2$, or in a cube when $k \gt 3$. To allow visualization, plotcluster projects the data into a 2-dimensional space, choosing the projection that best separates the clusters you pass in, and marks cluster membership with colours/symbols. Since the cluster assignments differ between the two runs, the projection differs, and the final visualized 2-D data points are also different.
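To make that concrete, here is a minimal sketch showing that the projected coordinates depend on the cluster labels, not just on the data. It assumes the iris measurements as stand-in data and uses discrcoord() from the fpc package, which (as far as I know) is the default projection behind plotcluster():

library(fpc)

set.seed(1)
X <- as.matrix(iris[, 1:4])        # stand-in for your data (assumption)
k3 <- kmeans(X, centers = 3)
k4 <- kmeans(X, centers = 4)

# Discriminant coordinates are computed from the cluster labels,
# so the same rows of X land at different projected positions
p3 <- discrcoord(X, k3$cluster)$proj
p4 <- discrcoord(X, k4$cluster)$proj

head(p3[, 1:2])   # dc1/dc2 when k = 3
head(p4[, 1:2])   # same rows, different coordinates when k = 4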


The basis of k-means is the distance formula: it iteratively tries to keep each data point as close as possible to the centre of its cluster. So when you increase the number of clusters, it places additional centroids (cluster centres) and then finds the data points closest to each of them, which is why you get a different plot each time you change n. Also keep in mind that the colours may vary depending on the type of plotting you use, and they will be different every time.

So only compare plots that use the same n; comparing n = 3 against n = 4 on the basis of colours and shapes is wrong, because the two results have different centroids. In your case you can also see that the 1st and 4th clusters in the second image overlap each other. Look into that as well: it may be that points are being wrongly assigned between the 1st and 4th clusters.
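On top of that, cluster labels (and therefore colours) are arbitrary even for a fixed n: two runs can recover the same partition but number the clusters differently. A small sketch, again assuming iris as stand-in data:

set.seed(1)
a <- kmeans(iris[, 1:4], centers = 3)$cluster
set.seed(2)
b <- kmeans(iris[, 1:4], centers = 3)$cluster

# The cross-tabulation shows whether the two partitions agree
# even when the label numbers (and plot colours) are permuted
table(a, b)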


plotcluster chooses the projection based on the clustering result you give it; see the documentation (?plotcluster in the fpc package) for more details.

Because of that, you cannot directly compare the two plots.


why do I see a different plot when I change the number of clusters in a k-means plot?

The cluster assignment of each observation in k-means clustering depends on the number of clusters (centroids) that you choose. The centroids self-adjust as the algorithm iterates, so the position of each one depends on the positions of the others. That is why introducing a new centroid shifts the others to a new equilibrium.
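You can see this shift directly by comparing the fitted centroids. A minimal sketch, assuming iris as stand-in data:

set.seed(1)
k3 <- kmeans(iris[, 1:4], centers = 3)
k4 <- kmeans(iris[, 1:4], centers = 4)

k3$centers   # 3 centroids
k4$centers   # 4 centroids; none of them need coincide with the 3 above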

--

can you determine the number of clusters by how "neat" the clustering looks in the plot?

The common way of choosing the number of clusters is the elbow method. It is based on running the algorithm multiple times, each time with a different k, and recording the error of each clustering (typically the total within-cluster sum of squares). By plotting this error against k and looking for the "elbow", the point where adding more clusters stops giving much improvement, you can choose a k that satisfies you.
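A minimal sketch of the elbow method, assuming data is the numeric matrix or data frame you are clustering:

set.seed(1)
# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(data, centers = k, nstart = 10)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")
# pick the k at the "elbow", where the curve starts to flatten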
