K-Means in R vs. K-Means in Python: different cluster values produce different bar graphs

Question

Leo Torres

May 11, 2022, 10:26

Below are two scripts that do the same thing, one in Python and one in R. Both produce the same K-means scatter plot over the first two principal components, but the bar charts of the cluster centers at the end are completely different. I believe something is wrong with the K-means cluster-center calculation in the Python version. The original code was provided in R, and I am trying to work out why the Python bar chart does not match it; I suspect the centers. Please review and provide some feedback.

Please use the link below to download the data set I used to generate these graphs.

https://www.dropbox.com/s/fhnxxrjl07y0h2c/TableStats2.csv?dl=0

R Code

## Retrieve libraries needed for the script
library(RODBC)
library(ggplot2)
library(ggrepel)
library(reshape2)

# Clear workspace
rm(list = ls())

pcp <- read.csv(file = 'E:\\ProgramData\\R\\Code\\TableStats2.csv')
# Disable scientific notation
options(scipen = 999)

# Label each row with the table name so names can be plotted on the chart.
data <- pcp
rownames(data) <- data[, 1]


# Gather all the data and leave out the table names (drops the first two columns)
data <- data[, -1]
data <- data[, -1]


# Create the PCA (Principal Component Analysis)
data <- scale(data)
pca <- prcomp(data)
summary(pca)

plot.data <- data.frame(pca$x[, 1:2])

# K-means is fitted on the scaled data, not on the PCA scores
clusters <- kmeans(data, 6)
plot.data$clusters <- factor(clusters$cluster)

g <- ggplot(plot.data, aes(x = PC1, y = PC2, colour = clusters)) +
  geom_point(size = 3.5) +
  geom_text(label = rownames(data), colour = "darkgrey", hjust = .7) +
  theme_bw()
print(g)

behaviours <- data.frame(clusters$centers)
behaviours$cluster <- 1:6
behavious <- melt(behaviours, "cluster")

g2 <- ggplot(behavious, aes(x = variable, y = value)) +
  geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
  facet_wrap(~cluster) +
  theme_grey() +
  theme(axis.text.x = element_text(angle = 90))

print(g2)

Python Code

import pandas as pd
import numpy as np
from sklearn import cluster

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from scipy import linalg as LA
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from plotnine import ggplot, aes, geom_line, geom_bar, facet_wrap, theme_grey, theme, element_text

TableStats = pd.read_csv(r'E:\ProgramData\R\Code\TableStats2.csv')

sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables

features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values

x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
plt.figure(figsize=(20,11))
plot = plt.scatter(x1,y1, c=y.index.tolist())
for i, label in enumerate(y):
  #print(label)
  plt.annotate(label,(x1[i], y1[i]))
plt.show()

df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']) 

clusters = KMeans(n_clusters=6,init='k-means++', random_state=0).fit(df)

clusters.feature_names_in_
df['Cluster'] = clusters.labels_
df['Cluster Centroid D1'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0])
df['Cluster Centroid D2'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1])
df['tables'] = tables

#print Table Names
plt.figure(figsize=(20, 11))
ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
ax = sns.scatterplot(data=df, x='Cluster Centroid D1', y='Cluster Centroid D2', hue='Cluster', s=1000, palette='coolwarm', legend=False, alpha=0.1)
for line in range(0,df.shape[0]):
    ax.text(x1[line]+0.05, y1[line],TableStats['Object Name'][line], horizontalalignment='left',size='medium', color='black',weight='semibold')

plt.legend(loc='upper right', title='Cluster')
ax.set_title('Clustered Points', fontsize='xx-large', y=1.05)
plt.show()

# This is where the R and Python graphs differ, because the cluster centers don't match
behaviours = pd.DataFrame(clusters.cluster_centers_)
behaviours.columns = clusters.feature_names_in_
behaviours['cluster'] = [1,2,3,4,5,6]

b2 = pd.melt(behaviours, id_vars='cluster', value_name='value')

(ggplot(b2, aes(x = 'variable', y = 'value')) + 
geom_bar(stat = 'identity', position = 'identity', fill = 'steelblue') + 
facet_wrap('~cluster') + 
theme_grey() + 
theme(axis_text_x = element_text(rotation = 90, hjust=1), figure_size=(20,8)) 
)
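
One possible cause, offered as a sketch rather than a confirmed diagnosis: the R script runs kmeans on the scaled feature matrix, while the Python script fits KMeans on df, which is built from the PCA scores dpca and only relabelled with the original feature names. If that reading is right, the Python centers are expressed in principal-component coordinates and the R centers in scaled-feature coordinates, so the bar charts would differ even when the cluster memberships agree. The example below uses synthetic data (a hypothetical stand-in, not TableStats2.csv) and a hypothetical helper, cluster_means, to show the two coordinate systems and how PCA.inverse_transform maps one back to the other.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven numeric columns (assumption: NOT the real CSV).
rng = np.random.default_rng(0)
raw = rng.normal(size=(60, 7)) * rng.uniform(1, 50, size=7)

scaled = StandardScaler().fit_transform(raw)  # roughly what R's scale(data) does
pca = PCA()
scores = pca.fit_transform(scaled)            # roughly what pca$x holds in R


def cluster_means(data, labels, k):
    """Hypothetical helper: mean of the rows of `data` within each cluster label."""
    return np.vstack([data[labels == j].mean(axis=0) for j in range(k)])


# Cluster the scaled features, as the R script does with kmeans(data, 6).
k = 6
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled).labels_

# The same clusters described in two coordinate systems.
centers_scaled = cluster_means(scaled, labels, k)  # feature space (R's bar chart)
centers_pca = cluster_means(scores, labels, k)     # PCA space (what fitting on scores yields)

# Mapping the PCA-space centers back recovers the feature-space centers.
recovered = pca.inverse_transform(centers_pca)
same_after_rotation = np.allclose(recovered, centers_scaled)
same_without_rotation = np.allclose(centers_pca, centers_scaled)
print(same_after_rotation, same_without_rotation)
```

If this applies, fitting on the scaled array x instead of df should put the Python centers in the same space as R's. Cluster numbering is arbitrary in both libraries and the R script sets no seed, so the facets may still appear in a different order. StandardScaler also divides by the population standard deviation where R's scale uses the sample one, which introduces a small constant scaling difference.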

Topic k-means clustering

Category Data Science
