How to choose the optimal k in k-protoypes?
To analyze a dataset from banking I have both numerical and categorical values. I transform them to analyze with k-prototypes.
The original dataset:
The modified dataset:
- E.g.: Job (for 1 to 12 'cos there are 12 levels)
Should I scale the dataset before doing the k-prototypes?
How could I determine the optimal "k" to choose (coding)?
I thought to execute:
library(clustMixType)
lbd - lambdaest(BPor)
kpres - kproto(BPor, 5, lambda = lbd) #Change '5' for every possible value of k.
print(kpres)
And then, calculate the sum of within cluster error (choosing the little one).
Topic numerical r categorical-data k-means clustering
Category Data Science