Semi-Unsupervised learning - kmeans

Question

user133019

March 2, 2022, 23:30

I have data consisting of n vectors (each m×1), and I want to cluster the data based on distance. Each vector is also labeled with a category. I used the k-means algorithm (in MATLAB) to cluster the data. I want none of the clusters to contain data from only one category. Is there a way to add this constraint, or is there an algorithm that can do this? Thanks!

Topic matlab clustering machine-learning

Category Data Science

Erwan · Accepted Answer · March 2, 2022, 23:30

You didn't mention whether you must obtain a specific number of clusters $k$ or not. Assuming that you don't, a simple option to ensure that none of the clusters contains only one category is to reduce the number of clusters $k$:

1. Start by running $k$-means with a large $k$, e.g. 50 (pick this value depending on your data).
2. Check whether any cluster contains only one category. If yes, run $k'$-means with $k'=k-1$.
3. Repeat until the condition is satisfied.
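
The loop above can be sketched as follows. This is only an illustration in plain Python, not MATLAB: `kmeans` here is a small hand-written Lloyd's algorithm, and the names `all_clusters_mixed` and `shrink_k_until_mixed` are hypothetical helpers made up for this example. In MATLAB the same loop would wrap a call to the built-in `kmeans` instead.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    assign = None
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Move each centroid to the mean of its members (left as-is if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def all_clusters_mixed(assign, labels):
    """True if every non-empty cluster contains at least two categories."""
    cats = {}
    for a, lab in zip(assign, labels):
        cats.setdefault(a, set()).add(lab)
    return all(len(s) >= 2 for s in cats.values())


def shrink_k_until_mixed(points, labels, k_start):
    """Decrease k from k_start until no cluster is single-category."""
    for k in range(min(k_start, len(points)), 0, -1):
        assign = kmeans(points, k)
        if all_clusters_mixed(assign, labels):
            return k, assign
    return None  # reached only when the condition fails even for k = 1
```

For example, `shrink_k_until_mixed(points, labels, 50)` takes `points` as a list of equal-length tuples and `labels` as a list of categories in the same order. Because k-means depends on its initialization, a different `seed` may satisfy the condition at a different $k$.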

The same idea can be used with hierarchical clustering. The advantage of hierarchical clustering is that you run the algorithm only once, then you can choose at which level to stop in the hierarchy of clusters.
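
A rough sketch of the hierarchical variant, again in plain Python: clusters are merged bottom-up, and merging stops at the first level where every cluster is mixed. Centroid distance is used as the merge criterion purely as an assumption for this example (the answer does not specify a linkage), and `agglomerate_until_mixed` is a hypothetical name.

```python
import math


def agglomerate_until_mixed(points, labels):
    """Merge the two closest clusters (by centroid distance) until every
    cluster contains at least two categories, or only one cluster remains.
    Returns the clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def centroid(cluster):
        dim = len(points[cluster[0]])
        return [sum(points[i][d] for i in cluster) / len(cluster) for d in range(dim)]

    def is_mixed(cluster):
        return len({labels[i] for i in cluster}) >= 2

    while len(clusters) > 1 and not all(is_mixed(c) for c in clusters):
        cents = [centroid(c) for c in clusters]
        # Find the pair of clusters whose centroids are closest.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: math.dist(cents[ij[0]], cents[ij[1]]),
        )
        # Merge cluster b into cluster a; b > a, so deleting b keeps a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

This naive version recomputes all pairwise distances at every merge, so it is only practical for small data sets. If all data share a single category, it ends with one unmixed cluster, so the caller should check the result.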
