Semi-Unsupervised learning - kmeans

I have data consisting of n vectors (m×1 each) that I want to cluster based on distance. Each vector is also labeled with a category. I used the k-means algorithm (in Matlab) to cluster the data, and I want none of the clusters to contain data from only one category. Is there a way to add this constraint, or an algorithm that can do this? Thanks!

Topic matlab clustering machine-learning

Category Data Science


You didn't mention whether you must obtain a specific number of clusters $k$ or not. Assuming that you don't, a simple option to ensure that none of the clusters contains only one category is to reduce the number of clusters $k$:

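  • Start by running $k$-means with a large $k$, e.g. 50 (pick this value depending on your data).
  • Check whether any cluster contains only one category. If yes, run $k'$-means with $k'=k-1$.
  • Repeat until the condition is satisfied. Assuming the data contain at least two categories, this always terminates: in the worst case it ends at $k=1$, a single cluster holding all the data.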
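The loop above could be sketched as follows. This is a minimal illustration in standard-library Python rather than Matlab, and it is not a reference implementation: the function names `kmeans`, `all_clusters_mixed` and `shrink_k_until_mixed` are made up for this example, the k-means is a plain Lloyd's iteration with a fixed seed, and points are assumed to be tuples of numbers.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    assign = None
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        new_assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments are stable, so the algorithm has converged
        assign = new_assign
        # Move each center to the mean of its members; leave it if the cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def all_clusters_mixed(assign, categories):
    """True if every non-empty cluster contains at least two distinct categories."""
    seen = defaultdict(set)
    for a, cat in zip(assign, categories):
        seen[a].add(cat)
    return all(len(cats) >= 2 for cats in seen.values())


def shrink_k_until_mixed(points, categories, k_start):
    """Decrease k from k_start until no cluster is single-category."""
    for k in range(min(k_start, len(points)), 0, -1):
        assign = kmeans(points, k)
        if all_clusters_mixed(assign, categories):
            return k, assign
    return None  # reached only if the whole data set has a single category


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.1)]
    cats = ["a", "b", "a", "b", "a", "b"]
    print(shrink_k_until_mixed(pts, cats, k_start=4))
```

Because k-means depends on its initialization, the value of $k$ at which the loop stops can vary with the seed; in practice one might run several restarts per $k$ before decrementing.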

The same idea can be used with hierarchical clustering. Its advantage is that you run the algorithm only once and can then choose the level of the cluster hierarchy at which to stop.
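As a rough sketch of the hierarchical variant, the following standard-library Python builds the hierarchy bottom-up and stops at the first level where every cluster holds at least two categories. Merging clusters can only add categories to a cluster, so the condition stays satisfied at every higher level. The function name `agglomerate_until_mixed` and the choice of single linkage are assumptions made for this example, and the naive pairwise search is only practical for small data sets.

```python
def agglomerate_until_mixed(points, categories):
    """Single-linkage agglomerative clustering, stopped at the first level
    where every cluster contains at least two distinct categories.

    Returns a list of clusters, each a list of point indices, or None if
    the data set contains only one category.
    """
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def dist(i, j):
        # Squared Euclidean distance between points i and j.
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(i, j) for i in c1 for j in c2)

    def mixed(cluster):
        return len({categories[i] for i in cluster}) >= 2

    while not all(mixed(c) for c in clusters):
        if len(clusters) == 1:
            return None  # one cluster left and it is still single-category
        # Merge the two closest clusters (one step up the hierarchy).
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.1)]
    cats = ["a", "b", "a", "b", "a", "b"]
    print(agglomerate_until_mixed(pts, cats))
```

With a library implementation, the equivalent would be to compute the full linkage once and then cut the tree at successively higher levels until the check passes.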
