Clustering with imbalanced data and groups

My problem is to identify clusters of highly correlated items. So far I have focused on building a model and features that place similar items close to each other. The main challenge is that the data is heavily imbalanced:

  • Tens of millions of items are random and not necessarily correlated.
  • Hundreds of clusters (each with tens to thousands of elements) exist or may emerge. I have partial ground truth for the existing ones.
  • Clusters differ widely in size and properties.

I'd like to return the identified clusters and the elements within each one. F1 should be a good evaluation measure.
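One way to make "F1" concrete for clusterings is a pairwise F1: a pair of items is a positive if both land in the same cluster. The sketch below is only an illustration under that assumption. The function names, the dict-based label format, and the toy data are all made up for the example, and only items with a ground-truth label are scored, since the ground truth is partial.

```python
from collections import defaultdict
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of item pairs that share a cluster.

    `labels` maps item id -> cluster id; None marks an unclustered item.
    """
    members = defaultdict(list)
    for item, cluster in labels.items():
        if cluster is not None:
            members[cluster].append(item)
    pairs = set()
    for items in members.values():
        pairs.update(combinations(sorted(items), 2))
    return pairs


def pairwise_f1(truth, predicted):
    """Pairwise F1, restricted to items that have a ground-truth label."""
    predicted = {item: predicted.get(item) for item in truth}
    true_pairs = co_clustered_pairs(truth)
    pred_pairs = co_clustered_pairs(predicted)
    tp = len(true_pairs & pred_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): two known clusters, and a prediction that splits one.
truth = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
predicted = {"a": 10, "b": 10, "c": 11, "d": 11, "e": 11, "z": 10}
score = pairwise_f1(truth, predicted)
print(score)
```

Cluster ids in the prediction do not need to match the ground-truth ids, because only co-membership is compared.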

To move forward, I can think of threshold-based hierarchical clustering. Are there other techniques to consider?
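For reference, threshold-based hierarchical clustering could look roughly like the sketch below, which uses scikit-learn's `AgglomerativeClustering` with `distance_threshold` instead of a fixed number of clusters. The synthetic data, the threshold of 0.5, the average linkage, and the minimum cluster size of 10 are all assumptions chosen for illustration, not values derived from the real problem.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Synthetic stand-in for the real features: two tight groups plus scattered background.
group_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(30, 2))
group_b = rng.normal(loc=(5.0, 5.0), scale=0.05, size=(20, 2))
background = rng.uniform(low=-20.0, high=20.0, size=(50, 2))
X = np.vstack([group_a, group_b, background])

# n_clusters=None lets the distance threshold decide where merging stops.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, linkage="average"
)
labels = model.fit_predict(X)

# Treat small clusters (mostly isolated background items) as "not a cluster".
min_size = 10  # assumed cutoff
ids, counts = np.unique(labels, return_counts=True)
kept = ids[counts >= min_size]
print("clusters kept:", len(kept))
```

Note that plain agglomerative clustering works from pairwise distances, so at tens of millions of items it would likely need sampling, blocking, or a nearest-neighbour connectivity constraint before it becomes practical.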

Topic imbalance clustering

Category Data Science


Since you have partial ground truth (I am assuming it covers ALL clusters), I would suggest an idea borrowed from region growing in image segmentation.

Because your clusters differ in the number of points, and therefore in density, they can probably be captured locally with DBSCAN. Run DBSCAN with a range of parameter settings and evaluate how well each run assigns your ground-truth items to the right clusters. The partitioning that scores best on this ground-truth evaluation becomes your final clustering.
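A minimal sketch of that parameter sweep, assuming scikit-learn is available: the synthetic data, the `eps` and `min_samples` grid, and the choice of a pairwise F1 (built on `pair_confusion_matrix`) as the score are illustrative assumptions, not part of the suggestion itself. Each DBSCAN noise point is given its own label so that noise is never scored as one big cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.cluster import pair_confusion_matrix


def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1 between two labelings; DBSCAN noise (-1) becomes singletons."""
    pred = np.asarray(pred_labels).copy()
    noise = pred == -1
    pred[noise] = pred.max() + 1 + np.arange(noise.sum())
    c = pair_confusion_matrix(true_labels, pred)
    tp, fp, fn = c[1, 1], c[0, 1], c[1, 0]
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


rng = np.random.default_rng(0)
# Synthetic stand-in: one dense cluster, one sparse cluster, uniform background.
dense = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(40, 2))
sparse = rng.normal(loc=(6.0, 6.0), scale=0.3, size=(40, 2))
background = rng.uniform(low=-10.0, high=10.0, size=(200, 2))
X = np.vstack([dense, sparse, background])

# Partial ground truth: only the first 20 items of each cluster are labeled.
labeled_idx = np.concatenate([np.arange(0, 20), np.arange(40, 60)])
labeled_truth = np.array([0] * 20 + [1] * 20)

best_score, best_params, best_labels = -1.0, None, None
for eps in (0.05, 0.1, 0.2, 0.4, 0.8):        # assumed grid
    for min_samples in (5, 10):               # assumed grid
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        score = pairwise_f1(labeled_truth, labels[labeled_idx])
        if score > best_score:
            best_score, best_params, best_labels = score, (eps, min_samples), labels

print("best (eps, min_samples):", best_params, "pairwise F1:", round(best_score, 3))
```

Because the score is computed only on labeled items, it cannot penalize unlabeled background points that get absorbed into a cluster, so it is worth inspecting the size of the winning clusters as well.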
