Is it always possible to get well-defined clusters from the data?

I have TV watching data and I have been trying to cluster it to get different sets of watchers. My dataset consists of 64 features (such as total watching time, percent of ads skipped, movies vs. shows, etc.). All the variables are either numerical or binary. But no matter how I treat them (normalize them, standardize them, leave them as is, take a subset of features, etc.), I always end up getting pictures similar to this:

This particular picture was constructed after applying t-SNE with 2 components from the scikit-learn library. The picture is similar when using PCA and even when using both PCA and t-SNE combined.

It looks like all the watchers are pretty much the same and that we cannot divide them into clusters. But I highly doubt this. Hence, my question is: is it possible that the data is so homogeneous? Or maybe it is just not possible to visualize it like I am trying to do? Are there maybe some advanced visualization techniques?

Topic tsne pca clustering machine-learning

Category Data Science

First of all, a picture alone should not be used to decide whether or not there are groups in your data. No matter what projection you use (linear with PCA or a nonlinear manifold embedding with t-SNE), you are reducing a 64-dimensional space to a 2-dimensional one, and a lot of information is lost along the way.
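To get a rough feel for how much a 2D picture can hide, one option is to check how much variance two principal components actually keep. This is only a sketch on synthetic data: the random matrix `X` is a stand-in for the real 64-feature table, and this check only applies to PCA (t-SNE has no comparable variance figure).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 64-feature viewer matrix (assumption:
# replace this with the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))

# Standardize so that no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance that survives the projection to 2D.
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")
```

If the retained fraction is small, a shapeless 2D scatter says very little about the structure in the full feature space.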

Secondly, as far as I know, no theorem guarantees that you can find clusters in any given data matrix X; if anything, results point the other way, as in An Impossibility Theorem for Clustering. So to your first question, I would sadly say no.

So I would give you two pieces of advice to validate whether there are such groups in your data:

  1. You can try using a projection algorithm before clustering, as you already have, but I recommend using UMAP instead of t-SNE or PCA.

  2. Use a metric to evaluate cluster separation, such as inertia if you use K-Means, or the silhouette score with any other algorithm.

This metric should give you a good indication of whether or not you have groups and, hopefully, how many (by comparing the metric while varying the number of clusters).
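As a minimal sketch of that second piece of advice, the loop below fits K-Means for several values of k and records both inertia and the silhouette score. The data is synthetic (`make_blobs` with three planted groups), so `X_demo` and the range of k are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 planted groups in 64 dimensions (assumption:
# replace with the real, scaled feature matrix).
X_demo, _ = make_blobs(n_samples=300, n_features=64, centers=3, random_state=0)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_: within-cluster sum of squares (lower is tighter).
    # silhouette_score: between -1 and 1 (higher is better separated).
    scores[k] = (km.inertia_, silhouette_score(X_demo, km.labels_))

for k, (inertia, sil) in scores.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

# The k with the highest silhouette is one reasonable candidate.
best_k = max(scores, key=lambda k: scores[k][1])
print("best k by silhouette:", best_k)
```

Inertia tends to keep dropping as k grows, so it is usually read by looking for an elbow, whereas the silhouette score can be compared directly across k. If the silhouette stays close to zero for every k, that is a hint that the data may not have well-separated groups.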

Once you have found both an algorithm and a number of clusters with good metrics, you can run a cluster profiling (analyse a central tendency measure, such as the average of each feature, across clusters) to draw insights from each cluster's feature characteristics.
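A cluster profile of this kind could look roughly like the following. The column names (`total_watch_hours`, `pct_ads_skipped`) and the two synthetic viewer groups are made up for the example; only the `groupby(...).mean()` pattern is the point.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features for two synthetic viewer groups (assumption).
df = pd.DataFrame({
    "total_watch_hours": np.concatenate(
        [rng.normal(5, 1, 100), rng.normal(40, 5, 100)]),
    "pct_ads_skipped": np.concatenate(
        [rng.uniform(0.6, 1.0, 100), rng.uniform(0.0, 0.4, 100)]),
})
features = ["total_watch_hours", "pct_ads_skipped"]

# Cluster on standardized features, but profile on the original units.
X_scaled = StandardScaler().fit_transform(df[features])
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Mean of each feature per cluster, plus the cluster sizes.
profile = df.groupby("cluster")[features].mean()
sizes = df.groupby("cluster").size()
print(profile)
print(sizes)
```

Reading the rows of `profile` side by side is what turns cluster IDs into descriptions such as "heavy watchers who rarely skip ads".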

Then you can plot your 2D scatter again, this time using the cluster ID as the colour, which should be more insightful.
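Something like the sketch below would do that with scikit-learn and matplotlib. The data is again synthetic, the output file name `clusters_2d.png` is arbitrary, and the clustering is done in the full 64-dimensional space while t-SNE is used only for drawing.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs headless
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the scaled 64-feature matrix (assumption).
X_demo, _ = make_blobs(n_samples=200, n_features=64, centers=3, random_state=0)

# Cluster in the original space, then embed to 2D only for plotting.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_demo)

fig, ax = plt.subplots()
ax.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="viridis", s=10)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title("2D embedding coloured by cluster ID")
fig.savefig("clusters_2d.png")
```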

Hope it helps

64 features (like, total watching time, percent of ads skipped, movies vs. shows, etc.). All variables are either numerical or binary.

There are several problems with this. They are best seen when you try to decompose your data into so-called main effects, at least as a mental experiment or exercise.

  1. I don't see what your objective function y(par1, par2, ... par64) is or should be. Examples could be "watching as long as possible", "value of ordered goods", whatever. // It's even less obvious what the ideal objective function is or should look like in your case.

  2. Some of your 64 parameters may carry information (possibly none of them), and some may be or add just noise (possibly all of them). ANalysis Of VAriance (ANOVA) usually gives you a clue, along with the numerical residual error. From this you learn, e.g., "par23, par12 and par63 really cause a difference, in descending order, while all others can't be distinguished from random noise in y and hence are pooled into the total residual error".

BTW, whether a parameter turns out to be noise or signal often depends on its magnitude of change. Tiny parameter changes may give just ... tiny changes in output y ...
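To make the signal-versus-noise screening a bit more concrete, here is a rough sketch using a univariate F-test per parameter (scikit-learn's `f_regression`). This is in the spirit of the ANOVA screening described above rather than a full ANOVA, and everything in it is assumed for illustration: the eight synthetic parameters, and an objective `y` that depends only on the first two.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n_samples, n_params = 500, 8

# Synthetic parameters; only par0 and par1 influence the objective y,
# the rest are pure noise (assumption for this demo).
X_params = rng.normal(size=(n_samples, n_params))
y = 3.0 * X_params[:, 0] + 1.5 * X_params[:, 1] + rng.normal(scale=1.0, size=n_samples)

# One F-statistic and p-value per parameter.
F, p = f_regression(X_params, y)

# Rank parameters from strongest to weakest apparent effect.
ranking = np.argsort(F)[::-1]
for i in ranking:
    print(f"par{i}: F={F[i]:.1f}, p={p[i]:.3g}")
```

Parameters with a small F-statistic are the ones that cannot be told apart from noise in `y`. Note that this needs an objective function in the first place, which is exactly what is missing in a pure clustering setup.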

  3. There is a hierarchy of goodness or badness of parameter types for such investigations. It goes like this, from good to bad:
  • best are continuous data (like watching time); they are specific to the process or product you are about to design by intention
  • less meaningful are all percentage (ads skip percentage) or counting data: they tend to be unspecific, i.e. you can map something onto a percentage or count, but the opposite direction is ... ambiguous
  • worst are binary, classification etc.: they have lost physical meaning completely OR simply have too coarse or too narrow a variation. // Making things worse, these are usually very easy to obtain, seducing you into ... noise.

  4. Recapping, consider any so-called AI clustering to be functionally not much different from polynomial fits: they will always fit the current data set and will most likely fail to predict the behavior of new sets.

  5. Besides mathematics, there is the human tendency to discover patterns in almost anything we perceive. Think of e.g. "the man in the moon", which many of us see because we know what faces tend to look like. But those patterns may or may not be related to truth, i.e. how things really are or behave.

  6. Reproducibility is always a challenge in investigations like these, i.e. you probably want to predict future events. This again is very closely related to the ideal objective function, signal (information carrier) and noise (of parameters).
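The points about "always fitting the current data" and reproducibility can be probed with a simple stability check. The sketch below is one possible way to do it, not a standard recipe: K-Means is run on deliberately structureless uniform data, then fitted separately on two random halves and the two resulting labelings are compared with the adjusted Rand index. The data, k=4 and the half split are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(0)
# Structureless data: uniform noise with no planted groups (assumption).
X_noise = rng.uniform(size=(400, 10))

# K-Means still returns exactly k clusters, whether or not groups exist.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_noise)
n_found = len(np.unique(km.labels_))
sil = silhouette_score(X_noise, km.labels_)
print(f"clusters returned: {n_found}, silhouette: {sil:.3f}")

# Stability check: fit on two disjoint halves, label all points with
# each model, and measure how well the two labelings agree.
idx = rng.permutation(len(X_noise))
half_a, half_b = idx[:200], idx[200:]
labels_a = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X_noise[half_a]).predict(X_noise)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=2).fit(X_noise[half_b]).predict(X_noise)
ari = adjusted_rand_score(labels_a, labels_b)
print(f"agreement between halves (adjusted Rand index): {ari:.3f}")
```

An adjusted Rand index near 1 means both halves lead to essentially the same partition; values near 0 suggest the clusters are an artifact of the particular sample rather than something that would reappear in new data.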

Is it always possible to get well-defined clusters from the data?

So the answer to your question is clearly "no". It depends both on the matter investigated and on the way you approach it.

Hope this helps

This is true of any data analytics endeavor. You don't have ANY guarantees that you're going to find what you are looking for in your data.

You have a theory, question, assumptions... and you collect data to see if that fits the reality.

Careful though: absence of proof is not proof of absence. There are reasons why you might not see what you expect. In your case it could be corrupted or mislabeled data, or bias in the data collection (pulling from the same cluster), so some checks and verification are due before drawing conclusions.

So to answer your question: yes, it is possible that there are no clusters.

