Can clustering my data first help me learn better classifiers?

Question

Valentin Calomme

October 2, 2017, 09:51

I was thinking about this lately. Let's say that we have a very complex space, which makes it hard to learn a classifier that can efficiently split it. But what if this very complex space is actually made up of a bunch of "simple" subspaces? By simple, I mean that it would be easier to learn a classifier for that subspace.

In this situation, would clustering my data first, in other words finding these subspaces, help me learn a better classifier? This classifier would essentially be an ensemble of each subspace's classifier.

To clarify, I don't want to use the clusters as additional features and feed them to one big classifier; I want to train on each cluster individually.

Is this something that's already been done/proven to work/proven to not work? Are there any papers on it? I've been trying to search for things like this but couldn't find anything relevant so I thought I'd ask here.

Topic ensemble meta-learning unsupervised-learning classification clustering

Category Data Science

Jonathan DEKHTIAR · Accepted Answer · October 2, 2017, 09:51

It can absolutely be a way to improve your classifier's accuracy. In fact, a "strong" enough classifier such as a neural network might be able to learn these clusters by itself. However, you would need a substantially deeper network.

The "smartest" way to do this, if you know there are many groups/clusters in your data, is to use a two-step process:

1. Cluster your data.
2. Train X models, one for each of your clusters.

At prediction time, assign each new sample to its cluster and use that cluster's model.
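The two steps above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not a reference implementation: it uses scikit-learn, picks k-means for the clustering and logistic regression as the per-cluster model (the answer does not prescribe either), and the class name `ClusterThenClassify` and the synthetic data are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression


class ClusterThenClassify:
    """Hypothetical helper: k-means first, then one classifier per cluster."""

    def __init__(self, n_clusters=4, base_model=None, random_state=0):
        self.clusterer = KMeans(n_clusters=n_clusters, n_init=10,
                                random_state=random_state)
        self.base_model = LogisticRegression() if base_model is None else base_model
        self.models_ = {}

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.y_dtype_ = y.dtype
        cluster_ids = self.clusterer.fit_predict(X)      # step 1: cluster
        for c in np.unique(cluster_ids):                 # step 2: one model each
            mask = cluster_ids == c
            if np.unique(y[mask]).size < 2:
                # Guard (my assumption): a cluster with a single class cannot
                # train a real classifier, so predict that class constantly.
                model = DummyClassifier(strategy="most_frequent")
            else:
                model = clone(self.base_model)
            self.models_[c] = model.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        cluster_ids = self.clusterer.predict(X)          # route to a cluster
        y_pred = np.empty(len(X), dtype=self.y_dtype_)
        for c in np.unique(cluster_ids):
            mask = cluster_ids == c
            y_pred[mask] = self.models_[c].predict(X[mask])
        return y_pred


# Synthetic data (made up): four blobs, each split by a vertical boundary
# through its centre, but with the label direction flipped in two of them.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
flipped = [False, True, True, False]
X_parts, y_parts = [], []
for center, flip in zip(centers, flipped):
    pts = center + 0.5 * rng.standard_normal((100, 2))
    lab = pts[:, 0] > center[0]
    X_parts.append(pts)
    y_parts.append((~lab if flip else lab).astype(int))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

# One global linear model versus one linear model per cluster.
global_acc = LogisticRegression().fit(X, y).score(X, y)
model = ClusterThenClassify(n_clusters=4).fit(X, y)
local_acc = float(np.mean(model.predict(X) == y))
print(f"single model: {global_acc:.2f}, per-cluster models: {local_acc:.2f}")
```

On data built this way, no single linear boundary fits all four blobs, while each blob on its own is easy. On real data the gain depends on whether the clusters actually line up with simpler sub-problems, and accuracy should be measured on held-out data rather than on the training set as done here for brevity.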

A nice way to visualise this is the following problem: you want to build a recommendation engine for a Netflix-like application, but you don't want to build one model per person. How would you do this?

1. First, find clusters of similar users (geeks, SF fans, teenagers, etc.).
2. Fit one model for each of these clusters.
