Can clustering results based on probability be used for supervised learning?

hahaha

May 11, 2022 16:33

I'm a beginner and I have a question.

Can clustering results based on probability be used for supervised learning?

I have manufacturing data with 80,000 rows. It is not labeled, but I know that the defect rate is 7.2%.

If I tune the clustering hyperparameters so that one cluster matches the defect rate, can the resulting cluster assignments be used as labels for supervised learning?

Is there a paper that describes this approach?

Is this method seriously flawed from a data perspective?

If I use this method, how should the results be validated?

Topic unsupervised-learning supervised-learning clustering machine-learning

Category Data Science

Erwan answered at May 11, 2022 16:33

It's perfectly possible to use the results of clustering as features to train a supervised model... but this is not what you're asking, as far as I understand.

To have any kind of supervised learning, one needs some labelled data for training by definition. "Supervised" means that the model is trained specifically to predict the target variable based on the features, so it needs a representative sample of data with their target.

By contrast, "unsupervised" learning, such as clustering, means that the model tries to find whatever patterns exist in the features, not in relation to any particular variable. Sometimes the clusters happen by chance to correspond more or less to some variable, but there's absolutely no guarantee of that.

So first, clustering algorithms in general don't have a hyperparameter that controls the relative size of the clusters.

Even assuming the clustering happens to return the desired proportion for a particular cluster, there is no guarantee at all that this cluster would contain the defective items.
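To make this concrete, here is a minimal sketch on synthetic data; it is an illustration under stated assumptions, not an analysis of the data in the question. Everything in it is hypothetical: the two features, the "machine B" group, the shift sizes, and the small hand-written `two_means` function, which stands in for whatever clustering routine you would actually use. The data is built so that the strongest structure (which machine produced the part) has nothing to do with defects, while both groups make up roughly 7.2% of the rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                 # smaller than the 80,000 rows in the question, for speed
defect_rate = 0.072

# Assumption: defects occur independently of which machine made the part.
is_defect = rng.random(n) < defect_rate
# Assumption: about 7.2% of parts come from a second machine ("machine B").
machine_b = rng.random(n) < defect_rate

X = rng.normal(size=(n, 2))
X[machine_b, 0] += 8.0   # strong structure that is unrelated to defects
X[is_defect, 1] += 1.0   # weak signal that is related to defects


def two_means(X, n_iter=50):
    """Plain Lloyd's k-means with k=2 and a deterministic initialisation."""
    # Start from the first point and the point farthest away from it.
    far = np.argmax(np.linalg.norm(X - X[0], axis=1))
    centroids = np.stack([X[0], X[far]])
    for _ in range(n_iter):
        # Distance of every point to both centroids, shape (n, 2).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        centroids = np.stack([X[labels == k].mean(axis=0) for k in range(2)])
    return labels


labels = two_means(X)

# Treat the smaller cluster as the candidate "defect" cluster.
small = np.argmin(np.bincount(labels, minlength=2))
in_small = labels == small

cluster_share = in_small.mean()          # fraction of rows in the small cluster
precision = is_defect[in_small].mean()   # fraction of that cluster that is defective
recall = in_small[is_defect].mean()      # fraction of defects that fall in that cluster

print(f"share of rows in small cluster: {cluster_share:.3f}")
print(f"precision for defects:          {precision:.3f}")
print(f"recall for defects:             {recall:.3f}")
```

In this setup the small cluster should come out at close to the desired proportion, yet its precision and recall for defects should stay near the base rate, because the clustering picked up the machine split rather than the defects. Note that the precision and recall can only be computed here because the synthetic data comes with true labels, which is exactly what the unlabeled manufacturing data lacks.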

Basically this method is like walking blindly somewhere: maybe you'll arrive where you want to go, but it's more likely that you won't.
