Is it possible to cluster data according to a target?
I was wondering if there exist techniques to cluster data according to a target. For example, suppose we want to find groups of customers likely to churn:
- Target is churn.
- We want to find clusters of customers who behave similarly with respect to their likelihood of churning. Variables that do not explain churn behaviour should therefore not influence how the clusters are built.
I have done this analysis the following way:
- Predict target (e.g. using a Random Forest) and retrieve "most important features" (from feature importance analysis).
- Cluster samples with selected features (e.g. using k-means).
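To make the two steps above concrete, here is a minimal sketch using scikit-learn. The synthetic data from `make_classification`, the number of kept features (`top_k`) and the number of clusters (`n_clusters`) are all placeholder assumptions for illustration, not values from my actual analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a churn dataset (assumption: y = 1 means "churned").
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Step 1: predict the target and rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_k = 4  # hypothetical cut-off for "most important features"
top_features = np.argsort(rf.feature_importances_)[::-1][:top_k]

# Step 2: cluster the samples using only the selected features.
# Features are standardised because k-means is distance based.
X_sel = StandardScaler().fit_transform(X[:, top_features])
n_clusters = 3  # hypothetical number of clusters
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_sel)

# Inspect how the target is distributed inside each cluster.
for c in range(n_clusters):
    print(f"cluster {c}: size={np.sum(labels == c)}, churn rate={y[labels == c].mean():.2f}")
```

Comparing the churn rate per cluster is one simple way to check whether the clusters actually separate churners from non-churners.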
However, I am afraid the clustering technique used in the second step might not capture the behaviours found in the first step that explain churn. For example, if some trees in the RF rely on a complex interaction between features, k-means might not capture that interaction.
I was thinking of another way of doing this by using a neural network:
- Predict target using a neural network with several layers, and for each sample retrieve activations from a given layer.
- Cluster samples with their activations.
If the performance of the neural network is good and if the layer from which activations are retrieved is carefully chosen (not too close to the input or the output layer), I suppose the clusters could show customers displaying the same behaviour explaining the target.
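A rough sketch of this idea, again with scikit-learn and synthetic data. The layer sizes, the choice of the second hidden layer and the number of clusters are assumptions made up for the example. As far as I know `MLPClassifier` has no method that returns hidden activations, so the helper `hidden_activations` (a hypothetical name) recomputes the forward pass from the fitted `coefs_` and `intercepts_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a churn dataset (assumption: y = 1 means "churned").
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
X_std = StandardScaler().fit_transform(X)

# Step 1: predict the target with a small network (two ReLU hidden layers).
mlp = MLPClassifier(
    hidden_layer_sizes=(32, 8), activation="relu", max_iter=1000, random_state=0
).fit(X_std, y)


def hidden_activations(model, data, layer):
    """Manual forward pass through the first `layer` hidden layers (ReLU only)."""
    a = data
    for W, b in zip(model.coefs_[:layer], model.intercepts_[:layer]):
        a = np.maximum(a @ W + b, 0.0)
    return a


# Retrieve activations from the second hidden layer (8 units per sample).
acts = hidden_activations(mlp, X_std, layer=2)

# Step 2: cluster the samples by their activations.
n_clusters = 3  # hypothetical number of clusters
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(acts)

print("training accuracy:", mlp.score(X_std, y))
print("cluster sizes:", np.bincount(labels, minlength=n_clusters))
```

Changing the `layer` argument moves the clustering closer to the input or to the output, which is the trade-off described above.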
I did not find any articles taking this approach. Has anyone dealt with the same issue, or does anyone have other ideas?
Topic predictor-importance predictive-modeling clustering
Category Data Science