How do you effectively predict the top 20% most likely customers to churn from a dataset?

Suppose I have a dataset with 100,000 existing customers who have not churned and 20,000 former customers who churned in the past. The business objective is to target the 20% of customers who are most likely to churn. How would that be done?

For example, we would have to split this dataset into a training set and a test set, say in an 80/20 ratio. Once we build our model on the 80% training portion, we can no longer use that data to see whether any of those existing customers are likely to churn, because we cannot evaluate a model on the data used to train it. We can only use the remaining 20%, the test set, to evaluate the model and to rank customers by their predicted churn probability. What if there are existing customers with a high chance of churning whom we miss because they were in the training set?

Is what I have said above correct? This seems very different from predicting, for example, whether someone will default on a loan: there you can train on all past data and then use the model to predict on new customers coming into the bank. With churn, you want to predict on the customers who are with the bank right now, so that you can stop them from leaving.

Any answers are greatly appreciated.


If you use k-fold cross-validation, every customer in your dataset will lie in the test set of exactly one of the k folds, so each customer can be scored by a model that was not trained on them. That might help alleviate the issue at hand. Finally, you can train your model on your entire training set and run inference against net new customers.
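
To make the idea concrete, here is a rough sketch of out-of-fold scoring using only the Python standard library. Everything in it is an assumption for illustration: the synthetic data, the two features (`complaints`, `tenure_years`), the tiny hand-rolled logistic regression, and the helper names. The point is only the mechanics: each customer receives a churn score from a model fitted on the other folds, and the current (non-churned) customers are then ranked by that score to pick the top 20%.

```python
import math
import random


def make_customers(n, seed=0):
    """Build synthetic (features, churned) rows; the features are invented."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        complaints = rng.randint(0, 5)
        tenure_years = rng.uniform(0, 10)
        logit = 0.8 * complaints - 0.5 * tenure_years - 1.0
        churned = 1 if rng.random() < 1 / (1 + math.exp(-logit)) else 0
        rows.append(([float(complaints), tenure_years], churned))
    return rows


def predict_proba(w, b, x):
    """Logistic model: estimated churn probability for one customer."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well behaved
    return 1 / (1 + math.exp(-z))


def fit_logistic(rows, epochs=200, lr=0.05):
    """Batch gradient descent; a stand-in for whatever classifier you use."""
    w = [0.0] * len(rows[0][0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in rows:
            err = predict_proba(w, b, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(rows) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def out_of_fold_scores(rows, k=5, seed=0):
    """Score every row with a model trained on the other k - 1 folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # each index lands in one fold
    scores = [None] * len(rows)
    for held_out in folds:
        held = set(held_out)
        train = [rows[i] for i in idx if i not in held]
        w, b = fit_logistic(train)
        for i in held_out:
            scores[i] = predict_proba(w, b, rows[i][0])
    return scores


customers = make_customers(600)
scores = out_of_fold_scores(customers)

# Rank only the customers who have not churned and keep the top 20%.
active = [i for i, (_, churned) in enumerate(customers) if churned == 0]
ranked = sorted(active, key=lambda i: scores[i], reverse=True)
top_20 = ranked[: len(active) // 5]
print(f"{len(top_20)} of {len(active)} active customers flagged")

# A final model fitted on all the data, for scoring net new customers.
final_w, final_b = fit_logistic(customers)
```

If you are using scikit-learn, `cross_val_predict` with `method="predict_proba"` should give you the same kind of out-of-fold probabilities without writing the fold loop by hand.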
