I am currently working on a churn prediction problem.

As input, I use data from a data warehouse covering the period 08/2016 to 03/2021 (one row per month for each customer).

Based on this data, I created a time window of 18 months in which I track customer behaviour (feature engineering).

Based on these features, I predict churn over the following 4 months (12/2020 to 03/2021).

As a model, I use LightGBM with the following parameters:

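    parameters = {
        'objective': 'binary',
        'metric': 'auc',
        'is_unbalance': True,     # reweight classes to offset the imbalance
        'boosting': 'gbdt',
        'num_leaves': 31,
        'feature_fraction': 0.5,  # share of features sampled per tree
        'bagging_fraction': 0.5,  # share of rows sampled per bagging round
        'bagging_freq': 20,       # re-sample rows every 20 iterations
        'learning_rate': 0.05,
        'verbose': 0
    }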

and I get the following classification report on the test data (80/20 training/test split):

               precision    recall  f1-score   support

           0       0.96      0.93      0.95     48008
           1       0.68      0.80      0.73      8745

    accuracy                           0.91     56753
   macro avg       0.82      0.86      0.84     56753
weighted avg       0.92      0.91      0.91     56753

In the real application, I use the period 08/2016 to 03/2021 to create features and predict churn for the next 4 months (04/2021 to 07/2021).

In the last step, I build a dataset of clients who were active in 03/2021 and who churned during the following 4 months (04/2021 to 07/2021), about 1700 customers.

When I compare the model's predictions with the actual outcomes for these churned customers, the model has 44% accuracy: it correctly identifies only 844 of the roughly 1700 customers.

I cannot find the reason for such a huge difference between the test results and the model's performance in real prediction. Has anybody had a similar experience?


Here is the number of features and observations:

293552 rows × 152 columns

number of not churners - 242385
number of churners     -  51167

I will try cross-validation and the suggested metrics for churn.

One more question:

What is the best method to determine the threshold in this situation? At the moment, I use exactly what you said: 50%+ = churn, <50% = not churn.
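One common approach, sketched below, is to scan candidate thresholds on a held-out validation set and keep the one that maximizes a metric you care about. The sketch uses F1, which is only one possible choice; if a missed churner costs more than a false alarm, a cost-weighted criterion may fit better. The function name `best_f1_threshold` and the toy data are made up for illustration.

```python
def best_f1_threshold(y_true, y_prob):
    """Return (threshold, f1) maximizing F1 over the distinct predicted probabilities."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(y_prob)):
        # A customer is predicted to churn when the probability is >= t.
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
        if tp == 0:
            continue  # precision and recall are both 0 here, so F1 is 0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy validation labels and predicted churn probabilities (illustrative only).
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]
threshold, f1 = best_f1_threshold(y_val, p_val)
print(threshold, round(f1, 3))
```

This loop is quadratic in the number of rows, so for a dataset of this size a vectorized routine such as scikit-learn's `precision_recall_curve` is the more practical tool. Either way, pick the threshold on validation data rather than on the test set, otherwise the test score becomes optimistic.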

2020 has thrown a lot of models off. I'd suggest training your model on 2016-2018 and evaluating it on 2019 data. If that looks good, you'll know that your pipeline is fine.
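A minimal sketch of such a time-based split, assuming each row carries a month stored as a `"YYYY-MM"` string (which sorts chronologically as text, unlike the `MMYYYY` form used in the question). The field names `customer_id`, `month` and `churned` are hypothetical.

```python
def split_by_time(rows, train_end, eval_start, eval_end):
    """Rows up to train_end go to training; rows in [eval_start, eval_end] to evaluation."""
    train = [r for r in rows if r["month"] <= train_end]
    evaluation = [r for r in rows if eval_start <= r["month"] <= eval_end]
    return train, evaluation

# Toy rows (illustrative only).
rows = [
    {"customer_id": 1, "month": "2017-05", "churned": 0},
    {"customer_id": 2, "month": "2018-12", "churned": 1},
    {"customer_id": 1, "month": "2019-03", "churned": 0},
    {"customer_id": 3, "month": "2020-06", "churned": 1},
]
# Train on everything through 2018, evaluate on 2019, leave 2020 out entirely.
train, evaluation = split_by_time(rows, "2018-12", "2019-01", "2019-12")
print(len(train), len(evaluation))
```

Note that the features for each row should also be built only from months before that row's prediction window, otherwise the split does not actually separate past from future.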

You are unlikely to get a useful answer without a lot more details as there are lots of things that could cause this. How many features and how many observations do you have?

It is possible that you have massively overfit your training set:

  • Did you do a lot of hyperparameter tuning on your model?
  • When you fit the LightGBM model, did you cross-validate your results? Perhaps see what a 5- or 10-fold cross-validation shows.
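One thing to watch when cross-validating: if the training table holds several rows per customer (an assumption; the question does not say how many rows each customer contributes), a plain random split can put the same customer in both train and test folds and inflate the score. A minimal sketch of fold assignment that keeps each customer on one side only; `customer_folds` is a made-up helper name.

```python
import random

def customer_folds(customer_ids, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs where no customer appears on both sides."""
    unique = sorted(set(customer_ids))
    random.Random(seed).shuffle(unique)
    # Deal shuffled customers round-robin into n_splits folds.
    fold_of = {cid: i % n_splits for i, cid in enumerate(unique)}
    for k in range(n_splits):
        test_idx = [i for i, cid in enumerate(customer_ids) if fold_of[cid] == k]
        train_idx = [i for i, cid in enumerate(customer_ids) if fold_of[cid] != k]
        yield train_idx, test_idx

# Toy data: 10 customers with 3 rows each (illustrative only).
ids = [cid for cid in range(10) for _ in range(3)]
folds = list(customer_folds(ids, n_splits=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

You would fit the model on `train_idx` and score it on `test_idx` for each fold, then compare the fold scores with each other and with your single 80/20 result. scikit-learn's `GroupKFold` provides the same idea out of the box.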

If you're predicting data about a future time point,

  • Test whether any of your features have massively changed between training and test sets.
  • For example, if I build a model on customer data from 2020, when my business only targeted small businesses, and in 2021 I go after much bigger customers, my model might not be very good at predicting things, because the new data the model is tested on has radically shifted.
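One way to check a single numeric feature for this kind of shift is the Population Stability Index (PSI), sketched below under simple assumptions: equal-width bins taken from the training range, values outside that range clipped into the edge bins, and a small `eps` to avoid taking the log of zero. The function name and the toy data are made up for illustration.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of one numeric feature between two samples."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        # Floor each share at eps so empty bins do not break the logarithm.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Toy feature values (illustrative only): the "live" sample is shifted upward.
train_values = list(range(100))
live_values = [v + 50 for v in range(100)]
print(psi(train_values, train_values), psi(train_values, live_values))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a substantial shift, but those cut-offs are conventions rather than guarantees. Running this per feature between the training period and the 03/2021 scoring data would show which inputs moved the most.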

A few more things:

  1. ROC AUC is unlikely to be useful if your data is massively unbalanced (i.e. if <10% of customers churn). Precision-recall AUC is better, with your minority class set to 1 in the model.
  2. Does your model give you a probability of churning, which you then convert into "churn/no-churn"? If so, what threshold are you using? If you just unknowingly defaulted to 50%+ = churn, <50% = no churn, then consider whether that makes sense.
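To make point 1 concrete, here is a minimal sketch of average precision, a common summary of the precision-recall curve. It assumes labels are 0/1 with churn coded as 1 and at least one positive example, and it does not treat tied scores specially. The function name and the toy data are made up for illustration.

```python
def average_precision(y_true, y_score):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos

# Toy labels and scores (illustrative only).
y = [0, 0, 1, 0, 1, 1]
s = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
ap = average_precision(y, s)
print(round(ap, 3))
```

Unlike ROC AUC, whose chance level is always 0.5, the chance level here is roughly the share of churners in the data, so the score should be read against that baseline. scikit-learn's `average_precision_score` computes the same kind of quantity and handles ties.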

These are just a few ideas; there really is no substitute for exploring your data further, though.


