Logistic Regression for prediction
I would like to ask about the theoretical approach to using logistic regression on customer data, specifically for churn prediction (in BigQuery and Python).
I have customer data for an online shop and would like to predict whether a customer will churn based on some characteristics. I have built my dataset and the churn label. Since this is a non-contractual setting, the label rests on the assumption that a customer who has not bought anything in the last year has churned.
I am using 3 years of data (2019-2021), covering ~3M customers and 43 features. As noted above, a customer is considered churned if they did not place an order in 2021.
- I checked the distribution of my label, which is roughly balanced.
- I checked some logistic regression assumptions, such as multicollinearity and outlier influence.
- I split the data into 80% training data, 10% evaluation data, and 10% prediction data.
- I checked the model's performance by looking at the classification metrics (accuracy, recall, etc.).
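For concreteness, the split and the metric checks in the list above can be sketched roughly as follows. This is a minimal illustration in plain Python on made-up data, not the actual pipeline: the helper names (`split_80_10_10`, `accuracy`, `recall`), the 0.5 decision threshold, and the toy labels and probabilities are all assumptions introduced for the example.

```python
import random


def split_80_10_10(rows, seed=42):
    """Shuffle rows and split them into train / evaluation / prediction parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10   # 80% for training
    n_eval = len(rows) // 10        # 10% for evaluation
    train = rows[:n_train]
    evaluation = rows[n_train:n_train + n_eval]
    prediction = rows[n_train + n_eval:]  # remaining ~10%
    return train, evaluation, prediction


def accuracy(y_true, y_pred):
    """Share of customers whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Share of actual churners (label 1) that the model flagged as churners."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical customer ids standing in for the real customer table.
customer_ids = list(range(1000))
train_ids, eval_ids, pred_ids = split_80_10_10(customer_ids)
print(len(train_ids), len(eval_ids), len(pred_ids))

# Toy labels (1 = churned) and model probabilities for six evaluation customers.
y_true = [1, 0, 1, 1, 0, 0]
churn_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in churn_prob]  # assumed 0.5 threshold

print("accuracy:", accuracy(y_true, y_pred))
print("recall:", recall(y_true, y_pred))
```

The three parts are disjoint and together cover every customer, which is why only the last 10% has predictions so far.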
My question is:
We have predictions for the 10% prediction set (i.e. the probabilities that a customer will churn). Can we also obtain these probabilities for all the other customers, i.e. those in the training and evaluation datasets?
In other words, once the model has been trained and we have verified that it is usable, what are the next steps if the final goal is to have a churn probability for every customer?
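To make the question concrete, what I have in mind is something like the sketch below: take the fitted model and apply it to every customer, regardless of which split the customer was in. This is only an illustration under stated assumptions. It treats the fitted logistic regression as an intercept plus one coefficient per feature, and the feature names, coefficient values, and customer rows are invented for the example. Whether the scores for training rows can be trusted as much as those for held-out rows is part of what I am unsure about, since the model has already seen the training rows.

```python
import math


def churn_probability(features, intercept, coefficients):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters (not taken from any real model).
intercept = -0.5
coefficients = {"days_since_last_order": 0.01, "orders_2019_2020": -0.3}

# Hypothetical customers from all three splits.
customers = {
    "c1": {"split": "train",
           "features": {"days_since_last_order": 400, "orders_2019_2020": 1}},
    "c2": {"split": "evaluation",
           "features": {"days_since_last_order": 30, "orders_2019_2020": 6}},
    "c3": {"split": "prediction",
           "features": {"days_since_last_order": 200, "orders_2019_2020": 2}},
}

# Score every customer with the same model, whatever split they belong to.
scores = {
    cid: churn_probability(c["features"], intercept, coefficients)
    for cid, c in customers.items()
}

for cid, p_churn in scores.items():
    # P(not churn) is simply the complement of P(churn).
    print(cid, customers[cid]["split"], round(p_churn, 3), round(1 - p_churn, 3))
```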
Thank you in advance for your help!
Topic prediction churn logistic-regression
Category Data Science