Should I include active services when training an ML model for churn prediction?
I have been trying to build an ML model to predict churn events for our services. The services are subscription based, which means they usually have a fixed term (1-5 years). Because of that, churn usually happens when a service is about to expire or has already expired (and is running on a month-to-month basis).
While the churned services are straightforward, I am struggling with sampling for the services that did not churn. The ones that were renewed were initially labeled as 0s. However, the ratio of 0s to 1s defined this way is about 2:1, and when I run the model on active services, the predicted share of churners is too high. I have been thinking about including some of the active services in the training data as non-churned examples, especially the ones that are about to expire or have already expired. The problem is that this particular group is exactly what the business is most interested in. If they are part of the training data as 0s, I can't run predictions on them, because that would be data leakage. So I am in a dilemma here and would really appreciate any advice.
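To make the setup concrete, here is a minimal sketch of the split described above: churned services are labeled 1, renewed services are labeled 0, and active services are kept unlabeled for scoring. The records, field names, and status values are made up for illustration and are not our real schema.

```python
from collections import Counter

# Hypothetical records; field names and status values are illustrative only.
services = [
    {"id": 1, "status": "churned"},
    {"id": 2, "status": "renewed"},
    {"id": 3, "status": "renewed"},
    {"id": 4, "status": "active_expiring"},        # in term, about to expire
    {"id": 5, "status": "active_month_to_month"},  # expired, month-to-month
    {"id": 6, "status": "churned"},
]

# Only services with a known outcome get a label.
LABELS = {"churned": 1, "renewed": 0}


def split_services(rows):
    """Return (labeled training rows, unlabeled rows to score)."""
    train, to_score = [], []
    for row in rows:
        if row["status"] in LABELS:
            train.append({**row, "label": LABELS[row["status"]]})
        else:
            # Active services have no known outcome yet, so they stay unlabeled.
            to_score.append(row)
    return train, to_score


train, to_score = split_services(services)
label_counts = Counter(r["label"] for r in train)

# A service appearing in both sets is the leakage I am worried about.
overlap = {r["id"] for r in train} & {r["id"] for r in to_score}

print("label counts:", dict(label_counts))
print("services to score:", [r["id"] for r in to_score])
print("overlap between train and score sets:", overlap)
```

With this split the two sets never overlap, but the training set contains no example of a still-active service. Labeling rows 4 and 5 as 0 would fix that and at the same time move them out of the set I am allowed to score.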
Topic churn classification machine-learning
Category Data Science