Survival analysis on LTV for a subscription business
I'm trying to predict the expected LTV of a subscriber. Since monthly revenue and costs are almost constant, I only need to predict the survival function, where the terminal event is the subscription cancellation request. I proposed the following formula to estimate LTV:
$LTV = (Membership - Cost)*mean\ residual\ life(x)$
where:
$mean\ residual\ life(x)=E(X-x \mid X>x)= \frac{\int_{x}^{\infty}S(t)\,dt}{S(x)}$
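To make the formula concrete, here is a minimal sketch of how the mean residual life could be approximated numerically from a survival curve evaluated on a finite time grid. The function name `mean_residual_life`, the exponential survival curve, the 5% monthly hazard, and the margin of 10 per month are all illustrative assumptions, not values from my data. The integral is truncated at the end of the grid, so the grid has to extend far enough that $S(t)$ is close to zero.

```python
import numpy as np

def mean_residual_life(times, surv, x):
    """Approximate MRL(x) = (integral from x to infinity of S(t) dt) / S(x).

    `times` is an increasing grid and `surv` holds S(t) on that grid.
    The integral is truncated at times[-1] (trapezoidal rule).
    """
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    s_x = np.interp(x, times, surv)          # S(x) by linear interpolation
    if s_x <= 0:
        return 0.0
    mask = times > x
    t = np.concatenate(([x], times[mask]))   # grid restricted to [x, times[-1]]
    s = np.concatenate(([s_x], surv[mask]))
    area = np.sum(0.5 * (s[1:] + s[:-1]) * np.diff(t))
    return area / s_x

# Hypothetical example: constant 5% monthly hazard, so S(t) = exp(-0.05 t)
# and the true mean residual life is 1 / 0.05 = 20 months at any tenure x.
grid = np.linspace(0.0, 600.0, 60001)
curve = np.exp(-0.05 * grid)
mrl_12 = mean_residual_life(grid, curve, x=12.0)

monthly_margin = 10.0                        # hypothetical Membership - Cost
ltv_12 = monthly_margin * mrl_12
```

With a non-parametric curve that has not dropped to zero by the last observed time, the truncation matters and the tail needs some extrapolation assumption.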
In my case I have data on all subscribers over the last 10 years (more than 3 million data points, of which 1 million are still active), and I'm trying to predict the residual life of users who haven't canceled their subscription.
I only have right censoring as of today; all events that happened in the past are known.
My question is: how do I train and test a model without bias? If I use only data for those who canceled, the survival curve will be underestimated, since it only sees users who have experienced the event. If I use all my data, then I'll fit a curve using the same data points that I'll later try to predict (I only need the mean residual life for survivors).
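To illustrate the bias I'm worried about, here is a small sketch of a Kaplan-Meier estimator written directly in NumPy, which keeps still-active subscribers in the risk set until their censoring time instead of dropping them. The function name `kaplan_meier` and the six toy subscribers are made up for illustration; this is not a proposal for the final model, only a way to compare the two curves.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate at each distinct event time.

    durations: tenure in months (time to cancellation or to censoring)
    observed:  True if the cancellation was seen, False if still active
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)               # includes censored users
        events = np.sum((durations == t) & observed)
        s *= 1.0 - events / at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: three cancellations and three subscribers who are still active.
durations = np.array([2, 3, 3, 5, 8, 8])
observed = np.array([True, True, False, True, False, False])

t_all, s_all = kaplan_meier(durations, observed)
t_churn, s_churn = kaplan_meier(durations[observed], observed[observed])
```

On this toy data the curve fitted on everyone ends at 4/9 at month 5, while the curve fitted only on churned users drops to 0 at month 5, which is the underestimation described above. The loop runs once per distinct event time, so with monthly tenures over 10 years it stays cheap even for millions of rows.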
Also, how do I test my model? I was thinking that maybe I could force a right-censoring date 3 months ago, train a model on the data as it looked then, and check its predictions against what actually happened in the last 3 months, using the real right-censoring date of today.
Another possible issue is the pandemic: many people had shorter lifetimes because of it. I'm wondering how to handle that.
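The backtest idea could be sketched roughly as follows. The helper `censor_at`, the month-based time axis, and the five toy subscribers are assumptions for illustration; still-active subscribers are encoded with an end time of `np.inf`.

```python
import numpy as np

def censor_at(start, end, cutoff):
    """Rebuild durations and event flags as they would have looked at `cutoff`.

    start, end: months since some origin; end = np.inf if still subscribed.
    Returns (duration, observed, known), where `known` marks subscribers
    who had already signed up before the cutoff.
    """
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    known = start < cutoff
    observed = known & (end <= cutoff)        # cancellation already seen at cutoff
    duration = np.where(observed, end, cutoff) - start
    return duration[known], observed[known], known

# Toy data: today is month 60, artificial censoring 3 months earlier.
start = np.array([0.0, 10.0, 20.0, 30.0, 58.0])
end = np.array([12.0, np.inf, 58.5, np.inf, np.inf])
today, cutoff = 60.0, 57.0

train_duration, train_observed, known = censor_at(start, end, cutoff)

# Holdout: subscribers active at the cutoff, and what they did up to today.
active_at_cutoff = known & (end > cutoff)
churned_in_window = active_at_cutoff & (end <= today)
holdout_churn_rate = churned_in_window.sum() / active_at_cutoff.sum()
```

A model trained on `train_duration` and `train_observed` could then be scored by comparing its predicted conditional survival over the 3 months, $S(d+3)/S(d)$ for a subscriber with tenure $d$ at the cutoff, against `churned_in_window`. This only checks the next 3 months of the curve, not the full mean residual life.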
Any tips or ideas are welcome!
Topic survival-analysis
Category Data Science