Churn Prediction Training Set

I don't understand how to form my dataset from raw user data: activity (logins, etc.) and characteristics (location, age, etc.).

Ultimately, each row of the training set will have N activity features for a certain period, M characteristic features and a binary outcome: whether or not the user churned after the end of that period.

My problem comes from defining the period and the number of rows per user.

The options I see are the following:

  1. Define the period from the start of the user's lifetime, 1 week for example. Then each row is 1 user (activity and characteristics) and the outcome is whether they churned in week 2 or not.
  2. Break down a user's lifetime into periods. Score all users every day on the data from their last week. Let's say a user has a 2-week lifetime. The training data will be:

data_week_1, not churn

data_week_2, churn

Looking for any advice or links related to the viability of these or other methods of dataset formation.



A simple way to go would be to use Option 1. Each row is then uniquely identified by its user, which makes classification straightforward. You could additionally add a column recording the week in which the user churned (keep it out of the model's input features, since it reveals the outcome). This is loosely similar to a Type 2 Slowly Changing Dimension. Go wide with your dataset, i.e. keep adding as many columns as necessary. That way, even if you have to look at it in Excel before creating a classification model, it'll be easier. If you are using a Random Forest or any other tree-based algorithm, a wide layout gives the trees more candidate features to split on.
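As a rough illustration of Option 1, here is a minimal pandas sketch. The tables `users` and `logins`, their column names, the helper `build_option1` and the churn rule (no login during week 2) are all assumptions made for the example, not something given in the question. It also assumes every user has at least two full weeks of history.

```python
import pandas as pd

# Hypothetical raw data: one row per user, one row per login event.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-05"]),
    "age": [25, 41, 33],
    "location": ["US", "DE", "US"],
})
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "ts": pd.to_datetime([
        "2024-01-02", "2024-01-05", "2024-01-10",  # user 1: weeks 1 and 2
        "2024-01-04",                              # user 2: week 1 only
        "2024-01-06", "2024-01-20",                # user 3: weeks 1 and 3
    ]),
})

def build_option1(users, logins, period_days=7):
    """One row per user: week-1 activity, characteristics, churn label."""
    ev = logins.merge(users[["user_id", "signup"]], on="user_id")
    ev["day"] = (ev["ts"] - ev["signup"]).dt.days  # days since signup

    in_obs = ev[(ev["day"] >= 0) & (ev["day"] < period_days)]
    in_next = ev[(ev["day"] >= period_days) & (ev["day"] < 2 * period_days)]

    # Activity feature: number of logins in the observation week.
    feats = (in_obs.groupby("user_id").size()
             .rename("logins_week_1").reset_index())

    out = users.merge(feats, on="user_id", how="left")
    out["logins_week_1"] = out["logins_week_1"].fillna(0).astype(int)

    # Assumed label: churned = no login at all in the following week.
    out["churned"] = (~out["user_id"].isin(set(in_next["user_id"]))).astype(int)
    return out.drop(columns="signup")

train = build_option1(users, logins)
print(train)
```

Note that user 3 is labelled as churned even though they return in week 3. Whether that counts as churn depends on the definition you choose, which is exactly the period question raised above.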

Option 2 is also possible. However, it results in a very deep (long) dataset, with several rows per user, that keeps growing over time if you update it frequently.
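To make the shape of Option 2 concrete, here is a comparable sketch that emits one row per user per observed week. Again, the tables, column names, the helper `build_option2`, the `cutoff` date and the churn rule (no login in the following week) are assumptions for illustration. The sketch also stops emitting rows for a user once they churn, which is one possible choice rather than the only one.

```python
import pandas as pd

# Hypothetical raw data, same layout as a users table plus a login log.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-05"]),
    "age": [25, 41, 33],
})
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "ts": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-10",
                          "2024-01-04", "2024-01-06", "2024-01-20"]),
})

def build_option2(users, logins, cutoff, period_days=7):
    """One row per (user, period) with a label taken from the next period."""
    ev = logins.merge(users[["user_id", "signup"]], on="user_id")
    ev["period"] = (ev["ts"] - ev["signup"]).dt.days // period_days
    # Login counts keyed by (user_id, period index).
    counts = ev.groupby(["user_id", "period"]).size().to_dict()

    rows = []
    for u in users.itertuples(index=False):
        # Periods fully observed before the data cutoff.
        n_full = (cutoff - u.signup).days // period_days
        # The label needs period p + 1, so the last full period has no row.
        for p in range(n_full - 1):
            n_next = counts.get((u.user_id, p + 1), 0)
            rows.append({
                "user_id": u.user_id,
                "period": p + 1,  # 1-based, i.e. "week 1", "week 2", ...
                "age": u.age,
                "logins": int(counts.get((u.user_id, p), 0)),
                "churned": int(n_next == 0),
            })
            if n_next == 0:
                break  # assumed: no further rows once the user has churned
    return pd.DataFrame(rows)

train_long = build_option2(users, logins, cutoff=pd.Timestamp("2024-01-29"))
print(train_long)
```

In this toy data, user 1 contributes two rows (week 1, not churned; week 2, churned), mirroring the example in the question. Each extra week of history can add another row per surviving user, which is where the growth comes from.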
