Labeling and aggregating features issue

I am trying to build a simple binary classifier (some tree-based algorithm for now), and my training data will have features aggregated at the user level, so I'll have one unique record per user. These aggregated features are things like the number of logged-in sessions, the number of times the profile button was clicked, etc. - essentially website browsing behavior features.

What I am trying to predict is whether someone would be interested in subscribing or not. Some users might subscribe immediately after opening an account, some might do so after a few days, and some may never subscribe. My labels will be 1 (subscribed) and 0 (not subscribed).

Customers can only subscribe after logging in, so my dataset will contain users whose login counts range from 1 to N. Hence my aggregated features will also have a wide range of values, because users who have logged in, say, only once will have smaller feature values than users who have logged in multiple times.

My problem is twofold:

  • Label generation - Should I only select users who have, say, at least 3 logged-in sessions when assigning the subscribed/not-subscribed labels? I ask because users who have only one session and have not subscribed would get label 0 (not subscribed). I don't think I should assign them label 0, as I don't have enough data to conclude that label 0 is apt for them.
  • Say I select users who have at least 5 logged-in sessions and generate aggregate features. I feel my model won't be trained accurately if the features vary because of the varying number of logged-in sessions (e.g. user A has 3 sessions, hence aggregate feature values will be small compared to user B, who has, say, 10 logged-in sessions). Maybe I should level the field by aggregating data from only the first 3 logged-in sessions for each user and then see whether they subscribed or not in the future.

Am I thinking about this correctly?

Topic labels aggregation xgboost random-forest predictive-modeling

Category Data Science


The goal is to predict if a user will subscribe in the future. By definition you cannot have labelled data now about what people will do in the future. However you can phrase the problem like this: given a set of users at time $t$, predict whether they will subscribe by time $t+u$, where $u$ is for instance 1 year, 1 month or whatever duration fits your data.

Under this definition you can use any past data you have: for every unsubscribed user at time $t$ (e.g. one year ago), label them as "will subscribe" or "won't subscribe" using the data that you have now about them. You can even collect data at different points $t$ in time and calculate for each point $t$ who among the users will be subscribed within a year (i.e. the same user could potentially be used several times as an instance).

Naturally for every time $t$ you should only use users who were not subscribed at this time $t$.
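
The snapshot-based labelling described above can be sketched roughly as follows. This is only a minimal illustration under assumed inputs, not a definitive implementation: the event tuples, the `subscribed_at` mapping, the function name `build_snapshot_rows`, the event names and the 30-day horizon are all hypothetical and would need to be adapted to the real data.

```python
from collections import Counter
from datetime import datetime, timedelta


def build_snapshot_rows(events, subscribed_at, snapshot_times, horizon):
    """Build one training row per (user, snapshot time t).

    events: list of (user_id, timestamp, event_type) tuples (assumed layout).
    subscribed_at: dict user_id -> subscription datetime; users who never
        subscribed are simply absent from the dict.
    """
    rows = []
    for t in snapshot_times:
        # Features: count events per type, using only events strictly before t.
        counts = {}
        for user_id, ts, event_type in events:
            if ts < t:
                counts.setdefault(user_id, Counter())[event_type] += 1

        for user_id, feats in counts.items():
            sub_time = subscribed_at.get(user_id)
            # Skip users who were already subscribed at time t.
            if sub_time is not None and sub_time <= t:
                continue
            # Label 1 if the user subscribes within (t, t + horizon], else 0.
            label = int(sub_time is not None and t < sub_time <= t + horizon)
            rows.append({"user_id": user_id, "t": t, "label": label, **feats})
    return rows


# Hypothetical toy data, for illustration only.
events = [
    ("a", datetime(2023, 1, 5), "login"),
    ("a", datetime(2023, 1, 6), "profile_click"),
    ("a", datetime(2023, 2, 10), "login"),
    ("b", datetime(2023, 1, 20), "login"),
    ("c", datetime(2023, 2, 15), "login"),
]
subscribed_at = {"a": datetime(2023, 2, 20), "c": datetime(2023, 2, 16)}
snapshot_times = [datetime(2023, 2, 1), datetime(2023, 3, 1)]

rows = build_snapshot_rows(events, subscribed_at, snapshot_times, timedelta(days=30))
for row in rows:
    print(row)
```

In this toy data, user `b` appears at both snapshot times, which is the "same user used several times" case. A user who had no event of a given type before $t$ has no key for it in the row, so such features would need to be filled with 0 before training a tree-based model.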
