Labeling and aggregating features issue
I am trying to build a simple binary classifier (some tree-based algorithm for now), and my training data will have features aggregated at the user level, so I'll have one unique record per user. The aggregated features are things like number of logged-in sessions, number of times the profile button was clicked, etc. Essentially, these are website browsing behavior features.
What I am trying to predict is whether someone would be interested in subscribing or not. Some users might subscribe immediately after opening an account, some might do so after a few days, and some may not subscribe at all. My labels will be 1 (subscribed) and 0 (not subscribed).
Customers can only subscribe after logging in, so in my dataset I'll have users whose login counts range from 1 to N. Hence my aggregated features will also have a wide range of values, because users who have logged in only once will have smaller feature values than users who have logged in multiple times.
My problem is twofold:
- Label generation - Should I only select users who have, say, at least 3 logged-in sessions when assigning labels of subscribed or not? I ask because users who have only one session and have not subscribed would get label 0 (not subscribed). I don't think I should assign them label 0, as I don't think I have enough data to correctly conclude that label 0 is apt for them.
- Feature aggregation - Say I select users who have at least 5 logged-in sessions and generate aggregate features. I feel my model won't be trained accurately if the features vary simply because the number of logged-in sessions varies (e.g. user A has 3 sessions, hence their aggregate feature values will be small compared to user B, who has say 10 logged-in sessions). Maybe I should level the field by aggregating data from only the first 3 logged-in sessions for each user and then see whether they subscribed at some later point.
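To make the second idea concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: the session log layout, the `subscribed_on` mapping, the single `profile_clicks` feature, the cutoff `K = 3`, and the function name `build_training_rows` are all hypothetical, not my real data or pipeline. It keeps only users with at least `K` sessions, aggregates features over their first `K` sessions, and labels them by whether they subscribed after that window. I have also assumed that users who subscribed inside the window should be dropped, since their features would partly reflect post-subscription behavior.

```python
from collections import defaultdict
from datetime import date

K = 3  # assumed number of initial sessions used to build features

# Hypothetical session log: (user_id, session_date, profile_clicks).
sessions = [
    ("A", date(2024, 1, 1), 1), ("A", date(2024, 1, 2), 0),
    ("A", date(2024, 1, 3), 2), ("A", date(2024, 1, 5), 5),
    ("B", date(2024, 1, 1), 0), ("B", date(2024, 1, 3), 1),
    ("B", date(2024, 1, 6), 0),
    ("C", date(2024, 1, 1), 4),
    ("D", date(2024, 1, 2), 2), ("D", date(2024, 1, 4), 2),
    ("D", date(2024, 1, 7), 1),
]

# Hypothetical subscription dates; users who never subscribed are absent.
subscribed_on = {"A": date(2024, 1, 4), "B": date(2024, 1, 2)}


def build_training_rows(sessions, subscribed_on, k=K):
    """Aggregate each user's first k sessions and label by later subscription."""
    by_user = defaultdict(list)
    for user_id, day, clicks in sessions:
        by_user[user_id].append((day, clicks))

    rows = {}
    for user_id, user_sessions in by_user.items():
        user_sessions.sort()  # chronological order
        if len(user_sessions) < k:
            continue  # too little history to assign a label
        window = user_sessions[:k]
        window_end = window[-1][0]  # date of the k-th session
        sub_day = subscribed_on.get(user_id)
        # Assumption: drop users who subscribed on or before the k-th session.
        if sub_day is not None and sub_day <= window_end:
            continue
        rows[user_id] = {
            "profile_clicks": sum(clicks for _, clicks in window),
            "label": int(sub_day is not None),
        }
    return rows


training_rows = build_training_rows(sessions, subscribed_on)
for user_id, row in training_rows.items():
    print(user_id, row)
```

With this toy data, user C is dropped for having fewer than `K` sessions and user B is dropped for subscribing inside the window, so every remaining user's features cover the same number of sessions.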
Am I thinking about this correctly?
Topic labels aggregation xgboost random-forest predictive-modeling
Category Data Science