Appropriate Machine Learning algorithm for modeling clustered time-varying binary outcome
I'll just dive right in. I have a decent-sized dataset (100K observations) of time-varying continuous and categorical predictors. The categorical predictors usually do not change, but the continuous ones change every day. Another level of complexity, and the one I am struggling with, is that the data are clustered at several levels (repeated measurements coming from the same individual over time, with multiple individuals in the data set).
So, I have something like:
id | day | cont_predictor | cat_predictor | daily_outcome
1 | 1 | 1.4 | 1 | 0
1 | 2 | 1.7 | 1 | 0
1 | 3 | 1.9 | 1 | 0
1 | 4 | 2.9 | 1 | 1
2 | 1 | 4.0 | 0 | 0
2 | 2 | 4.1 | 0 | 0
3 | 1 | 5.7 | 0 | 0
3 | 2 | 4.2 | 0 | 0
3 | 3 | 3.5 | 0 | 1
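For anyone who wants to reproduce the layout, here is a minimal sketch that loads the rows above into pandas and summarizes the clustering. It assumes pandas is available. The columns cont_prev and cont_delta are hypothetical names added only for illustration (they are not in my data), and they show one possible way to expose the within-individual time structure to a model.

```python
import pandas as pd

# The nine example rows from the table above
rows = [
    (1, 1, 1.4, 1, 0),
    (1, 2, 1.7, 1, 0),
    (1, 3, 1.9, 1, 0),
    (1, 4, 2.9, 1, 1),
    (2, 1, 4.0, 0, 0),
    (2, 2, 4.1, 0, 0),
    (3, 1, 5.7, 0, 0),
    (3, 2, 4.2, 0, 0),
    (3, 3, 3.5, 0, 1),
]
columns = ["id", "day", "cont_predictor", "cat_predictor", "daily_outcome"]
df = pd.DataFrame(rows, columns=columns).sort_values(["id", "day"])

# Number of daily rows per individual (the clusters)
rows_per_id = df.groupby("id").size()

# Count of positive outcomes, to check the class imbalance
positives = int(df["daily_outcome"].sum())

# Hypothetical lag features: previous day's value and day-over-day change,
# computed within each individual so values never cross ids
df["cont_prev"] = df.groupby("id")["cont_predictor"].shift(1)
df["cont_delta"] = df["cont_predictor"] - df["cont_prev"]

print(rows_per_id)
print(df)
```

The first row of each individual has no previous day, so its lag columns are missing (NaN) and would need to be handled before fitting most models.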
I am looking for advice on which algorithm is best suited for modeling daily_outcome. This variable is highly imbalanced (30:1), and the observations are time-varying and clustered at the individual level (id). I could work with these data using a mixed-effects model for longitudinal logistic regression. However, I am most interested in optimizing prediction, so I'd really like to use a machine learning approach.
So far, I have tried logistic regression and random forest as a starting point. However, they do not perform well when predicting daily_outcome = 1, even after balancing and oversampling techniques, including SMOTE. I assume this is because these models do not account for the structure of the data, where observations from the same individual are highly correlated (I checked, and the intraclass correlations are very high).
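To make the concern about correlated observations concrete, here is a minimal sketch of a group-aware evaluation, assuming scikit-learn and NumPy. It is not my actual pipeline: the data are synthetic (a random intercept per individual stands in for the within-id correlation), and the sizes and coefficients are made-up values. The point is only that all rows from one id stay on the same side of each split, so the score is not inflated by the model seeing the same individual in both training and test data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 individuals, 10 days each
n_ids, n_days = 200, 10
ids = np.repeat(np.arange(n_ids), n_days)

# A random intercept per individual induces within-id correlation
random_intercept = rng.normal(0.0, 1.0, n_ids)[ids]
x = rng.normal(0.0, 1.0, n_ids * n_days)
logit = -3.5 + 1.0 * x + random_intercept
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
X = x.reshape(-1, 1)

# GroupKFold keeps every individual entirely in train or entirely in test
cv = GroupKFold(n_splits=5)
scores = []
for train_idx, test_idx in cv.split(X, y, groups=ids):
    # No individual appears on both sides of the split
    assert not set(ids[train_idx]) & set(ids[test_idx])

    # class_weight="balanced" reweights the rare positive class
    model = LogisticRegression(class_weight="balanced")
    model.fit(X[train_idx], y[train_idx])

    # Average precision is more informative than accuracy under imbalance
    proba = model.predict_proba(X[test_idx])[:, 1]
    scores.append(average_precision_score(y[test_idx], proba))

print("average precision per fold:", np.round(scores, 3))
```

The same splitter can be used with a random forest or any other scikit-learn classifier. If oversampling such as SMOTE is applied, it would have to happen inside each training fold only, otherwise synthetic copies of test individuals leak into training.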
What are some machine learning algorithms that have worked for you in the past? I'm looking for advice from people who have worked with this kind of data (it seems like a fairly common problem, but it's my first time working on it).
Thank you so much in advance!
Topic time class-imbalance classification
Category Data Science