Appropriate Machine Learning algorithm for modeling clustered time-varying binary outcome

I'll just dive right in. I have a decent-sized dataset (100K observations) of time-varying continuous and categorical predictors. The categorical predictors usually do not change, but the continuous ones change every day. Another level of complexity, and the one I am struggling with, is that the data are clustered at several levels (repeated measurements from the same individual over time, with multiple individuals in the data set).

So, I have something like:

id | day | cont_predictor | cat_predictor | daily_outcome
1  | 1   | 1.4            | 1             | 0
1  | 2   | 1.7            | 1             | 0
1  | 3   | 1.9            | 1             | 0
1  | 4   | 2.9            | 1             | 1
2  | 1   | 4.0            | 0             | 0
2  | 2   | 4.1            | 0             | 0
3  | 1   | 5.7            | 0             | 0
3  | 2   | 4.2            | 0             | 0
3  | 3   | 3.5            | 0             | 1

I am looking for advice on which algorithm is best suited for modeling daily_outcome. This variable is highly imbalanced (30:1), and the observations are time-varying and clustered at the individual level (id). I could work with these data using a mixed-effects model for longitudinal logistic regression. However, I am most interested in optimizing prediction, so I'd really like to use a machine learning approach.

So far I have tried logistic regression and random forest as a starting point, but neither performs well at predicting daily_outcome = 1, even after balancing and oversampling techniques, including SMOTE. I assume this is because these models do not account for the structure of the data, where observations from the same individual are highly correlated (I have checked: the intraclass correlations are very high).

What are some machine learning algorithms that worked for you in the past? Looking for advice from people who worked with this kind of data (seems like this would be a fairly common problem to work on, but it's my first time working on it).

Thank you so much in advance!


At first, choosing the best model is probably less important than setting up the data set properly. Most of the time in data science and modelling is spent building the best data set to use. Random forest classification, SVM, XGBClassifier, and even logistic regression can perform adequately when given a properly prepared data set. Also, the metric you use to compare models should be appropriate for imbalanced data. Just an FYI: accuracy is not one of them, because with a 30:1 imbalance a model that always predicts the majority class is already about 97% accurate. Here are a few suggestions to get you started:

Data preparation/cleansing:

  1. You can try other balancing methods, such as down-sampling the majority class (I think this is usually better than up-sampling because every row you keep is real data, not synthetic). SMOTE is fine too, but it can be a bit more complicated to use properly.
  2. You can try class weights in your model to give more importance to the minority class, if you think a mistake on that class is more important or more costly than a mistake on the majority class, or if you just want to balance your predictions evenly between the classes.
  3. You can use PCA, SVD or LDA to remove some of the correlations between your features.
  4. Normalize the data, as some algorithms still do not handle large differences in feature scale very well (e.g. feature 1 has a mean of 0.001 and feature 2 has a mean of 1000). This is good practice in any modelling.
  5. Look for outliers (fliers). Outliers can sometimes cause problems. Look for values that don't make sense or are too far outside the normal range.
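
As a rough illustration of suggestions 1, 2 and 4, here is a minimal sketch in plain Python using the toy rows from the question. The helper names (`downsample_majority`, `balanced_class_weights`, `standardize`) are made up for this example and are not from any library. The weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn's `class_weight="balanced"` option computes.

```python
import random
from collections import Counter

# Toy rows in the question's layout:
# (id, day, cont_predictor, cat_predictor, daily_outcome)
rows = [
    (1, 1, 1.4, 1, 0), (1, 2, 1.7, 1, 0), (1, 3, 1.9, 1, 0), (1, 4, 2.9, 1, 1),
    (2, 1, 4.0, 0, 0), (2, 2, 4.1, 0, 0),
    (3, 1, 5.7, 0, 0), (3, 2, 4.2, 0, 0), (3, 3, 3.5, 0, 1),
]
LABEL = 4  # column index of daily_outcome


def downsample_majority(rows, ratio=1.0, seed=0):
    """Keep every minority row plus a random subset of majority rows.

    ratio is the number of majority rows kept per minority row.
    """
    rng = random.Random(seed)
    counts = Counter(r[LABEL] for r in rows)
    minority_label = min(counts, key=counts.get)
    minority = [r for r in rows if r[LABEL] == minority_label]
    majority = [r for r in rows if r[LABEL] != minority_label]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    return minority + rng.sample(majority, n_keep)


def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def standardize(values):
    """Rescale to zero mean and unit variance (constant columns are left centred)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0
    return [(v - mean) / std for v in values]


balanced_rows = downsample_majority(rows)                # suggestion 1
weights = balanced_class_weights([r[LABEL] for r in rows])  # suggestion 2
scaled = standardize([r[2] for r in rows])               # suggestion 4

print(len(balanced_rows), weights)
```

In a real workflow you would apply down-sampling and compute the scaling statistics on the training split only, so that the test data stay untouched. Down-sampling and class weights are alternatives, so you would normally pick one rather than stack both.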

Metrics to use:

  1. I usually use ROC AUC to find which model is best at predicting both the majority and minority classes.
  2. Sensitivity and specificity (the true positive rate and the true negative rate) are also two good metrics for seeing how well your model predicts each class.
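
To make these metrics concrete, here is a small sketch in plain Python. The labels follow the question's toy table, but the scores are invented for illustration and do not come from any fitted model. The function names are hypothetical. In practice a library routine such as scikit-learn's `roc_auc_score` would do the same job.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate) for 0/1 labels.

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Labels from the question's toy table; scores are invented for illustration.
y_true = [0, 0, 0, 1, 0, 0, 0, 0, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.4, 0.35, 0.6, 0.2, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)

# A model that always predicts the majority class looks decent on accuracy
# but never finds a positive case.
always_zero = [0] * len(y_true)
naive_acc = accuracy(y_true, always_zero)
naive_sens, naive_spec = sensitivity_specificity(y_true, always_zero)

print(sens, spec, auc, naive_acc, naive_sens)
```

The last few lines show why accuracy misleads on imbalanced data: the always-zero predictor gets most rows right while its sensitivity is zero.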
