Appropriate Machine Learning algorithm for modeling clustered time-varying binary outcome

Question

Appropriate Machine Learning algorithm for modeling clustered time-varying binary outcome

Georgiy

2022年2月24日 16:04

I'll just dive right in. I have a decent-size (100K observations) dataset of time-varying continuous and categorical predictors. Categorical predictors, actually, usually do not change, however, continuous one change every day. Another level of complexity - the one that I am struggling with - is the fact that the data are clustered at several levels (measurements coming from the same individual over time, with multiple individuals in the data set).

So, I have something like:

id | day | cont_predictor | cat_predictor | daily_outcome
1  | 1   | 1.4            | 1             | 0
1  | 2   | 1.7            | 1             | 0
1  | 3   | 1.9            | 1             | 0
1  | 4   | 2.9            | 1             | 1
2  | 1   | 4.0            | 0             | 0
2  | 2   | 4.1            | 0             | 0
3  | 1   | 5.7            | 0             | 0
3  | 2   | 4.2            | 0             | 0
3  | 3   | 3.5            | 0             | 1

I am looking for advice on what algorithm is best suited for modeling daily_outcome. This variable is highly imbalanced (30:1) but also the observations are time-varying and are clustered on an individual level (id). I could work with these data using a mixed effect model for longitudinal logistic regression. However, I am most interested in optimizing prediction, therefore I'd really like to use a machine learning approach.

So far, I have tried logistic regression and random forest, to start with, however, they are not performing well when predicting daily_outcome = 1, even after balancing and oversampling techniques including SMOTE. I am assuming it's because these models do not account for the structure of the data, where observations coming from one individual are highly correlated (I've examined those - very high intraclass correlations).

What are some machine learning algorithms that worked for you in the past? Looking for advice from people who worked with this kind of data (seems like this would be a fairly common problem to work on, but it's my first time working on it).

Thank you so much in advance!

Topic time class-imbalance classification

Category Data Science

Donald S · Accepted Answer · 2020年6月19日 05:59

At first, choosing the best model is probably not as important as getting the data-set setup properly. Most of data science and modelling is spent on building the best data-set to use. Random Forest Classification, SVM, XGBClassifier, even Logistic regression can perform adequately if given the properly prepared data-set to analyze. Also, the metric you are comparing you model to should be appropriate for imbalanced data. Just a FYI, accuracy is not one of them. Here are just a few suggestions to get you started:

Data preparation/cleansing:

you can try more methods to balance, such as down-sampling (think this is usually better than up-sampling as the data you are using was real data, not synthetic). SMOTE is okay also, but can be a bit more complicated to use properly
you can try using a weighting function in your model to give more importance to the minority value, if you think a mistake in predicting that value is more important or more costly than a mistake in the majority, or just want to balance your predictions evenly between the classes
You can use PCA, SVD or LDA to remove some of the correlations between your features
Normalize the data as some algorithms still don't handle large differences in features very well (ex: feature one mean is .001 vs feature 2 mean is 1000). This is a good practice in any modelling.
Look for fliers. Sometimes fliers can cause problems. Look for values that don't make sense, or are too far out of normal range.

Metric to use:

I usually use ROC_AUC to find which model is best at predicting both majority and minority classes.
Sensitivity and Specificity are also 2 good metrics to use to see how good your model is at predicting each class

Appropriate Machine Learning algorithm for modeling clustered time-varying binary outcome

About