Handling gaps in regression model

I'm facing a regression problem where I have to predict the delay of trains. There is one peculiarity, however: a train is not considered delayed until it is more than 10 minutes late (its delay is recorded as 0 otherwise). Therefore, the distribution of the target looks like a normal distribution with an additional peak at 0.

I tried different approaches to solve the problem.

First approach: I fitted some regressors on the raw data, but many predictions fall in the [0, 10] interval, which is not suitable.

Second approach: I built two models, one to predict the probability that a train will be delayed and another to predict the expected delay. The final prediction is the product of the two models' outputs. The problem is that the resulting RMSE is even worse than that of the first model. I suspect this is due to the high cost of misclassification.

I'm wondering whether there are standard methods for dealing with such problems, and what improvements can be made to what I have already done.



You may want to look at "hurdle models". This type of model has two stages: first you predict whether a train will be delayed (classification), and then, only for delayed trains, you predict the size of the delay (regression, probably Poisson). I guess this is similar to your second approach. Still, it may be worth checking whether a standard hurdle model helps with your task.
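As a rough illustration of the two-stage idea, here is a minimal sketch using scikit-learn. The synthetic data, the choice of LogisticRegression and PoissonRegressor, the helper names and the 0.5 threshold are all assumptions made for illustration, not something taken from the question. A textbook hurdle model would use a zero-truncated distribution in the second stage, which this sketch does not do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor

# Synthetic stand-in for the train data (assumption): delays of 10 minutes
# or less are recorded as 0, as described in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
latent = 8 + 6 * X[:, 0] + rng.normal(scale=3, size=500)
y = np.where(latent > 10, latent, 0.0)


def fit_two_stage(X, y):
    """Fit a classifier for 'delayed or not' and a regressor on delayed rows only."""
    delayed = y > 0
    clf = LogisticRegression().fit(X, delayed)
    reg = PoissonRegressor().fit(X[delayed], y[delayed])
    return clf, reg


def predict_two_stage(clf, reg, X, threshold=None):
    """Combine the two stages, either softly (product) or with a hard gate."""
    p = clf.predict_proba(X)[:, 1]   # estimated P(delayed)
    mu = reg.predict(X)              # estimated delay, given that the train is delayed
    if threshold is None:
        return p * mu                # estimate of the expected delay
    return np.where(p >= threshold, mu, 0.0)  # either 0 or the second-stage estimate


clf, reg = fit_two_stage(X, y)
soft = predict_two_stage(clf, reg, X)
hard = predict_two_stage(clf, reg, X, threshold=0.5)
```

The product p * mu estimates the expected delay, which is what RMSE rewards, but it can still take values between 0 and 10. The hard gate returns either 0 or the second-stage estimate, at the price of larger errors whenever the classifier is wrong.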

Alternatively, you could look into Generalized Additive Models (GAMs), which can capture highly non-linear data-generating processes. A GAM may be able to pick up the bunched distribution you plotted.

If you have the data, make sure you include the train's delay at the previous stop to account for the "feed forward" of delays. Overall, train delays often follow a "Markov"-like pattern.
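If the data comes as one row per train and stop, such a lag feature could be built roughly as follows with pandas. The column names (train_id, stop_seq, delay_min) and the toy values are hypothetical, and filling the first stop with 0 is an assumption that may not suit every dataset.

```python
import pandas as pd

# Toy data (hypothetical): one row per train and stop, in minutes of delay.
stops = pd.DataFrame({
    "train_id": ["A", "A", "A", "B", "B"],
    "stop_seq": [1, 2, 3, 1, 2],
    "delay_min": [0.0, 12.0, 15.0, 0.0, 0.0],
})

# Order the stops within each train, then shift the delay by one stop so
# that every row sees the delay observed at the previous stop.
stops = stops.sort_values(["train_id", "stop_seq"])
stops["prev_stop_delay"] = (
    stops.groupby("train_id")["delay_min"].shift(1).fillna(0.0)
)
```

Grouping by train before shifting keeps the last stop of one train from leaking into the first stop of the next.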

Chapter 7.4 (and following) of "An Introduction to Statistical Learning" covers GAMs in more detail: https://www.statlearning.com/
