Handling gaps in regression model
I'm facing a regression problem where I'm supposed to predict the delay of some trains. There's some peculiar particularity, however: a train is not considered delayed until it has more than 10 mins delays (its delay is 0 otherwise). Therefore, the distribution of target looks like a normal distribution but with a peak at 0.
I tried different approaches to solve the problem.
First approach I fitted some regressors on raw data but there are a lot of predictions in [0,10] interval which is not suitable.
Second approach I tried to make two models : one to predict the probability that a train will have a delay and another to predict the expected delay. The final result of my model is a multiplication of the outputs of the two models. I come across the problem that the RMSE obtained is even worse than the first model. I suspect that it is due to the huge cost of misclassification.
I'm wondering if there are standard methods to deal with such problems and what improvements can be done on what I already did.
Topic distribution methodology regression
Category Data Science