Optimizing for MAE degrades the MAE metric

I have trained a LightGBM regression model, optimizing on RMSE and measuring performance on RMSE:

from lightgbm import LGBMRegressor

model = LGBMRegressor(objective='regression', n_estimators=500, n_jobs=8)
model.fit(X_train, y_train, eval_metric='rmse', eval_set=[(X_train, y_train), (X_test, y_test)], early_stopping_rounds=20)

The model keeps improving over the 500 iterations. Here is the performance I obtain in terms of MAE:

MAE on train: 1.080571
MAE on test: 1.258383
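
For reference, a minimal sketch of how such MAE figures can be computed, assuming scikit-learn and the fitted model and splits from above:

from sklearn.metrics import mean_absolute_error

# Evaluate the fitted model on both splits with the metric we actually care about
print('MAE on train:', mean_absolute_error(y_train, model.predict(X_train)))
print('MAE on test:', mean_absolute_error(y_test, model.predict(X_test)))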

But the metric I'm really interested in is MAE, so I decided to optimize it directly (and to use it as the evaluation metric):

model = LGBMRegressor(objective='regression_l1', n_estimators=500, n_jobs=8)
model.fit(X_train, y_train, eval_metric='mae', eval_set=[(X_train, y_train), (X_test, y_test)], early_stopping_rounds=20)

Against all odds, the MAE gets worse on both train and test:

MAE on train: 1.277689
MAE on test: 1.285950

When I look at the training logs, the model seems stuck in a local minimum and doesn't improve after about 100 trees. Could the problem be linked to the non-differentiability of MAE?

Here are the learning curves:

[Figure: MAE evolution when optimizing RMSE]

[Figure: MAE evolution when optimizing MAE]



My guess would be that this is due to the differences between the two measures: compared to MAE, RMSE gives more weight to large errors because of the square. As a result, a model optimized on RMSE has a strong incentive to correct its predictions when they are far off the true value, even if these cases are infrequent. By contrast, a model optimized on MAE tends to favour getting the prediction right for as many instances as possible.
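
To make the difference concrete, here is a small sketch (toy residuals, not the original data) comparing the per-instance gradients of the two losses: the squared-error gradient grows with the residual, while the absolute-error gradient only carries its sign, so a single large error pulls the L2 objective much harder than it pulls the L1 objective.

import numpy as np

residuals = np.array([0.1, -0.2, 0.15, 5.0])   # one large error among small ones

# Gradients of each loss with respect to the prediction:
grad_l2 = 2 * residuals        # squared error: proportional to the residual
grad_l1 = np.sign(residuals)   # absolute error: only the sign, magnitude 1

print(grad_l2)   # [ 0.2 -0.4  0.3 10. ]  -> the large error dominates
print(grad_l1)   # [ 1. -1.  1.  1.]      -> every instance counts equally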

So my hypothesis would be that the model optimized on RMSE happens to find better parameters by attacking the large errors first, whereas the model optimized on MAE ends up in a state where it cannot improve on the few cases with large errors without sacrificing the many cases with small errors. It should be possible to check this by observing which instances the two models predict differently and by how much, as sketched below.
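
A minimal sketch of such a check, assuming the two fitted models from the question are available under the hypothetical names model_rmse and model_l1:

import numpy as np

pred_rmse = model_rmse.predict(X_test)   # model trained with objective='regression'
pred_l1 = model_l1.predict(X_test)       # model trained with objective='regression_l1'

err_rmse = np.abs(np.asarray(y_test) - pred_rmse)
err_l1 = np.abs(np.asarray(y_test) - pred_l1)

# Instances where the two models disagree the most
disagreement = np.abs(pred_rmse - pred_l1)
worst = np.argsort(disagreement)[-10:]
print(err_rmse[worst])
print(err_l1[worst])

# Error distribution: does the L1 model trade a few large errors for many small ones?
print(np.percentile(err_rmse, [50, 90, 99]))
print(np.percentile(err_l1, [50, 90, 99]))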

I would also note that the MAE-optimized model doesn't overfit as much as the RMSE one. So I'm not sure that the RMSE model is generally much better than the MAE one, given that the performance difference on the test set is not that high.
