Why Should There Be Multiple Columns in Train Labels for One Model?

Question

Why Should There Be Multiple Columns in Train Labels for One Model?

Della

2022年3月17日 16:09

Going through the notebook on well known kaggle competition of favorita sales forecasting.

One puzzle is, after the data is split for train and testing, it seems y_train has two columns containing unit_sales and transactions, both of which are being predicted, and eventually compared with ground truths.

But why would someone pass these two columns to one model.fit() call instead of developing two models to predict the columns? Or is that what sklearn does internally anyway, i.e. training two models with one fit call? If not, to me it seems just one model for both will give suboptimal results, as the model can be confused between two distinct labels for each data point, and would not know which value to aim for in its weight updates.

Please correct me if I have any misunderstanding of the scenario.

Topic goodness-of-fit loss-function supervised-learning scikit-learn

Category Data Science

Ben Reiniger · Accepted Answer · 2022年3月17日 16:09

is that what sklearn does internally anyway, i.e. training two models with one fit call?

That depends on the model. The (penalized) linear regression models that the notebook starts with do basically that: in a multivariate linear regression, the coefficients for each target are separate, so that essentially you are estimating individual linear models. In these cases, the rest of your questions aren't applicable.

However, the random forest model does something more interesting. The tree chooses splits to decrease the impurity for all the targets at the same time. See sklearn's User Guide. The model won't be "confused" between two distinct labels: it's simultaneously estimating them both. However, the scale of the two targets may be important then, depending on the implementation details; if one target is on a larger scale than the other, its MSE will be larger, and so the tree will favor splits that reduce its error at the expense of the other target. See this example notebook that shows that effect in the sklearn implementation.

Why Should There Be Multiple Columns in Train Labels for One Model?

About