Multi-target regression tree with additional constraint
I have a regression problem where I need to predict three dependent variables ($y$) based on a set of independent variables ($x$): $$ (y_1,y_2,y_3) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n +u. $$
To solve this problem, I would prefer to use tree-based models (e.g. gradient boosting or random forest), since the independent variables ($x$) are correlated and the problem is non-linear with an ex-ante unknown parameterization.
I know that I could use sklearn's MultiOutputRegressor() or RegressorChain() as a meta-estimator.
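For concreteness, a minimal sketch of the unconstrained setup with MultiOutputRegressor (the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))           # x_1, ..., x_5 (synthetic)
Y = np.column_stack([
    X[:, 0] + X[:, 1],                  # y_1 (illustrative targets)
    X[:, 1] * X[:, 2],                  # y_2
    X[:, 0] - X[:, 3],                  # y_3
]) + 0.1 * rng.normal(size=(500, 3))

# One independent forest per target; RegressorChain would instead
# feed earlier target predictions into the later ones.
model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=200, random_state=0)
)
model.fit(X, Y)
Y_hat = model.predict(X)
print(Y_hat.shape)  # (500, 3)
```

(Note that RandomForestRegressor actually supports multiple outputs natively, so the wrapper is mainly needed for estimators such as gradient boosting that are single-output only.)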
However, there is an additional twist to my problem, namely that I do know that $y_1 + y_2 - y_3 = x_1$.
In other words, there is a fixed relation between the three $y$ and one of the independent variables. So essentially, the value of $x_1$ needs to be distributed in a first-best manner to the (unknown) targets $(y_1,y_2,y_3)$ for each observation, contingent on the remaining independent variables $x_2,\dots,x_n$.
Of course, a naive approach would be to squeeze the predicted values together somehow so that they satisfy $\hat{y_1} + \hat{y_2} - \hat{y_3} = x_1$. However, I wonder if there are any other options to introduce a hard constraint such as $\hat{y_1} + \hat{y_2} - \hat{y_3} = x_1$ into some (tree-based) estimator.
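To make the "squeeze" explicit: one way to do it is a Euclidean projection of each prediction onto the constraint hyperplane $a^\top y = x_1$ with $a = (1, 1, -1)$, i.e. $\hat{y} \leftarrow \hat{y} - (a^\top \hat{y} - x_1)\, a / \lVert a \rVert^2$. The function name below is my own; this is just a sketch of the post-hoc correction:

```python
import numpy as np

def project_to_constraint(Y_hat, x1):
    """Minimal L2 adjustment so that y1 + y2 - y3 = x1 holds exactly.

    Projects each predicted row onto the hyperplane a^T y = x1
    with a = (1, 1, -1):  y <- y - (a^T y - x1) * a / ||a||^2.
    """
    a = np.array([1.0, 1.0, -1.0])
    residual = Y_hat @ a - x1                     # constraint violation per row
    return Y_hat - np.outer(residual / a.dot(a), a)

Y_hat = np.array([[1.0, 2.0, 1.5],
                  [0.5, 0.5, 0.0]])
x1 = np.array([2.0, 0.7])
Y_adj = project_to_constraint(Y_hat, x1)
print(Y_adj @ np.array([1.0, 1.0, -1.0]))  # matches x1 up to float error
```

This is the smallest correction in the L2 sense, but it spreads the adjustment equally across the three targets regardless of $x_2,\dots,x_n$, which is why I suspect it is not first-best.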
I noticed this post. However, I would prefer a tree-based method for the reasons mentioned above.