$R^2$ has a nice interpretation as the proportion of variance explained, except in nonlinear models, which include many machine learning staples, where that interpretation breaks down. However, $R^2$ is, in some sense, equivalent to square loss, so if you are comfortable evaluating your model using a measure of square loss like $MSE = \frac{1}{n}\sum_i(y_i-\hat y_i)^2$ or $RMSE = \sqrt{MSE}$, you should be comfortable using $R^2$.
A common criticism of $R^2$ is that it can be driven arbitrarily high, all the way up to $1$. Strictly speaking, $R^2$ cannot always reach $1$, since two observations can have the same features but different $y$ values, but $R^2$ certainly can be driven quite high just by including more and more features, whether they are related to the outcome or not.
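For instance, here is a quick sketch on simulated data (the sample size and feature counts are arbitrary, chosen only for illustration):

set.seed(1)
n <- 50
junk <- data.frame(matrix(rnorm(n * 40), nrow = n))  # 40 features unrelated to the outcome
y <- rnorm(n)                                        # outcome is pure noise
summary(lm(y ~ ., data = junk[, 1:5]))$r.squared     # modest
summary(lm(y ~ ., data = junk[, 1:40]))$r.squared    # much higher, despite the junk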
However, the same is true of $MSE$. Just look at the equation.
$$
\begin{aligned}
R^2 &= 1 - \dfrac{\sum_i \big(y_i - \hat y_i\big)^2}
{\sum_i \big(y_i - \bar y\big)^2} \\
&= 1 - \dfrac{n \times MSE}
{\sum_i \big(y_i - \bar y\big)^2} \\
&= 1 - \dfrac{n \times \big(RMSE\big)^2}
{\sum_i \big(y_i - \bar y\big)^2}
\end{aligned}
$$
For a fixed dataset, the denominator $\sum_i \big(y_i - \bar y\big)^2$ is constant, so $R^2$ is a decreasing function of $MSE$: whatever drives $MSE$ toward zero drives $R^2$ toward one. Consequently, this criticism of $R^2$ applies just as well to $MSE$ and $RMSE$.
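As a quick numeric check of that identity, here is a sketch on simulated data comparing $1 - n \times MSE / \sum_i \big(y_i - \bar y\big)^2$ against the $R^2$ that lm reports:

set.seed(1)
y <- rnorm(30)
x <- rnorm(30)
fit <- lm(y ~ x)
mse <- mean((y - fitted(fit))^2)
tss <- sum((y - mean(y))^2)
1 - length(y) * mse / tss  # matches...
summary(fit)$r.squared     # ...the R^2 reported by lm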
However, it is completely valid to evaluate $R^2$, $MSE$, and $RMSE$ with out-of-sample data. Then the criticism that you can drive performance high by including unrelated features does not apply; your out-of-sample performance will take a hit when you include too many unrelated features. The out-of-sample equations for $MSE$ and $RMSE$ are exactly the same as for in-sample, but $R^2$ is tweaked.
$$
R^2_{out} = 1 - \dfrac{\sum_i \big(y_i - \hat y_i\big)^2}
{\sum_i \big(y_i - \bar y_{in}\big)^2},
$$
where the sums run over the out-of-sample observations and $\bar y_{in}$ is the in-sample mean.
Using the in-sample mean keeps the spirit of the comparison: you evaluate your performance against the naïve model that always guesses the mean observed during training.
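In R, the tweaked formula is a one-liner; here is a minimal sketch (the function name r2_out and its argument names are mine, purely illustrative):

# Out-of-sample R^2, per the formula above; names are illustrative
r2_out <- function(y_out, preds_out, ybar_in){
  1 - sum((y_out - preds_out)^2) / sum((y_out - ybar_in)^2)
}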
Even in the OLS linear regression case, however, $R^2_{out}$ lacks the "proportion of variance explained" interpretation. Recall the decomposition $\sum_i \big(y_i - \bar y\big)^2 = \sum_i \big(y_i - \hat y_i\big)^2 + \sum_i \big(\hat y_i - \bar y\big)^2 + 2\sum_i \big(y_i - \hat y_i\big)\big(\hat y_i - \bar y\big)$; the cross term is the $Other$ term in the linked post on Cross Validated, and in-sample OLS with an intercept forces it to zero. Let's see an example where $Other$ is not equal to zero.
set.seed(2021)
N <- 100

# The cross term in the decomposition of the total sum of squares
other <- function(y, preds, ybar){
  sum((y - preds) * (preds - ybar))
}

# Simulate data, then split into 80 training and 20 holdout observations
x_all <- runif(N)
y_all <- 7 * x_all + rnorm(N)
x_in  <- x_all[1:80]
x_out <- x_all[81:100]
y_in  <- y_all[1:80]
y_out <- y_all[81:100]

# Fit OLS on the training data, then predict in- and out-of-sample
L <- lm(y_in ~ x_in)
preds_in  <- predict(L, data.frame(x_in = x_in))
preds_out <- predict(L, data.frame(x_in = x_out))

# Evaluate the cross term, using the in-sample mean both times
other_in  <- other(y_in, preds_in, mean(y_in))
other_out <- other(y_out, preds_out, mean(y_in))
other_in
other_out
I get an in-sample $Other = 1.3\times 10^{-13}$, which is zero up to floating-point error, but an out-of-sample $Other = -3.3$.
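To see why a nonzero $Other$ matters, continue the example. Out of sample, the "error-based" $R^2_{out}$ and the "proportion of variance explained" ratio no longer coincide; by the decomposition above, the gap is exactly $2 \times Other$ over the total sum of squares. A sketch reusing the objects defined above:

# The three sums of squares, all centered at the in-sample mean
sse_out <- sum((y_out - preds_out)^2)
ess_out <- sum((preds_out - mean(y_in))^2)
tss_out <- sum((y_out - mean(y_in))^2)
1 - sse_out / tss_out                # R^2_out, as defined above
ess_out / tss_out                    # "proportion of variance explained"
(ess_out + 2 * other_out) / tss_out  # identical to 1 - sse_out/tss_out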