Lasso (or Ridge) vs Bayesian MAP
This is the first time I have posted here. I am looking for some feedback or perspective on this question.
To make it simple, let's just talk about linear models. We know the lasso solution, i.e. the least-squares objective with an $l_1$ penalty, is the same as the Bayesian MAP estimate with an independent Laplace prior on each parameter. I'll show it here for convenience.
For vector $Y$ with $n$ observations, matrix $X$, parameters $\beta$, and noise $\epsilon$
$$Y = X\beta + \epsilon,$$ the standard lasso parameter estimate is $$\hat{\beta}_{l_1} = \arg \min_\beta \sum_{i=1}^n (y_i - \beta^Tx_i)^2 + \lambda\sum_{j=1}^p|\beta_j|.$$
We can instead consider the MAP estimate $$\hat{\beta}_{MAP} = \arg \max_\beta \prod_{i=1}^n P(y_i \mid x_i, \beta)\,P(\beta),$$ where under mean-0 Normal noise $\epsilon$ with variance $\sigma^2$ and independent mean-0 Laplace priors with scale $b$ on the $\beta_j$ we have $$\hat{\beta}_{MAP} = \arg \max_\beta \log \left(\prod_{i=1}^n \dfrac{1}{\sigma\sqrt{2\pi}}e^{-\dfrac{(y_i-\beta^Tx_i)^2}{2\sigma^2}}\right) + \log\left(\prod_{j=1}^p\dfrac{1}{2b}e^{-\dfrac{|\beta_j|}{b}}\right),$$ and after dropping constants, multiplying through by $2\sigma^2$, and substituting $\lambda = \dfrac{2\sigma^2}{b}$ we obtain the MAP estimate $$\hat{\beta}_{MAP} = \arg \min_\beta \sum_{i=1}^n \left(y_i - \beta^Tx_i\right)^2 + \lambda \sum_{j=1}^p |\beta_j| = \hat{\beta}_{l_1}.$$
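(A quick numerical sanity check of the equivalence, purely a sketch with made-up toy values of $n$, $\sigma$, and $b$: minimize the negative log-posterior directly and compare to scikit-learn's lasso, whose objective is $\frac{1}{2n}\|y-X\beta\|_2^2 + \alpha\|\beta\|_1$, so the mapping is $\alpha = \lambda/(2n) = \sigma^2/(nb)$.)

```python
# Sketch: check lasso == Laplace-prior MAP numerically on simulated data.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, sigma, b = 200, 5, 1.0, 0.5                 # toy values, chosen arbitrarily
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + sigma * rng.normal(size=n)

# Negative log-posterior, up to additive constants
def neg_log_post(beta):
    return np.sum((y - X @ beta) ** 2) / (2 * sigma**2) + np.sum(np.abs(beta)) / b

beta_map = minimize(neg_log_post, np.zeros(p), method="Powell").x

alpha = sigma**2 / (n * b)                        # lambda = 2*sigma^2/b, rescaled by 1/(2n)
beta_lasso = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(np.round(beta_map, 3))                      # the two estimates should be close
print(np.round(beta_lasso, 3))
```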
However, I think there is a subtle nuance that is not captured here, and I was curious whether anyone else has insight on it. What I find in practice is that I can typically outperform the lasso by making insightful variable selections: essentially, I remove the variables for which I believe a regression will most likely only find high-variance parameter estimates. By doing so, I am placing a very informative prior on the variables I eliminate and a very uninformative prior on the variables I decide to keep. This was peculiar to me because I was casually aware of the proof above.
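To make "outperform" concrete, this is roughly the kind of comparison I mean, sketched on made-up data rather than my actual problem: OLS on a hand-picked subset of columns versus a cross-validated lasso on all of them, scored on held-out squared error.

```python
# Sketch: hand-picked variable selection + OLS vs. cross-validated lasso (toy data).
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p_signal, p_noise = 300, 3, 20
X = rng.normal(size=(n, p_signal + p_noise))
beta = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p_noise)])   # only the first 3 matter
y = X @ beta + rng.normal(scale=2.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Informative prior": keep only the columns I believe are worth estimating
keep = [0, 1, 2]
ols = LinearRegression().fit(X_tr[:, keep], y_tr)
mse_select = np.mean((y_te - ols.predict(X_te[:, keep])) ** 2)

lasso = LassoCV(cv=5).fit(X_tr, y_tr)
mse_lasso = np.mean((y_te - lasso.predict(X_te)) ** 2)

print(f"manual selection + OLS: {mse_select:.3f}")
print(f"cross-validated lasso:  {mse_lasso:.3f}")
```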
I think the main nuance is that $b$, the Laplace scale, is shared by all parameters in this framework, but when I am selecting variables it is not. I also briefly considered the implications of routinely rescaling all $X$ variables to mean 0 and unit variance, but I don't think that is enough to make the priors on $\beta$ all come from a Laplace with mean 0 and scale $b$.
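To illustrate what per-parameter scales would look like, here is a sketch (the weights $w_j$ are purely illustrative): a per-parameter penalty $\lambda w_j|\beta_j|$, i.e. a per-parameter Laplace scale $b_j = 2\sigma^2/(\lambda w_j)$, can be emulated with an ordinary single-$\lambda$ lasso by dividing column $j$ by $w_j$ and rescaling the fitted coefficients back.

```python
# Sketch: per-parameter Laplace scales as a weighted lasso via column rescaling.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, -1.0]) + rng.normal(size=n)

w = np.array([1.0, 1.0, 10.0, 0.1])    # large w_j ~ tight prior toward 0, small w_j ~ vague prior
X_scaled = X / w                       # divide column j by w_j
fit = Lasso(alpha=0.05, fit_intercept=False).fit(X_scaled, y)
beta = fit.coef_ / w                   # map back to the original parameterization
print(np.round(beta, 3))
```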
However, I have not shown that this is the case. I am not sure what happens to $\beta_j = \dfrac{Cov(X_j,Y)}{Cov(X_j,X_j)}$ when you transform $\tilde{X}_j = \dfrac{X_j - \bar{X}_j}{\sigma_{X_j}}$. Essentially, $\tilde{\beta}_j = \dfrac{Cov(\tilde{X}_j,Y)}{Cov(\tilde{X}_j,\tilde{X}_j)} = \dfrac{\sigma_{X_j}Cov(X_j,Y)}{Cov(X_j,X_j)} = \sigma_{X_j}\beta_j$. Does this rescaling give reasonable grounds to say all the $\beta_j$ come from the same Laplace distribution? If so, why do I still beat the lasso? Did I implement the lasso wrong, or cheat?
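A quick numerical check of that identity on simulated data (one predictor, sample moments in place of population ones):

```python
# Sketch: verify beta_tilde = sigma_X * beta for a standardized predictor (toy data).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=3.0, size=10_000)
y = 2.0 * x + rng.normal(size=10_000)

beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
x_std = (x - x.mean()) / x.std(ddof=1)
beta_tilde = np.cov(x_std, y)[0, 1] / np.var(x_std, ddof=1)

print(beta_tilde, x.std(ddof=1) * beta)   # the two should match
```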
This is the best explanation I can come up with for why my crude but informative variable-selection/Bayesian method outperforms the lasso even though the lasso is Bayes-esque. I am very curious whether anyone else has insights on other things to consider, or can show the effects of rescaling the $X$ variables. Thanks.
Topic: linear-models, lasso, theory, bayesian
Category: Data Science