I apologize for the delay - I'm currently going through the self-deprecating process of applying to graduate school and have been a little busy with some other projects. As promised, here's a more in-depth explanation of the lasso. Note, I'm a little loose with the language I use in my explanation. Overall it's correct, but I would definitely get some points marked off on an exam.
The lasso is just another regression and happens to be quite useful in machine learning. If you don't have a formal maths background (like me) you might be thinking "what is this voodoo stats technique that smart people use?" I first learned about it while using GIFT and black-boxed its use at the time. Then it came back to me while I was learning about machine learning and I decided to see what's under the hood. It's really nothing wild. Let me put it this way: if you can understand $Y = mx + b$, you can understand any regression (Nick Lazich, 2017). This holds true for the lasso.
Let's start with the ordinary least squares (OLS) regression equation, better known as the linear regression we're all familiar with. If you aren't, take a peek here. Straightforward enough. $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \epsilon$$
And to get the beta estimate (beta hat), all it is is the covariance between x and y divided by the variance of x. This beta estimate is then plugged back into the equation above and you're on your way to predicting that sweet sweet Y-value.
$$\hat\beta = \frac{\sum^n_{i=1}(y_i-\overline{y})(x_i - \overline{x})}{\sum^n_{i=1}(x_i - \overline{x})^2}$$
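If it helps to see that formula in code, here's a quick sketch in Python with NumPy - the toy data and variable names are mine for illustration, not from the tutorial:

```python
import numpy as np

# Toy data: one predictor x and a response y (made-up numbers, just for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Beta hat = covariance(x, y) / variance(x), exactly as in the equation above
beta_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

# The intercept then follows from the means
beta_0 = y.mean() - beta_hat * x.mean()

print(beta_hat, beta_0)  # should land near the true values of 2.0 and 1.0
```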
Let’s make this beta estimate just a wee bit more complicated.
$$\hat{\beta}^{lasso} = \underset{\beta}{\operatorname{argmin}} \left\{ \frac{1}{2} \sum^N_{i=1}\left(y_i - \beta_0 - \sum^p_{j=1}x_{ij}\beta_j\right)^2 + \lambda \sum^p_{j=1}|\beta_j| \right\}$$
Okay, the equation might look a little intimidating - let's break it down. The left side is really just the sum of squared errors, the same quantity OLS minimizes (and minimizing it on its own gets you that covariance-over-variance beta from before). Nothing new here. The thing to note is the right side: the larger the betas get, the larger the penalty that gets added on.
We'll call this right side of the equation the L1 penalty. The larger the lambda value, the bigger the penalty applied to the betas. Together, the two terms make up the loss function that the lasso minimizes. For some features the penalty is large enough to push the beta all the way to 0.
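To make the two pieces concrete, here's a small sketch that just evaluates the lasso objective for a fixed set of betas at a few lambda values - the function name `lasso_objective` and the simulated data are my own, not part of the lasso itself:

```python
import numpy as np

def lasso_objective(beta_0, beta, X, y, lam):
    """Value of the lasso objective: squared-error term plus the L1 penalty."""
    residuals = y - beta_0 - X @ beta
    sse_term = 0.5 * np.sum(residuals ** 2)   # the "OLS" part on the left
    l1_penalty = lam * np.sum(np.abs(beta))   # the L1 penalty on the right
    return sse_term + l1_penalty

# Same betas, growing lambda: the penalty (and so the total loss) grows with it
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
true_beta = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.3, size=50)

for lam in [0.0, 1.0, 10.0]:
    print(lam, lasso_objective(0.0, true_beta, X, y, lam))
```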
If you're following along with my machine learning for neuroimagers tutorial, let's take the morphometry feature set. The lasso will shrink the coefficients of the features (the betas), pushing the least useful ones all the way to 0. This is how feature selection is accomplished. The larger the lambda value, the more features that will be pushed to 0. If lambda is 0, no penalty is applied and the lasso becomes an ordinary OLS regression.
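Here's roughly what that looks like in practice. I'm using scikit-learn here (which calls lambda `alpha`) with simulated data standing in for a feature set like the morphometry one - so treat this as a sketch, not the tutorial's actual code:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated stand-in for a feature set: 20 features, only the first 3 truly matter
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(scale=0.5, size=200)

# As alpha (lambda) grows, more coefficients get pushed to exactly 0
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} of 20 coefficients are exactly 0")
```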
Yep, that’s all there is to it. A simple and powerful tool. In my code I cross-validate the lambda to find the optimal value rather than selecting one from prior knowledge.
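For a rough idea of what cross-validating lambda looks like, here's a sketch with scikit-learn's `LassoCV` - again the data and settings are invented for illustration, not lifted from my tutorial code:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated data again; in practice this would be your feature matrix and labels
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(scale=0.5, size=200)

# LassoCV picks alpha (lambda) by k-fold cross-validation over an automatic grid
cv_model = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen alpha:", cv_model.alpha_)
print("features kept:", int((cv_model.coef_ != 0).sum()), "of", X.shape[1])
```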