Regularization

Adding a penalty on model complexity to prevent overfitting — L1 (Lasso) induces sparsity, L2 (Ridge) shrinks coefficients smoothly.

L1 (Lasso) vs L2 (Ridge): fit live on a 5-feature dataset as λ changes (snapshot at λ = 0.050)

Weight        Lasso (L1)    Ridge (L2)
w0            2.73          2.25
w1            -1.70         -1.14
w2            0.00 ✓0       0.13
w3 (noise)    0.00 ✓0       0.07
w4 (noise)    0.00 ✓0       -0.02
Train MSE     0.149         0.369

True weights are [3, -2, 0.4, 0, 0] — features w3, w4 are pure noise. Drag λ up: Lasso snaps the noise weights to exactly 0 (✓0), Ridge only shrinks them toward zero. Fit is computed in your browser via coordinate descent (Lasso) / gradient descent (Ridge) — no server round-trip.
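The Lasso half of that fit can be sketched in a few lines of plain Python. This is a minimal illustration, not the page's actual implementation: it assumes the objective is scaled as (1/(2n))·||y − Xw||² + λ·||w||₁, uses a synthetic dataset generated with a fixed seed, and the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical.

```python
import random

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; return exactly 0.0 inside [-lam, lam]."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 (assumed scaling)
    by cyclic coordinate descent, starting from w = 0."""
    n, d = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(d)]
    w = [0.0] * d
    residual = list(y)  # equals y - Xw, and w is all zeros at the start
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            z = sum(x * x for x in col) / n
            # Correlation of feature j with the residual that leaves feature j out
            rho = sum(x * (r + w[j] * x) for x, r in zip(col, residual)) / n
            new_wj = soft_threshold(rho, lam) / z
            # Keep the residual in sync with the updated weight
            delta = new_wj - w[j]
            residual = [r - delta * x for r, x in zip(residual, col)]
            w[j] = new_wj
    return w

# Synthetic data (an assumption for this sketch): same true weights as above,
# standard-normal features, Gaussian noise with standard deviation 0.3.
rng = random.Random(0)
true_w = [3.0, -2.0, 0.4, 0.0, 0.0]
X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(200)]
y = [sum(wj * xj for wj, xj in zip(true_w, row)) + rng.gauss(0, 0.3) for row in X]

w_lasso = lasso_coordinate_descent(X, y, lam=0.2)
print([round(v, 2) for v in w_lasso])
```

The exact zeros come from `soft_threshold`: whenever a feature's correlation with the remaining residual is no larger than λ in absolute value, its weight is set to 0.0 rather than to a small number.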

Definition

Regularization adds a penalty to the model's loss function to discourage overfitting by keeping model parameters small.

The most common forms:

  • L2 (Ridge): $\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\mathbf{w}\|^2 = \mathcal{L} + \lambda \sum_j w_j^2$
  • L1 (Lasso): $\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\mathbf{w}\|_1 = \mathcal{L} + \lambda \sum_j |w_j|$

The hyperparameter $\lambda \geq 0$ controls how much regularization is applied: $\lambda = 0$ means no penalty; larger $\lambda$ shrinks weights more aggressively.
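As a small illustration of how the penalty is added to a loss, the sketch below assumes mean squared error as the unregularized loss (the definition above leaves the loss generic). The helper names `mse`, `l1_penalty`, `l2_penalty`, and `regularized_loss` are hypothetical.

```python
def mse(X, y, w):
    """Mean squared error of the linear model with weights w."""
    errors = [sum(wj * xj for wj, xj in zip(w, row)) - t for row, t in zip(X, y)]
    return sum(e * e for e in errors) / len(y)

def l1_penalty(w):
    """Sum of absolute weights."""
    return sum(abs(wj) for wj in w)

def l2_penalty(w):
    """Sum of squared weights."""
    return sum(wj * wj for wj in w)

def regularized_loss(X, y, w, lam, penalty):
    """Data loss plus lam times the chosen penalty."""
    return mse(X, y, w) + lam * penalty(w)

# Toy data that w fits exactly, so the data loss is 0 and only the penalty remains.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, -4.0]
w = [3.0, -4.0]

print(regularized_loss(X, y, w, lam=0.0, penalty=l2_penalty))  # no penalty
print(regularized_loss(X, y, w, lam=0.1, penalty=l1_penalty))  # 0.1 * (3 + 4)
print(regularized_loss(X, y, w, lam=0.1, penalty=l2_penalty))  # 0.1 * (9 + 16)
```

Because the weights here fit the data perfectly, any nonzero λ makes the regularized loss strictly larger, which is the pressure that pulls an optimizer toward smaller weights.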

Ridge regression

Ordinary linear regression: minimize $\|y - Xw\|^2$. Ridge regression: minimize $\|y - Xw\|^2 + \lambda\|w\|^2$.

If input features are nearly collinear (multicollinearity), $X^TX$ is close to singular and the OLS weights become large and unstable. Ridge shrinks all weights toward zero, stabilizing the solution. The ridge solution is $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$. For any $\lambda > 0$ the inverse exists, because $X^TX$ is positive semidefinite and $\lambda I$ is positive definite, so their sum is positive definite.
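The effect is easy to check on an extreme case where two features are exact copies of each other. The sketch below writes out the closed-form solution by hand for exactly two features so the algebra stays visible; the function name `ridge_2d` and the toy data are assumptions for illustration, and a real implementation would use a linear-algebra library instead.

```python
def ridge_2d(X, y, lam):
    """Closed-form ridge weights (X^T X + lam*I)^{-1} X^T y for two features."""
    # Entries of the symmetric 2x2 matrix A = X^T X + lam * I
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    # Entries of the vector g = X^T y
    g0 = sum(r[0] * t for r, t in zip(X, y))
    g1 = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("X^T X + lam*I is singular; no unique solution")
    # Apply the explicit 2x2 inverse to g
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]

# Two identical columns: perfectly collinear features, with y = 2 * x.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

try:
    ridge_2d(X, y, lam=0.0)  # plain OLS: the matrix is singular
except ValueError as err:
    print("lam = 0:", err)

print("lam = 1:", ridge_2d(X, y, lam=1.0))
```

With λ = 0 the determinant is zero and no unique solution exists. With λ = 1 the determinant becomes 29 and ridge splits the weight equally between the two copies, 28/29 each, so their sum is slightly below the true slope of 2.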

Try it

What's the difference between L1 and L2 regularization in terms of the resulting weights? Which produces a sparser model?

Solution

L2 (Ridge) shrinks all weights toward zero but rarely sets any of them exactly to zero. (The shrinkage is exactly proportional only in special cases, such as orthonormal features.) The penalty is smooth and differentiable everywhere, so the shrinkage is smooth.

The L1 (Lasso) penalty has a kink at zero (it is non-differentiable there), which creates a "corner solution": many weights are pushed exactly to zero. Lasso therefore performs automatic feature selection by zeroing out irrelevant features.
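The contrast is easiest to see in one dimension. As an illustrative setup (not taken from the text above), assume a single weight $w$ and a squared-error term $\tfrac12 (w - z)^2$, where $z$ is the unregularized solution. Minimizing with each penalty gives:

```latex
\begin{aligned}
\text{L2:}\quad & \min_w \; \tfrac12 (w - z)^2 + \lambda w^2
  &&\Rightarrow\quad \hat{w} = \frac{z}{1 + 2\lambda} \\
\text{L1:}\quad & \min_w \; \tfrac12 (w - z)^2 + \lambda |w|
  &&\Rightarrow\quad \hat{w} = \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0)
\end{aligned}
```

For example, with $z = 0.3$ and $\lambda = 0.5$, the L2 solution is $0.3 / 2 = 0.15$, which is small but nonzero, while the L1 solution is exactly $0$ because $|z| \leq \lambda$.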

For a problem with many irrelevant features, Lasso is preferred. When most features are at least somewhat relevant, Ridge is preferred. Elastic net, with penalty $\alpha L_1 + (1-\alpha) L_2$ for $0 \leq \alpha \leq 1$, combines both.
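A minimal sketch of the elastic net mix, assuming the combined penalty is multiplied by a single overall strength `lam` (an assumption; libraries may scale the two terms differently). The function name `elastic_net_penalty` is hypothetical.

```python
def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) * L2), with 0 <= alpha <= 1."""
    l1 = sum(abs(wj) for wj in w)   # Lasso term
    l2 = sum(wj * wj for wj in w)   # Ridge term
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

w = [3.0, -4.0]
print(elastic_net_penalty(w, lam=1.0, alpha=1.0))  # pure L1
print(elastic_net_penalty(w, lam=1.0, alpha=0.0))  # pure L2
print(elastic_net_penalty(w, lam=1.0, alpha=0.5))  # equal mix of the two
```

Setting `alpha` to 1 or 0 recovers the pure Lasso or pure Ridge penalty, so the mixing weight can be tuned alongside the overall strength.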

Related concepts