Regularization

Adding a penalty on model complexity to prevent overfitting — L1 (Lasso) induces sparsity, L2 (Ridge) shrinks coefficients smoothly.

Interactive demo: L1 (Lasso) vs L2 (Ridge), fit live on a 5-feature dataset as λ changes. Snapshot at λ = 0.050: Lasso train MSE 0.149, Ridge train MSE 0.369.

True weights are [3, -2, 0.4, 0, 0], so the last two features (weights w3 and w4, counting from w0) are pure noise. As λ increases, Lasso sets the noise weights to exactly 0, while Ridge only shrinks them toward zero. The fit is computed in the browser via coordinate descent (Lasso) and gradient descent (Ridge), with no server round-trip.
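The demo's behaviour can be approximated offline. The sketch below is not the page's own code: the synthetic data, the noise level, the seed, the objective scaling (a 1/(2n) factor on the squared error), the step size and the iteration counts are all assumptions chosen for illustration. Only the true weights [3, -2, 0.4, 0, 0] and the choice of solvers come from the text above.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam, returning exactly 0 when |rho| <= lam."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 (assumed scaling)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, lam) / z
    return w

def ridge_gradient_descent(X, y, lam, lr=0.1, n_steps=2000):
    """Minimize (1/(2n)) * ||y - Xw||^2 + (lam/2) * ||w||^2 (assumed scaling)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = -X.T @ (y - X @ w) / n + lam * w
        w -= lr * grad
    return w

# Hypothetical dataset: 200 samples, 5 standard-normal features.
rng = np.random.default_rng(0)
true_w = np.array([3.0, -2.0, 0.4, 0.0, 0.0])
X = rng.normal(size=(200, 5))
y = X @ true_w + 0.5 * rng.normal(size=200)

lam = 0.2
w_lasso = lasso_coordinate_descent(X, y, lam)
w_ridge = ridge_gradient_descent(X, y, lam)
print("lasso:", np.round(w_lasso, 3))
print("ridge:", np.round(w_ridge, 3))
```

Under these assumptions the Lasso weights for the two noise features should come out as exact zeros, because the soft-threshold step returns 0 whenever the feature's correlation with the residual is below λ, whereas gradient descent on the Ridge objective leaves them small but nonzero.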

Definition

Regularization adds a penalty to the model's loss function to discourage overfitting by keeping model parameters small.

The most common forms:

L2 (Ridge): $\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\mathbf{w}\|_2^2 = \mathcal{L} + \lambda \sum_j w_j^2$
L1 (Lasso): $\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\mathbf{w}\|_1 = \mathcal{L} + \lambda \sum_j |w_j|$

The hyperparameter $\lambda \geq 0$ controls the strength of regularization: $\lambda = 0$ means no penalty; larger $\lambda$ shrinks weights more aggressively.
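As a minimal sketch of the two formulas, the functions below compute the penalized loss for a linear model. Mean squared error is assumed as the base loss $\mathcal{L}$, and all function names are hypothetical, chosen only for this illustration.

```python
import numpy as np

def mse(X, y, w):
    """Base loss L (an assumption here): mean squared error of X @ w."""
    residual = y - X @ w
    return float(np.mean(residual ** 2))

def l2_penalty(w):
    """Squared L2 norm: sum of squared weights."""
    return float(np.sum(w ** 2))

def l1_penalty(w):
    """L1 norm: sum of absolute weights."""
    return float(np.sum(np.abs(w)))

def regularized_loss(X, y, w, lam, kind="l2"):
    """Return L + lam * penalty, where kind selects "l1" or "l2"."""
    if kind == "l2":
        penalty = l2_penalty(w)
    elif kind == "l1":
        penalty = l1_penalty(w)
    else:
        raise ValueError(f"unknown penalty kind: {kind!r}")
    return mse(X, y, w) + lam * penalty

# Penalties for the demo's true weight vector.
w = np.array([3.0, -2.0, 0.4, 0.0, 0.0])
print("L1 penalty:", l1_penalty(w))
print("L2 penalty:", l2_penalty(w))
```

With `lam=0` the function returns the unpenalized loss, matching the $\lambda = 0$ case above.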

Ridge regression

Ordinary linear regression: minimize $\|y - Xw\|^2$. Ridge regression: minimize $\|y - Xw\|^2 + \lambda\|w\|^2$.

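If input features are nearly collinear (multicollinearity), $X^TX$ is close to singular and the OLS weights become very large and unstable. Ridge shrinks all weights toward zero, stabilizing the solution. The ridge solution is $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$. For any $\lambda > 0$ the matrix $X^TX + \lambda I$ is invertible, because $X^TX$ is positive semidefinite and adding $\lambda I$ makes the sum positive definite.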
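The closed-form solution can be sketched in a few lines of NumPy. The two-feature dataset below is made up for illustration (the second column is almost a copy of the first), and `ridge_closed_form` is a hypothetical helper name. It solves the linear system rather than forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical nearly collinear data: x2 is x1 plus a tiny perturbation.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=200)

# Condition number of the matrix being inverted, without and with lam * I.
cond_ols = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))
w_ridge = ridge_closed_form(X, y, lam=1.0)

print("cond(X^T X)       :", cond_ols)
print("cond(X^T X + I)   :", cond_ridge)
print("ridge weights     :", w_ridge)
```

The condition number drops by many orders of magnitude once $\lambda I$ is added, which is the stabilizing effect described above. In this example the ridge weights split the shared signal roughly evenly between the two near-duplicate features.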

Try it

What's the difference between L1 and L2 regularization in terms of the resulting weights? Which produces a sparser model?

Solution

L2 (Ridge) shrinks all weights toward zero (by a common factor when the features are orthonormal) but rarely sets them exactly to zero. The penalty is smooth and differentiable everywhere, leading to smooth shrinkage.

L1 (Lasso) has a kink at zero (non-differentiable), which creates a "corner solution" — many weights are pushed exactly to zero. Lasso performs automatic feature selection by zeroing irrelevant features.

For a problem with many irrelevant features, Lasso is preferred. When most features are at least somewhat relevant, Ridge is preferred. Elastic net ($\alpha L_1 + (1-\alpha) L_2$) combines both penalties.
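A one-dimensional worked case makes the difference concrete. Assume, purely for illustration, a single weight $w$ with base loss $\mathcal{L}(w) = \tfrac{1}{2}(w - z)^2$, where $z$ is the unregularized optimum.

Ridge: setting the derivative of $\tfrac{1}{2}(w - z)^2 + \lambda w^2$ to zero gives $w - z + 2\lambda w = 0$, so

$$\hat{w}_{\text{ridge}} = \frac{z}{1 + 2\lambda}$$

Lasso: minimizing $\tfrac{1}{2}(w - z)^2 + \lambda |w|$ gives the soft-thresholding rule

$$\hat{w}_{\text{lasso}} = \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0)$$

With $\lambda = 0.5$: for a strong signal $z = 3$, Ridge gives $1.5$ and Lasso gives $2.5$; for a weak signal $z = 0.3$, Ridge gives $0.15$ and Lasso gives exactly $0$. Ridge rescales every weight by the same factor and so never reaches zero unless $z = 0$, while Lasso subtracts a fixed amount and zeroes anything with $|z| \leq \lambda$.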

Related concepts

Needs first

Gradient Descent, Linear Regression

Cross-Validation, Decision Trees, Generalized Linear Models, Logistic Regression, Norms, Support Vector Machines
