Regularization
Adding a penalty on model complexity to prevent overfitting — L1 (Lasso) induces sparsity, L2 (Ridge) shrinks coefficients smoothly.
True weights are [3, -2, 0.4, 0, 0]: the last two features (weights w3, w4) are pure noise. Drag λ up: Lasso snaps the noise weights to exactly 0 (✓0), Ridge only shrinks them toward zero. Fit is computed in your browser via coordinate descent (Lasso) / gradient descent (Ridge), with no server round-trip.
Regularization adds a penalty to the model's loss function to discourage overfitting by keeping model parameters small.
The most common forms:
- L2 (Ridge): $\lambda \|w\|_2^2 = \lambda \sum_j w_j^2$
- L1 (Lasso): $\lambda \|w\|_1 = \lambda \sum_j |w_j|$
The hyperparameter $\lambda \ge 0$ controls how much regularization is applied: $\lambda = 0$ means no penalty; larger $\lambda$ shrinks weights more aggressively.
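To make the two penalties concrete, here is a minimal sketch in plain Python. It assumes the usual definitions, an L2 penalty of λ Σ w_j² and an L1 penalty of λ Σ |w_j|; the function names `l2_penalty` and `l1_penalty` are illustrative and not taken from any library.

```python
def l2_penalty(w, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(wj * wj for wj in w)


def l1_penalty(w, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(wj) for wj in w)


# Example weight vector (the same shape as the demo's true weights).
w = [3.0, -2.0, 0.4, 0.0, 0.0]

for lam in (0.0, 0.5, 2.0):
    print(lam, l2_penalty(w, lam), l1_penalty(w, lam))
```

With λ = 0 both penalties vanish, and both grow linearly in λ. The L2 penalty is dominated by the largest weights (they enter squared), while the L1 penalty charges every unit of weight at the same rate.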
Ordinary linear regression: minimize $\|y - Xw\|_2^2$. Ridge regression: minimize $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$.
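If input features are nearly collinear (multicollinearity), $X^\top X$ is close to singular and the OLS weights become large and unstable. Ridge shrinks all weights toward zero, stabilizing the solution. The ridge solution is $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$; for $\lambda > 0$ the inverse always exists because $X^\top X + \lambda I$ is positive definite.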
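The stabilizing effect can be seen on a tiny hand-made example. The sketch below assumes the ridge objective is the squared error plus λ times the squared weight norm, and solves the resulting two-feature system (XᵀX + λI) w = Xᵀy with the explicit 2×2 inverse. The data and the helper name `ridge_2d` are made up for illustration.

```python
def ridge_2d(x1, x2, y, lam):
    """Solve (X^T X + lam*I) w = X^T y for two features via the 2x2 inverse."""
    a = sum(u * u for u in x1) + lam          # (X^T X)[0][0] + lam
    b = sum(u * v for u, v in zip(x1, x2))    # (X^T X)[0][1]
    c = sum(v * v for v in x2) + lam          # (X^T X)[1][1] + lam
    p = sum(u * t for u, t in zip(x1, y))     # (X^T y)[0]
    q = sum(v * t for v, t in zip(x2, y))     # (X^T y)[1]
    det = a * c - b * b                       # near zero when lam = 0 and x1 ~ x2
    return [(c * p - b * q) / det, (a * q - b * p) / det]


# Two almost identical features; y is roughly x1 + x2 plus a small perturbation.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.01, 1.99, 3.01, 3.99]
y = [2.1, 3.9, 6.2, 7.8]

w_ols = ridge_2d(x1, x2, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
w_ridge = ridge_2d(x1, x2, y, lam=1.0)

print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
```

On this data the unpenalized fit lands near [-13, 15]: two large weights of opposite sign that mostly cancel. With λ = 1 both weights come out close to 1, which matches how y was constructed.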
What's the difference between L1 and L2 regularization in terms of the resulting weights? Which produces a sparser model?
Solution
L2 (Ridge) shrinks all weights proportionally toward zero but rarely sets them exactly to zero. The penalty is smooth and differentiable everywhere, leading to smooth shrinkage.
The L1 (Lasso) penalty has a kink at zero (it is non-differentiable there), which creates "corner solutions": many weights are pushed exactly to zero. Lasso therefore performs automatic feature selection by zeroing out irrelevant features.
For a problem with many irrelevant features, Lasso is usually preferred. When most features are at least somewhat relevant, Ridge is usually preferred. Elastic net (penalty $\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$) combines both.
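The contrast between exact zeros and smooth shrinkage can be reproduced with a short sketch of coordinate descent for Lasso and gradient descent for Ridge. It assumes the objectives ‖y − Xw‖² + λ‖w‖₁ and ‖y − Xw‖² + λ‖w‖₂². The toy design is made up: five mutually orthogonal ±1 columns and a fixed, hand-picked perturbation standing in for noise. All function names are illustrative.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lasso_cd(cols, y, lam, sweeps=100):
    """Coordinate descent for ||y - Xw||^2 + lam * ||w||_1 (cols = columns of X)."""
    w = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, xj in enumerate(cols):
            # Partial residual: leave feature j out of the current prediction.
            r = [yi - sum(w[k] * cols[k][i] for k in range(len(cols)) if k != j)
                 for i, yi in enumerate(y)]
            # One-dimensional Lasso problem in w[j] has a closed-form solution.
            w[j] = soft_threshold(dot(xj, r), lam / 2) / dot(xj, xj)
    return w


def ridge_gd(cols, y, lam, lr=0.01, steps=2000):
    """Plain gradient descent for ||y - Xw||^2 + lam * ||w||_2^2."""
    w = [0.0] * len(cols)
    for _ in range(steps):
        resid = [sum(w[k] * cols[k][i] for k in range(len(cols))) - yi
                 for i, yi in enumerate(y)]
        grad = [2 * dot(xj, resid) + 2 * lam * wj for xj, wj in zip(cols, w)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w


# Toy design: five mutually orthogonal +/-1 feature columns, eight samples.
cols = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, -1, 1, -1, 1],
]
w_true = [3.0, -2.0, 0.4, 0.0, 0.0]
# Hand-picked perturbation that leaks slightly into the two irrelevant features.
noise = [0.35, 0.45, -0.25, -0.15, -0.35, -0.45, 0.25, 0.15]
y = [sum(w_true[k] * cols[k][i] for k in range(5)) + noise[i] for i in range(8)]

w_lasso = lasso_cd(cols, y, lam=3.2)
w_ridge = ridge_gd(cols, y, lam=3.2)
print("Lasso:", w_lasso)
print("Ridge:", w_ridge)
```

With λ = 3.2 the Lasso fit comes out at roughly [2.8, -1.8, 0.2, 0, 0], with the last two weights exactly zero. Ridge gives roughly [2.14, -1.43, 0.29, 0.07, -0.04]: every weight is scaled down by the same factor, but none is zeroed. The uniform scaling is a consequence of the orthogonal toy design; with correlated features, Ridge shrinks different directions by different amounts.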