Neural Networks

Layers of simple weighted-sum-plus-nonlinearity units, chained together and trained by backpropagation — the architecture behind modern deep learning.

Each connection is a weight; each node sums its inputs and applies a nonlinearity before passing the result on

Definition

A neural network is built from simple units (neurons) arranged in layers. Each neuron takes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation function (like sigmoid or ReLU) before sending it to the next layer. Stack enough of these layers, and the network can approximate extremely complex functions — this is the basis of "deep learning."

A single neuron computing $\sigma(w \cdot x + b)$ is exactly logistic regression. A neural network is, in a sense, many logistic regressions chained and layered together, each layer learning increasingly abstract features from the previous one.

From logistic regression to a network

Logistic regression draws one straight decision boundary. A small neural network with one hidden layer can combine several straight boundaries (one per hidden neuron) to carve out a much more flexible, curved decision region — exactly what's needed for data that isn't linearly separable.

Try it

If every activation function in a network were the identity (no nonlinearity at all), what would happen to the network's expressive power, no matter how many layers it had?

Solution

It would collapse to being equivalent to a single linear (or affine) transformation — composing any number of linear functions just gives another linear function. The nonlinearity at each layer is exactly what allows depth to add genuine expressive power rather than being mathematically redundant.

Related concepts

Calculus· Multivariable

Gradient DescentAn iterative optimisation algorithm that repeatedly moves in the direction of the negative gradient to find a local minimum of a loss function.

Machine Learning· Supervised Learning

Logistic RegressionModelling the probability of a binary outcome using the sigmoid function — fitting by maximum likelihood or gradient descent.

Machine Learning· Model Training

RegularizationAdding a penalty on model complexity to prevent overfitting — L1 (Lasso) induces sparsity, L2 (Ridge) shrinks coefficients smoothly.

Calculus· Differentiation

Chain RuleHow to differentiate composite functions — the most frequently used rule in calculus, underpinning substitution in integration.