Naive Bayes

A probabilistic classifier that applies Bayes' theorem with the (often unrealistic) assumption that features are conditionally independent given the class.

[Interactive figure: Naive Bayes, posterior ∝ prior × likelihood. At x = 3.5 the two classes are tied: P(C₀ | x) = 50.0% and P(C₁ | x) = 50.0%.]

Definition

Naive Bayes is a probabilistic classifier based on Bayes' theorem, with the "naive" assumption that features are conditionally independent given the class.

For class $C$ and features $x_1, \ldots, x_d$:

$P(C \mid x_1, \ldots, x_d) \propto P(C) \cdot \prod_{j=1}^d P(x_j \mid C)$

The predicted class is $\hat{y} = \arg\max_C P(C) \prod_{j=1}^d P(x_j \mid C)$.
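The decision rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `predict`, the lookup-table layout, and all numbers are assumptions made up for the example. It sums logarithms instead of multiplying probabilities, which gives the same argmax while avoiding underflow when there are many features.

```python
import math

def predict(priors, likelihoods, x):
    """Return the highest-scoring class and the per-class log scores.

    priors:      {class: P(C)}
    likelihoods: {class: [{value: P(x_j = value | C)} for each feature j]}
    x:           list of observed feature values
    Assumes every probability looked up is strictly positive.
    """
    scores = {}
    for c, prior in priors.items():
        # log P(C) + sum_j log P(x_j | C)
        score = math.log(prior)
        for j, value in enumerate(x):
            score += math.log(likelihoods[c][j][value])
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical two-class, two-feature tables (illustrative numbers only).
priors = {"a": 0.5, "b": 0.5}
likelihoods = {
    "a": [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7}],
    "b": [{0: 0.4, 1: 0.6}, {0: 0.9, 1: 0.1}],
}
label, scores = predict(priors, likelihoods, [1, 1])
print(label, scores)
```

Here class "a" scores 0.5 × 0.2 × 0.7 = 0.07 against 0.5 × 0.6 × 0.1 = 0.03 for class "b", so "a" is predicted.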

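Although the "naive" independence assumption is almost always violated in practice, Naive Bayes often classifies well, because a correct prediction only requires the true class to receive the highest score, not that the estimated probabilities be accurate.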

Key properties

A generative model — it models how data is produced for each class, not just the decision boundary
Trains extremely fast: just counts and averages, no iterative optimization needed
Needs very little training data relative to more flexible classifiers
Naturally handles many features and many classes without modification
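To make "just counts and averages" concrete, here is a sketch of fitting a Gaussian variant, which is one common choice for continuous features and an assumption of this example rather than something fixed by the definition. The function name `fit_gaussian_nb` and the toy data are hypothetical. Training is a single pass that computes a class prior plus a per-feature mean and variance for each class.

```python
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class, per-feature mean and variance."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    n = len(y)
    model = {}
    for label, rows in rows_by_class.items():
        count = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / count for col in columns]
        # Maximum-likelihood (population) variance for each feature
        variances = [
            sum((v - m) ** 2 for v in col) / count
            for col, m in zip(columns, means)
        ]
        model[label] = {"prior": count / n, "mean": means, "var": variances}
    return model

# Toy data: two features, two classes (made-up values).
X = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0], [14.0, 4.0]]
y = [0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(model)
```

No iterative optimisation appears anywhere: each parameter is a closed-form statistic of the rows belonging to one class.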

Common mistakes

Zero-frequency problem: if a feature value never occurs with a class in the training data, its estimated likelihood for that class is exactly 0, which forces that class's entire posterior to 0 regardless of other evidence. Laplace smoothing exists specifically to avoid this
Trusting the predicted probabilities: because the independence assumption is usually false, NB's predicted probabilities tend to be overconfident (pushed toward 0 or 1) even when its classification decisions are correct
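The zero-frequency fix can be sketched as follows. The helper name `smoothed_likelihood` and the counts are hypothetical. Laplace (add-one) smoothing adds a pseudo-count `alpha` to every possible feature value, so no estimate is ever exactly 0 while the estimates for one feature still sum to 1.

```python
def smoothed_likelihood(value_count, class_count, n_values, alpha=1.0):
    """Estimate P(x_j = value | C) with additive (Laplace) smoothing.

    value_count: times this feature value was seen with class C
    class_count: number of training examples of class C
    n_values:    number of distinct values the feature can take
    alpha:       pseudo-count; alpha = 0 gives the unsmoothed estimate
    """
    return (value_count + alpha) / (class_count + alpha * n_values)

# A binary feature that was never "on" in 20 examples of some class.
unsmoothed = smoothed_likelihood(0, 20, 2, alpha=0.0)  # exactly 0
smoothed = smoothed_likelihood(0, 20, 2)               # 1 / 22
complement = smoothed_likelihood(20, 20, 2)            # 21 / 22
print(unsmoothed, smoothed, complement)
```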

Spam filter

Features: "FREE" in subject (yes/no), exclamation marks (count).

$P(\text{spam}) = 0.3$ (so $P(\text{ham}) = 0.7$), $P(\text{FREE}=1 \mid \text{spam}) = 0.6$, $P(\text{FREE}=1 \mid \text{ham}) = 0.05$.

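For an email with "FREE": posterior $\propto 0.3 \times 0.6 = 0.18$ (spam) vs $0.7 \times 0.05 = 0.035$ (ham). Normalising gives $P(\text{spam} \mid \text{FREE}=1) = 0.18 / (0.18 + 0.035) \approx 0.84$, so spam is about five times more likely than ham.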
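The arithmetic for this example can be reproduced directly. The sketch below uses only the numbers given above and the single "FREE" feature; the variable names are illustrative.

```python
# Priors and likelihoods from the spam example above.
prior = {"spam": 0.3, "ham": 0.7}
p_free = {"spam": 0.6, "ham": 0.05}

# Unnormalised posterior: prior times likelihood for each class.
unnormalised = {c: prior[c] * p_free[c] for c in prior}

# Dividing by the total turns the scores into probabilities that sum to 1.
evidence = sum(unnormalised.values())
posterior = {c: score / evidence for c, score in unnormalised.items()}

prediction = max(posterior, key=posterior.get)
print(prediction, posterior)
```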

Try it

Why is Naive Bayes called "naive"? Give an example where the independence assumption clearly fails.

Solution

It's naive because real features are almost never conditionally independent. Example: in text classification, the words "New" and "York" are strongly correlated: if "New" appears, "York" is much more likely. But Naive Bayes multiplies their probabilities as if they were independent given the class. This underestimates the probability of the two words appearing together and effectively counts the same piece of evidence twice. Despite this flaw, Naive Bayes often works well because the probabilities are only used to determine which class has the highest score, not as exact probability estimates.
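The double-counting effect can be illustrated with an extreme, hypothetical case: a second feature that is an exact copy of the "FREE" cue from the spam example. It carries no new information, yet Naive Bayes multiplies its likelihood in again. The function name `posterior_spam` is made up for this sketch.

```python
def posterior_spam(n_copies):
    """P(spam | evidence) when the same 'FREE' cue is counted n_copies times."""
    spam = 0.3 * 0.6 ** n_copies
    ham = 0.7 * 0.05 ** n_copies
    return spam / (spam + ham)

once = posterior_spam(1)   # the cue counted once
twice = posterior_spam(2)  # the same cue counted again as if it were independent
print(once, twice)
```

The predicted class does not change, but the posterior moves closer to 1 even though no new evidence was added.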

Related concepts

Statistics · Probability

Conditional Probability: The probability of an event given that another event has already occurred.

Machine Learning · Supervised Learning

Linear Discriminant Analysis: A classification method that finds the linear combination of features maximising between-class separation relative to within-class scatter.

Machine Learning · Supervised Learning

Logistic Regression: Modelling the probability of a binary outcome using the sigmoid function, fitted by maximum likelihood or gradient descent.

Machine Learning · Model Training

Model Evaluation: Confusion matrices, accuracy, precision, recall, F1 score, ROC curves, and AUC; the toolkit for measuring classifier and regressor performance.