Linear Regression

Fitting a straight line to data to model and predict the relationship between two variables.

Figure: Linear regression. Data points with the best-fit line y = 0.80x + 1.43 and their residuals (errors). The sum of squared residuals (SSR) reaches its minimum, 0.86, at the best-fit slope of 0.80; the slope shown ranges from 0 (flat) to 1.5 (steep).
Definition

Linear regression fits a straight line through a scatter of data points to model the relationship between two variables.

Given data on $x$ and $y$, we find the line $\hat{y} = a + bx$ that best summarises the pattern, where "best" means the line minimises the total squared distance between the actual $y$ values and the line's predictions.

  • $b$ is the slope: how much $y$ changes for each unit increase in $x$
  • $a$ is the intercept: the predicted value of $y$ when $x = 0$
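
For reference, the minimising line has a well-known closed form. The notation below is an assumption introduced for this note and is not used elsewhere on the page: $(x_i, y_i)$ are the $n$ data points, and $\bar{x}$, $\bar{y}$ are the sample means. Minimising the sum of squared residuals (SSR) over $a$ and $b$ gives:

```latex
\begin{align*}
\mathrm{SSR}(a, b) &= \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 \\
b &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
a &= \bar{y} - b\,\bar{x}
\end{align*}
```

The last line also shows that the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.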
Predicting exam scores from study time

Data from 8 students:

Hours studied    Exam score
1                45
2                52
3                58
4                64
5                70
6                75
7                79
8                83
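
As a rough sketch of how a line could be fitted to this table, the snippet below applies the standard least-squares formulas in plain Python: the slope is the sum of cross-deviations divided by the sum of squared deviations in x, and the intercept is the mean of y minus the slope times the mean of x. The function name `fit_line` and the variable names are illustrative choices, not part of the original material.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Data from the table: hours studied and exam scores for 8 students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [45, 52, 58, 64, 70, 75, 79, 83]

a, b = fit_line(hours, scores)
print(f"y_hat = {a:.2f} + {b:.2f} * x")
```

Running it prints the fitted intercept and slope to two decimals, which can be compared against any rounded equation quoted for this data.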

The least-squares regression line for these data is approximately $\hat{y} = 41.2 + 5.45x$.

Prediction: a student who studies 5 hours is predicted to score $41.2 + 5.45 \times 5 \approx 68.5$ points.

Try it

Using the regression equation $\hat{y} = 41.2 + 5.45x$, predict the score for a student who studies 10 hours. Should you trust this prediction? Why or why not?

Solution

$\hat{y} = 41.2 + 5.45 \times 10 = 95.7$. The prediction is about 96 points.

Be cautious: this is extrapolation, that is, predicting outside the range of the data (1–8 hours). The relationship may not stay linear beyond 8 hours; in this data the gain per extra hour already shrinks from 7 points to 4. Also, the line keeps rising without limit, so for long enough study times it would predict a score above 100, which may not be possible if the test is capped. Regression predictions are most reliable within the range of observed data.
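
The caution about extrapolation can be made mechanical. The sketch below assumes a prediction should simply be flagged whenever $x$ falls outside the observed range; it refits the line from the eight data points and returns that flag alongside each prediction. `predict_with_flag` and `fit_line` are hypothetical helper names, and flagging by range alone is a simplification, not a full reliability check.

```python
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [45, 52, 58, 64, 70, 75, 79, 83]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def predict_with_flag(x, xs, ys):
    """Return (prediction, is_extrapolation) for a new x value."""
    intercept, slope = fit_line(xs, ys)
    # Outside the observed x range the linear pattern is unverified.
    is_extrapolation = x < min(xs) or x > max(xs)
    return intercept + slope * x, is_extrapolation


for h in (5, 10):
    y_hat, flagged = predict_with_flag(h, hours, scores)
    note = "extrapolation, treat with caution" if flagged else "within observed range"
    print(f"{h} hours -> {y_hat:.1f} points ({note})")
```

A range check like this says nothing about whether the prediction is plausible (for example, whether it exceeds a capped maximum score); it only marks where the data stop supporting the line.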
