Linear Regression

Method of Least Squares

Plain JavaScript source code

Introduction

[Figure: a cloud of data points with a best fit straight line through it]
Fig.1. A linear relationship as best fit to a data set of observed values (x, y), with r_i being the residual of the i-th observed value.

Suppose you conducted an experiment that resulted in a set of measured or otherwise observed values. Your physical model led you to expect a certain type of relationship between the measured variables. However, due to errors or inaccuracies in the measurements or in the model, the observed values do not perfectly fit a relationship of the expected kind.

Regression analysis, or in the case of an expected linear relation, linear regression (fig.1.), provides a method to determine the relation that best fits the observed data. An often used form of regression is the method of least squares.

The goal of the method of least squares, in the case of linear regression, is to find the parameters a and b of the best fit line y = a x + b.

The deviation, "error", or residual r_i (fig.1.) is the difference between an observed value and the value given by the best fit relationship. In the case of linear regression we want to find a best fit line y = a x + b, so:

r_i = y_i - (a x_i + b)    [1]

Variance is a useful concept to quantify how much a set of values fluctuates about its mean. Variance is defined as:

\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

where \bar{x} represents the arithmetic mean of the set and n the number of elements in the set.
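Since the article links to plain JavaScript source code, the definition above can be sketched as a small function (the function name `variance` is my own choice, not necessarily the linked source's):

```javascript
// Variance: the mean squared deviation from the arithmetic mean.
function variance(values) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
}

// Example: the values 1..5 have mean 3, so the squared deviations
// are 4, 1, 0, 1, 4 and the variance is 10 / 5 = 2.
// variance([1, 2, 3, 4, 5]) → 2
```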

If we consider the set of residuals:

y_1 - (a x_1 + b), \quad y_2 - (a x_2 + b), \quad \ldots, \quad y_n - (a x_n + b)

For the best fit the mean of this set must be zero (this corresponds to the condition \sum r_i = 0 derived below), and therefore the variance is:

\frac{1}{n} \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2

The best fit line is the line for which this variance is minimal, and hence for which

E(a, b) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 = \sum_{i=1}^{n} r_i^2

is minimal.
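In code, the quantity E(a, b) can be sketched as follows (a minimal illustration; the function name is mine):

```javascript
// Sum of squared residuals E(a, b) for a candidate line y = a x + b.
// The least-squares line is the one that minimizes this quantity.
function sumSquaredResiduals(a, b, xs, ys) {
  let e = 0;
  for (let i = 0; i < xs.length; i++) {
    const r = ys[i] - (a * xs[i] + b); // residual r_i
    e += r * r;
  }
  return e;
}

// For points that lie exactly on y = 2x + 1:
// sumSquaredResiduals(2, 1, [0, 1, 2], [1, 3, 5]) → 0
// Any other line scores worse, e.g. shifting the intercept to 0
// gives residuals of 1, 1, 1:
// sumSquaredResiduals(2, 0, [0, 1, 2], [1, 3, 5]) → 3
```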

Finding the minimum requires that the gradient is zero, and hence that both partial derivatives with respect to parameters a and b are zero:

\frac{\partial E}{\partial a} = 0 \quad \text{and} \quad \frac{\partial E}{\partial b} = 0

Applying the chain rule to each term of the sum:

\frac{\partial E}{\partial a} = \frac{\partial E}{\partial r_1} \frac{\partial r_1}{\partial a} + \frac{\partial E}{\partial r_2} \frac{\partial r_2}{\partial a} + \ldots + \frac{\partial E}{\partial r_n} \frac{\partial r_n}{\partial a} = -2 r_1 x_1 - 2 r_2 x_2 - \ldots - 2 r_n x_n = -2 \sum_{i=1}^{n} r_i x_i = 0 \quad \Rightarrow \quad \sum_{i=1}^{n} r_i x_i = 0

and

\frac{\partial E}{\partial b} = \frac{\partial E}{\partial r_1} \frac{\partial r_1}{\partial b} + \frac{\partial E}{\partial r_2} \frac{\partial r_2}{\partial b} + \ldots + \frac{\partial E}{\partial r_n} \frac{\partial r_n}{\partial b} = -2 r_1 - 2 r_2 - \ldots - 2 r_n = -2 \sum_{i=1}^{n} r_i = 0 \quad \Rightarrow \quad \sum_{i=1}^{n} r_i = 0

Substitute [1] in both equations:

\sum_{i=1}^{n} (y_i - a x_i - b) x_i = \sum_{i=1}^{n} (x_i y_i - a x_i^2 - b x_i) = 0 \quad \text{and} \quad \sum_{i=1}^{n} (y_i - a x_i - b) = 0

This results in the so-called normal equations, in shortened notation where \sum stands for \sum_{i=1}^{n}:

\sum xy - a \sum x^2 - b \sum x = 0    [2]

and

\sum y - a \sum x - n b = 0    [3]
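Both conditions can be checked numerically. The data set below is made up for illustration; for these points the least-squares parameters work out to a = 1.9 and b = 0.9, and at those values both residual sums vanish:

```javascript
// Numeric check of the normal equations on a small example data set.
const xs = [0, 1, 2, 3];
const ys = [1, 3, 4, 7];
const a = 1.9, b = 0.9; // least-squares parameters for this data

let sumR = 0;  // Σ r_i,     equivalent to condition [3]
let sumRX = 0; // Σ r_i x_i, equivalent to condition [2]
for (let i = 0; i < xs.length; i++) {
  const r = ys[i] - (a * xs[i] + b);
  sumR += r;
  sumRX += r * xs[i];
}
// Both sums are zero up to floating-point rounding.
```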

Solving these normal equations for the parameters a and b provides us with the equation of the best fit line:

From [3]:

b = \frac{1}{n} \sum y - a \frac{1}{n} \sum x

With \bar{x} = \frac{1}{n} \sum x and \bar{y} = \frac{1}{n} \sum y being the arithmetic means:

b = \bar{y} - a \bar{x}

Substitute in [2]:

\sum xy - a \sum x^2 - (\bar{y} - a \bar{x}) \sum x = 0 \quad \Rightarrow \quad a = \frac{\bar{y} \sum x - \sum xy}{\bar{x} \sum x - \sum x^2}
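Putting the two closed-form expressions together, a complete least-squares fit can be sketched in plain JavaScript (a minimal version; the article's own linked source code may differ in details and naming):

```javascript
// Least-squares fit of the line y = a x + b to observed values,
// using the closed-form solution of the normal equations above.
// xs and ys are arrays of equal length holding the observations.
function linearRegression(xs, ys) {
  const n = xs.length;
  let sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i];
    sumY += ys[i];
    sumXY += xs[i] * ys[i];
    sumXX += xs[i] * xs[i];
  }
  const meanX = sumX / n; // x̄
  const meanY = sumY / n; // ȳ
  // a = (ȳ Σx − Σxy) / (x̄ Σx − Σx²),  b = ȳ − a x̄
  const a = (meanY * sumX - sumXY) / (meanX * sumX - sumXX);
  const b = meanY - a * meanX;
  return { a, b };
}

// Example: points that lie exactly on y = 2x + 1
// linearRegression([0, 1, 2], [1, 3, 5]) → { a: 2, b: 1 }
```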