Linear Regression

Method of Least Squares

Introduction

Suppose you conducted an experiment that produced a set of measured or otherwise observed values. For good reasons you expect a specific type of relationship between the measured variables. However, because of errors or inaccuracies in the measurements or in your physical model, the observed values do not perfectly fit a relationship of the expected kind.

Regression analysis, or linear regression when the expected relationship is linear, provides a method to determine the relationship that best fits the observed data. A frequently used form of regression is the method of least squares.

This article explains data fitting with the linear least squares method, derives the formulas needed, and provides an interactive application that finds the linear equation that best fits the data. The goal of linear least squares is to find the parameters a and b of the best fit line $y = a x + b$.

The least squares line of best fit

Fig.1. A linear relationship as best fit to a data set of observed values (x, y), with r_i being the residual of the i-th observed value.

The deviation, "error" or residual $r_i$ (Fig. 1) is the difference between an observed value and the value predicted by the best fit relationship. For linear regression we want to find a best fit line $y = a x + b$, so:

$r_{i} = y_{i} - (a x_{i} + b)$ [1]
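
As a small illustration of [1], the residuals of a candidate line can be computed directly. This is a minimal Python sketch: the function name `residuals` and the four data points are made up for the example and do not come from the article.

```python
def residuals(xs, ys, a, b):
    """Return r_i = y_i - (a * x_i + b) for every observed point."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


# Hypothetical observations, chosen only to keep the arithmetic simple.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

# Residuals with respect to the candidate line y = 1*x + 1.
print(residuals(xs, ys, a=1.0, b=1.0))  # [0.0, 0.0, 1.0, -1.0]
```

A positive residual means the observed point lies above the line, a negative one means it lies below.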

Variance is a useful concept to quantify how much a set of values fluctuates about its mean. Variance is defined as:

$\frac{1}{n} \sum_{i = 1}^{n} {(t_{i} - \overline{t})}^{2}$

Where $\overline{t}$ represents the arithmetic mean of the set of values and n the number of elements in the set.
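
The definition translates almost literally into code. The sketch below assumes the values are given as a plain Python list; the function name `variance` and the sample numbers are illustrative only. Note that it divides by n, as in the formula above, not by n - 1.

```python
def variance(values):
    """Variance as defined above: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((t - mean) ** 2 for t in values) / n


# Hypothetical sample: the mean is 5, the squared deviations sum to 32.
print(variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 4.0
```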

If we consider the set of residuals:

$y_{1} - (a x_{1} + b), y_{2} - (a x_{2} + b), \dots, y_{n} - (a x_{n} + b)$

For the best fit line the mean of the residuals is zero (this follows from normal equation [3] derived below), so the variance of this set reduces to:

$\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - (a x_{i} + b))}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {r_{i}}^{2}$

The best fit line is the line for which this variance is minimal, that is, the line for which $E(a, b) = \sum_{i = 1}^{n} {r_{i}}^{2}$ is minimal.
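
To make the minimization criterion concrete, the sum of squared residuals can be evaluated for any candidate line and compared. This is a sketch under assumptions: the helper name `sum_squared_residuals` and the data are made up, and the second candidate (a = 0.8, b = 1.5) is the line that the formulas in the Summary give for this particular data.

```python
def sum_squared_residuals(xs, ys, a, b):
    """E(a, b): sum over all points of (y_i - (a * x_i + b)) squared."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

e_guess = sum_squared_residuals(xs, ys, a=1.0, b=1.0)  # 2.0
e_best = sum_squared_residuals(xs, ys, a=0.8, b=1.5)   # about 1.8
print(e_guess, e_best, e_best < e_guess)
```

Any other choice of a and b gives a value of E that is at least as large as the one for the least squares line.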

Finding the minimum requires that the gradient is zero, and hence that both partial derivatives with respect to parameters a and b are zero:

$\frac{\partial E}{\partial a} = 0 \quad \text{and} \quad \frac{\partial E}{\partial b} = 0$

Applying the chain rule to each term of the sum, with $\frac{\partial E}{\partial r_{i}} = 2 r_{i}$ and, from [1], $\frac{\partial r_{i}}{\partial a} = -x_{i}$ and $\frac{\partial r_{i}}{\partial b} = -1$:

$\begin{array}{l} \frac{\partial E}{\partial a} = \frac{\partial E}{\partial r_{1}} \frac{\partial r_{1}}{\partial a} + \frac{\partial E}{\partial r_{2}} \frac{\partial r_{2}}{\partial a} + \dots + \frac{\partial E}{\partial r_{n}} \frac{\partial r_{n}}{\partial a} = \\ -2 r_{1} x_{1} - 2 r_{2} x_{2} - \dots - 2 r_{n} x_{n} = \\ -2 \sum_{i = 1}^{n} r_{i} x_{i} = 0 \Leftrightarrow \sum_{i = 1}^{n} r_{i} x_{i} = 0 \end{array}$

and

$\begin{array}{l} \frac{\partial E}{\partial b} = \frac{\partial E}{\partial r_{1}} \frac{\partial r_{1}}{\partial b} + \frac{\partial E}{\partial r_{2}} \frac{\partial r_{2}}{\partial b} + \dots + \frac{\partial E}{\partial r_{n}} \frac{\partial r_{n}}{\partial b} = \\ -2 r_{1} - 2 r_{2} - \dots - 2 r_{n} = \\ -2 \sum_{i = 1}^{n} r_{i} = 0 \Leftrightarrow \sum_{i = 1}^{n} r_{i} = 0 \end{array}$
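
The two resulting conditions, $\sum r_{i} x_{i} = 0$ and $\sum r_{i} = 0$, can be checked numerically for a given line. The sketch below is illustrative only: the function name `normal_conditions` and the data are assumptions, and a = 0.8, b = 1.5 is the least squares line for this made-up data according to the formulas in the Summary.

```python
def normal_conditions(xs, ys, a, b):
    """Return (sum of r_i * x_i, sum of r_i) for the line y = a*x + b."""
    rs = [y - (a * x + b) for x, y in zip(xs, ys)]
    return sum(r * x for r, x in zip(rs, xs)), sum(rs)


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

# At the least squares line both sums vanish, up to floating-point rounding.
print(normal_conditions(xs, ys, 0.8, 1.5))

# For another line at least one of them does not.
print(normal_conditions(xs, ys, 1.0, 1.0))  # (-1.0, 0.0)
```

The second call shows that satisfying only one of the two conditions is not enough: the residuals of the line y = x + 1 sum to zero, yet the line is not the best fit.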

Substitute [1] in both equations:

$\begin{array}{l} \sum_{i = 1}^{n} (y_{i} - a x_{i} - b) x_{i} = \\ \sum_{i = 1}^{n} \left( x_{i} y_{i} - a {x_{i}}^{2} - b x_{i} \right) = 0 \end{array}$

and

$\sum_{i = 1}^{n} (y_{i} - a x_{i} - b) = 0$

This results in the so-called normal equations (in shortened notation):

$\sum x y - a \sum x^{2} - b \sum x = 0$ [2]

$\sum y - a \sum x - n b = 0$ [3]
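
Equations [2] and [3] form a 2×2 linear system in a and b, so for concrete data they can also be solved numerically. The sketch below uses Cramer's rule, which is a different route from the substitution used in this article but solves the same two equations. The function name `solve_normal_equations` and the data are assumptions made for the example.

```python
def solve_normal_equations(xs, ys):
    """Solve [2] and [3] for (a, b) with Cramer's rule.

    Rearranged system:
        a * sum(x^2) + b * sum(x) = sum(x*y)
        a * sum(x)   + b * n      = sum(y)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        # Happens when all x values are equal: no unique line exists.
        raise ValueError("normal equations have no unique solution")

    a = (n * sum_xy - sum_x * sum_y) / det
    b = (sum_xx * sum_y - sum_x * sum_xy) / det
    return a, b


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
print(solve_normal_equations(xs, ys))  # (0.8, 1.5)
```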

Solving these normal equations for the parameters a and b gives the equation of the best fit line.

From [3]:

$b = \frac{1}{n} \sum y - a \frac{1}{n} \sum x$

With $\overline{x} = \frac{1}{n} \sum x$ and $\overline{y} = \frac{1}{n} \sum y$ being the arithmetic means:

$b = \overline{y} - a \overline{x}$

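Substituting this expression for b in [2]:

$\sum x y - a \sum x^{2} - (\overline{y} - a \overline{x}) \sum x = 0 \Leftrightarrow a \left( \overline{x} \sum x - \sum x^{2} \right) = \overline{y} \sum x - \sum x y$

and therefore:

$a = \frac{\overline{y} \sum x - \sum x y}{\overline{x} \sum x - \sum x^{2}}$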

Summary

Let $(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{i}, y_{i}), \dots, (x_{n}, y_{n})$ be the n observed data points. Then the best fit line through these data points is the line $y = a x + b$ with:

$a = \frac{\overline{y} \sum_{i = 1}^{n} x_{i} - \sum_{i = 1}^{n} x_{i} y_{i}}{\overline{x} \sum_{i = 1}^{n} x_{i} - \sum_{i = 1}^{n} {x_{i}}^{2}}$

$b = \overline{y} - a \overline{x}$

With $\overline{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$ and $\overline{y} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}$ being the arithmetic means.
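
The summary formulas can be turned into a short function. This is a minimal sketch rather than a reference implementation: the name `fit_line`, the input checks and the sample data are assumptions added for illustration, while the expressions for a and b follow the formulas above term by term.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a*x + b."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("at least two data points are needed")

    sum_x = sum(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    x_mean = sum_x / n
    y_mean = sum(ys) / n

    denominator = x_mean * sum_x - sum_xx
    if denominator == 0:
        # All x values are equal, so the slope is undefined.
        raise ValueError("x values must not all be equal")

    a = (y_mean * sum_x - sum_xy) / denominator
    b = y_mean - a * x_mean
    return a, b


# Hypothetical observations; the result is approximately a = 0.8, b = 1.5.
a, b = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 4.0])
print(f"y = {a:.3f} x + {b:.3f}")  # y = 0.800 x + 1.500
```

Because $b = \overline{y} - a \overline{x}$, the fitted line always passes through the point of means $(\overline{x}, \overline{y})$, which gives a quick sanity check on any result.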