# Linear Regression

Method of Least Squares

## Introduction

Suppose you conducted an experiment that resulted in a set of measured or otherwise observed values. For good reasons you expected a specific type of relationship between the measured variables. However, due to errors or inaccuracies in the measurements, or in your physical model, the observed values do not perfectly fit a relationship of the expected kind.

*Regression analysis*, or, in the case of an expected linear relation, *linear regression*,
provides a method to determine the relation that best fits the observed data.
A widely used form of regression is the *method of least squares*.

This article explains data fitting using the linear least squares method,
derives the needed formulas, and provides an interactive application that finds the linear equation that best fits the data.
The goal of linear least squares is to find the
parameters `a` and `b` of the best fit line
$y=ax+b$.

## JavaScript application
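The interactive application itself cannot be reproduced here, but the computation it performs can be sketched in plain JavaScript. This is a minimal, non-interactive sketch using the closed-form solution derived in the next section; the function name `leastSquares` is illustrative, not from the original application.

```javascript
// Least squares fit of the line y = a*x + b to observed data.
// xs and ys are equal-length arrays of observed data points.
// Uses the normal-equation solution derived below.
function leastSquares(xs, ys) {
  const n = xs.length;
  const sumX = xs.reduce((s, x) => s + x, 0);
  const sumY = ys.reduce((s, y) => s + y, 0);
  const sumXY = xs.reduce((s, x, i) => s + x * ys[i], 0);
  const sumX2 = xs.reduce((s, x) => s + x * x, 0);
  const meanX = sumX / n;
  const meanY = sumY / n;
  // a = (ȳ·Σx − Σxy) / (x̄·Σx − Σx²),  b = ȳ − a·x̄
  const a = (meanY * sumX - sumXY) / (meanX * sumX - sumX2);
  const b = meanY - a * meanX;
  return { a, b };
}

// Example: three points lying exactly on y = 2x + 1.
console.log(leastSquares([0, 1, 2], [1, 3, 5])); // { a: 2, b: 1 }
```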

## The least squares line of best fit

The deviation, "error", or *residual* ${r}_{i}$ (fig. 1)
is the difference between an observed value and the value predicted by the best fit relationship.
In case of linear regression we want to find a best fit line
$y=ax+b$ so:

$${r}_{i}={y}_{i}-\left(a{x}_{i}+b\right)\qquad \left[1\right]$$
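To make the sign convention concrete, the residual of a single data point against a candidate line can be written as a small helper (a hypothetical function, for illustration only):

```javascript
// Residual of the data point (x, y) with respect to the line y = a*x + b:
// positive when the observed y lies above the line, negative when below.
function residual(x, y, a, b) {
  return y - (a * x + b);
}

console.log(residual(2, 5, 2, 1)); // 0: the point (2, 5) lies exactly on y = 2x + 1
```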

*Variance* is a useful concept to quantify how much a set of values fluctuates about its mean.
Variance is defined as:

$$\frac{1}{n}\sum _{i=1}^{n}{\left({t}_{i}-\overline{t}\right)}^{2}$$

Where
$\overline{t}$
represents the arithmetic mean of the set of values and `n` the number of elements in the set.
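This definition translates directly into code. A minimal sketch (the function name `variance` is illustrative; this is the population variance, dividing by `n`):

```javascript
// Population variance: the mean squared deviation from the arithmetic mean.
function variance(t) {
  const n = t.length;
  const mean = t.reduce((s, v) => s + v, 0) / n;
  return t.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
}

// Mean of this set is 5; squared deviations sum to 32, so the variance is 32/8 = 4.
console.log(variance([2, 4, 4, 4, 5, 5, 7, 9])); // 4
```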

If we consider the set of residuals:

$${y}_{1}-\left(a{x}_{1}+b\right),{y}_{2}-\left(a{x}_{2}+b\right),\dots ,{y}_{n}-\left(a{x}_{n}+b\right)$$

For the best fit line, the mean of the set of residuals turns out to be exactly zero (this is one of the conditions derived below), so the variance of this set is:

$$\frac{1}{n}\sum _{i=1}^{n}{\left({y}_{i}-\left(a{x}_{i}+b\right)\right)}^{2}=\frac{1}{n}\sum _{i=1}^{n}{{r}_{i}}^{2}$$

The best fit line is the line for which this variance is minimal, and hence for which ${E}_{\left(a,b\right)}=\sum _{i=1}^{n}{{r}_{i}}^{2}$ is minimal.

Finding the minimum requires that the *gradient*
is zero, and hence that both partial derivatives with respect to parameters
`a` and `b` are zero:

$$\begin{array}{lll}\frac{\partial E}{\partial a}=0& \text{and}& \frac{\partial E}{\partial b}=0\end{array}$$

Applying the chain rule to each term of the sum, with $\frac{\partial E}{\partial {r}_{i}}=2{r}_{i}$, $\frac{\partial {r}_{i}}{\partial a}=-{x}_{i}$ and $\frac{\partial {r}_{i}}{\partial b}=-1$:

$$\begin{array}{l}\frac{\partial E}{\partial {r}_{1}}\frac{\partial {r}_{1}}{\partial a}+\frac{\partial E}{\partial {r}_{2}}\frac{\partial {r}_{2}}{\partial a}+\dots +\frac{\partial E}{\partial {r}_{n}}\frac{\partial {r}_{n}}{\partial a}=\\ -2{r}_{1}{x}_{1}-2{r}_{2}{x}_{2}-\dots -2{r}_{n}{x}_{n}=\\ -2\sum _{i=1}^{n}{r}_{i}{x}_{i}=0\iff \sum _{i=1}^{n}{r}_{i}{x}_{i}=0\end{array}$$ and $$\begin{array}{l}\frac{\partial E}{\partial {r}_{1}}\frac{\partial {r}_{1}}{\partial b}+\frac{\partial E}{\partial {r}_{2}}\frac{\partial {r}_{2}}{\partial b}+\dots +\frac{\partial E}{\partial {r}_{n}}\frac{\partial {r}_{n}}{\partial b}=\\ -2{r}_{1}-2{r}_{2}-\dots -2{r}_{n}=\\ -2\sum _{i=1}^{n}{r}_{i}=0\iff \sum _{i=1}^{n}{r}_{i}=0\end{array}$$

Substituting the residual definition [1] into both equations:

$$\begin{array}{l}\sum _{i=1}^{n}\left({y}_{i}-a{x}_{i}-b\right){x}_{i}=\\ \sum _{i=1}^{n}\left({x}_{i}{y}_{i}-a{{x}_{i}}^{2}-b{x}_{i}\right)=0\end{array}$$ and $$\sum _{i=1}^{n}\left({y}_{i}-a{x}_{i}-b\right)=0$$

Which results in the so-called *normal equations* (in shortened notation):

$$\sum xy-a\sum {x}^{2}-b\sum x=0\qquad \left[2\right]$$

$$\sum y-a\sum x-nb=0\qquad \left[3\right]$$

Solving these normal equations for the parameters `a` and `b` gives
the equation of the best fit line.

From [3]:

$$b=\frac{1}{n}\sum y-a\frac{1}{n}\sum x$$

With $\overline{x}=\frac{1}{n}\sum x$ and $\overline{y}=\frac{1}{n}\sum y$ being the arithmetic means:

$$b=\overline{y}-a\overline{x}$$

Note that this means the best fit line always passes through the point of means $\left(\overline{x},\overline{y}\right)$.

Substitute in [2]:

$$a=\frac{\overline{y}\sum x-\sum xy}{\overline{x}\sum x-\sum {x}^{2}}$$

## Summary

Let
$\left({x}_{1},{y}_{1}\right),\left({x}_{2},{y}_{2}\right),\dots ,\left({x}_{i},{y}_{i}\right),\dots ,\left({x}_{n},{y}_{n}\right)$
be the `n` observed data points. Then the best fit line through these data points is the line
$y=ax+b$ with:

$$a=\frac{\overline{y}\sum _{i=1}^{n}{x}_{i}-\sum _{i=1}^{n}{x}_{i}{y}_{i}}{\overline{x}\sum _{i=1}^{n}{x}_{i}-\sum _{i=1}^{n}{{x}_{i}}^{2}}$$ $$b=\overline{y}-a\overline{x}$$

With $\overline{x}=\frac{1}{n}\sum _{i=1}^{n}{x}_{i}$ and $\overline{y}=\frac{1}{n}\sum _{i=1}^{n}{y}_{i}$ being the arithmetic means.
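As a quick sanity check of the summary formulas, consider a hypothetical example (not from the article): the three points $(0,1)$, $(1,3)$, $(2,5)$ lie exactly on the line $y=2x+1$, so the formulas should recover $a=2$ and $b=1$:

```javascript
// Apply the summary formulas step by step to three collinear points on y = 2x + 1.
const xs = [0, 1, 2];
const ys = [1, 3, 5];
const n = xs.length;
const sumX = xs.reduce((s, x) => s + x, 0);             // 3
const sumXY = xs.reduce((s, x, i) => s + x * ys[i], 0); // 0 + 3 + 10 = 13
const sumX2 = xs.reduce((s, x) => s + x * x, 0);        // 0 + 1 + 4 = 5
const meanX = sumX / n;                                 // 1
const meanY = ys.reduce((s, y) => s + y, 0) / n;        // 3
const a = (meanY * sumX - sumXY) / (meanX * sumX - sumX2); // (9 - 13) / (3 - 5) = 2
const b = meanY - a * meanX;                               // 3 - 2*1 = 1
console.log(a, b); // 2 1
```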