Linear Regression
Method of Least Squares
Introduction
Suppose you conducted an experiment that produced a set of measured or otherwise observed values. For good reasons, you expect a specific type of relationship between the measured variables. However, because of errors or inaccuracies in the measurements or in your physical model, the observed values do not fit a relationship of the expected kind perfectly.
Regression analysis, or linear regression when the expected relationship is linear, provides a method to determine the relationship that best fits the observed data. A commonly used form of regression is the method of least squares.
This article explains data fitting with the linear least squares method, derives the required formulas, and provides an interactive application that finds the linear equation that best fits the data. The goal of linear least squares is to find the parameters a and b of the best fit line $y = ax + b$.
JavaScript application
The least squares line of best fit
Fig. 1. A linear relationship as the best fit to a data set of observed values (x, y), with $r_i$ being the residual of the i-th observed value.
The deviation, "error" or residual $r_i$ (fig. 1), is the difference between an observed value and the value provided by the best fit relationship. In the case of linear regression we want to find a best fit line $y = ax + b$, so:
$$r_i = y_i - (a x_i + b) \qquad [1]$$
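As a minimal sketch of the residual definition, assuming the line has the form y = ax + b with slope a and intercept b, the residuals of a candidate line can be computed as follows. The function name `residuals` and the sample data are illustrative assumptions and are not taken from the article's application.

```python
def residuals(xs, ys, a, b):
    """Return r_i = y_i - (a*x_i + b) for each observed point (x_i, y_i)."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


# Hypothetical observations and a hand-picked candidate line y = 2x + 0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
r = residuals(xs, ys, a=2.0, b=0.0)
print(r)  # [0.0, 0.0, -1.0, 0.0]
```

Only the third point lies off this candidate line, which is why a single residual is non-zero.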
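Variance is a useful concept to quantify how much a set of values fluctuates about its mean. For a set of values $x_1, \dots, x_n$, variance is defined as:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ represents the arithmetic mean of the set of values and n the number of elements in the set.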
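The variance definition can be sketched in a few lines. This sketch assumes the form that divides by n, the number of elements, as the text describes, rather than the sample form that divides by n - 1. The function name `variance` and the sample values are illustrative.

```python
def variance(values):
    """Mean squared deviation of the values from their arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(sample))  # 4.0
```

Python's standard library offers `statistics.pvariance` for the same divide-by-n quantity.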
If we consider the set of residuals:
$$\{r_1, r_2, \dots, r_n\}$$
The mean of the set of residuals must be zero (or at least close to zero, if the line is a good fit), and therefore the variance of this set is:
$$\sigma_r^2 = \frac{1}{n} \sum_{i=1}^{n} r_i^2$$
The best fit line is the line for which this variance is minimal. Since n is fixed, that is the line for which $\sum_{i=1}^{n} r_i^2$ is minimal.
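To make the minimization target concrete, the following sketch evaluates the sum of squared residuals for two candidate lines, assuming the line has the form y = ax + b. The data and both candidate slopes are made up for illustration; the smaller sum indicates the better fit among the two.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of r_i**2 with r_i = y_i - (a*x_i + b)."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

ssr_first = sum_squared_residuals(xs, ys, a=2.0, b=0.0)   # candidate y = 2x
ssr_second = sum_squared_residuals(xs, ys, a=1.9, b=0.0)  # candidate y = 1.9x
print(ssr_first, ssr_second)
print("second candidate fits better:", ssr_second < ssr_first)
```

Comparing candidates one by one does not scale, which is why the derivation looks for the minimum analytically.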
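Finding the minimum requires that the gradient is zero, and hence that both partial derivatives with respect to the parameters a and b are zero:
$$\frac{\partial}{\partial a} \sum_{i=1}^{n} r_i^2 = 0 \quad \text{and} \quad \frac{\partial}{\partial b} \sum_{i=1}^{n} r_i^2 = 0$$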
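Applying the chain rule to each term of the sum:
$$\sum_{i=1}^{n} 2 r_i \frac{\partial r_i}{\partial a} = 0$$
and
$$\sum_{i=1}^{n} 2 r_i \frac{\partial r_i}{\partial b} = 0$$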
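Substitute [1] in both equations, using $\partial r_i / \partial a = -x_i$ and $\partial r_i / \partial b = -1$:
$$\sum_{i=1}^{n} 2 (y_i - a x_i - b)(-x_i) = 0$$
and
$$\sum_{i=1}^{n} 2 (y_i - a x_i - b)(-1) = 0$$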
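Dividing by -2 and rearranging gives the so-called normal equations (in shortened notation, with every sum running over i = 1, ..., n):
$$a \sum x^2 + b \sum x = \sum xy \qquad [2]$$
$$a \sum x + n b = \sum y \qquad [3]$$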
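The normal equations form a 2x2 linear system in a and b. Assuming the line is y = ax + b, so that the system reads a Σx² + b Σx = Σxy and a Σx + n b = Σy, one way to solve it numerically is Cramer's rule, sketched below. The function name and the data are illustrative assumptions.

```python
def solve_normal_equations(xs, ys):
    """Solve a*Sxx + b*Sx = Sxy and a*Sx + n*b = Sy with Cramer's rule."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)

    # Determinant of the coefficient matrix [[Sxx, Sx], [Sx, n]].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are identical; no unique solution")

    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b


a, b = solve_normal_equations([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
print(a, b)  # 1.9 0.0
```

The determinant vanishes only when every x value is the same, in which case no unique best fit line of this form exists.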
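Solving these normal equations for the parameters a and b gives the equation of the best fit line.
From [3]:
$$b = \frac{\sum y - a \sum x}{n}$$
With $\bar{x} = \frac{1}{n} \sum x$ and $\bar{y} = \frac{1}{n} \sum y$ being arithmetic means:
$$b = \bar{y} - a \bar{x}$$
Substitute in [2]:
$$a \sum x^2 + (\bar{y} - a \bar{x}) \sum x = \sum xy$$
and, since $\sum x = n \bar{x}$:
$$a = \frac{\sum xy - \bar{y} \sum x}{\sum x^2 - \bar{x} \sum x} = \frac{\sum xy - n \bar{x} \bar{y}}{\sum x^2 - n \bar{x}^2}$$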
Summary
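Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the n observed data points. Then the best fit line through these data points is the line $y = ax + b$ with:
$$a = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad b = \bar{y} - a \bar{x}$$
with $\bar{x}$ and $\bar{y}$ being the arithmetic means of the $x_i$ and $y_i$.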
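The summary can be turned into a short function. This is a sketch under the assumption that the line is written y = ax + b, with slope a = (Σxy - n·x̄·ȳ) / (Σx² - n·x̄²) and intercept b = ȳ - a·x̄. The name `fit_line` and the sample points are illustrative and do not reproduce the article's JavaScript application.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denom = sum_xx - n * x_mean ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")

    a = (sum_xy - n * x_mean * y_mean) / denom  # slope
    b = y_mean - a * x_mean                     # intercept
    return a, b


# Points that lie exactly on y = 3x + 1, so the fit should recover that line.
a, b = fit_line([0.0, 1.0, 2.0], [1.0, 4.0, 7.0])
print(a, b)  # 3.0 1.0
```

For large or widely spread x values, subtracting two nearly equal sums can lose precision; summing (x - x̄)(y - ȳ) and (x - x̄)² directly is a common, algebraically equivalent alternative.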