by David Manthey (manthey@orbitals.com)

Copyright 1999-2005 by David Manthey

A PDF version of this paper is available here: LEAST.PDF

Introduction

List of Symbols

General Technique

Potential Problems

Linear Technique

Error Residuals, Ellipsoids, and Confidence

An Example Problem

A Linear Solution

A Nonlinear Solution

An Error Ellipse

References

Introduction

The method of least squares is used to solve a set of linear equations having more equations than unknown variables. Since there are more equations than variables, the solution will not be exactly correct for each equation; rather, the process minimizes the sum of the squares of the residual errors. The method is very powerful and can be applied to numerous applications.

In the general case, the least-squares method is often used to solve a set of non-linear equations that have been linearized using a first-order Taylor-series expansion. Solving non-linear equations is an iterative process using Newton’s method. The speed of convergence is dependent on the quality of an initial guess for the solution. The non-linear least-squares method is often referred to as a bundle adjustment since all of the values of an initial guess of the solution are modified together (adjusted in a bundle). This technique is also occasionally referred to as the Gauss-Newton method.

The least-squares method is not new; Legendre invented the method in 1805, and reference books have mentioned least squares in their titles as early as the 1870s. However, in most books about least squares, the general method is bound inextricably with the book's primary subject matter. There is little uniformity between different books, or even within a single book, on the designation of different variables. This makes it difficult to understand the method. This paper provides a new description of least squares which hopefully describes the process in a simple and straightforward manner. A complete derivation of the method is not provided, only a functional description. Familiarity with matrices and partial derivatives is assumed.

An example problem is provided, computing the location of a circle based on a set of points. The example shows a linear solution and an iterative nonlinear solution to this problem as well as some error analysis.

List of Symbols

a_{ij} | coefficients of a linear equation |

e | relative size of adjustment compared to initial value |

F_{i}(x_{1},x_{2},x_{3},...) | equation based
on a set of unknowns. This is not necessarily a function. The equation must be solved for the value (i.e.,
F_{i}(x_{1},x_{2},x_{3},...)=k_{i}) whose error will be minimized |

Fisher(a,v_{1},v_{2}) | the Fisher distribution |

J | the Jacobian matrix. This is the partial differentials of each equation with respect to each unknown. Sometimes written
J=∇F_{i}(x_{1},x_{2},x_{3},...), where ∇ is del, the gradient operator |

K | vector of residuals. Each component is the difference between the observation,
k_{i}, and the equation evaluated for the initial guess, x_{i0} |

k_{i} | an observation. The least-squares process minimizes the error with respect to the observations |

N | the normal matrix. N=J^{t}WJ |

Q | cofactor matrix. The inverse of W. Its entries are typically based on the standard deviations of the measurements |

Q_{xx} | the covariance matrix. Also called the variance-covariance matrix. Q_{xx}=N^{-1} |

Q'_{xx} | a sub-matrix of the covariance matrix. Used to calculate an error ellipse |

q_{xxij} | a value within the covariance matrix |

r | the degrees of freedom. This is the number of equations minus the number of unknowns |

S_{0}^{2} | the reference variance. This is used to scale the covariance matrix |

S_{axis} | length of the semi-axis of an error ellipse based on the covariance matrix |

S_{axis%} | length of the semi-axis of an error ellipse with a specific confidence level |

W | weighting matrix. This is a square symmetric matrix. For independent measurements, this is a diagonal matrix. Larger values indicate greater significance |

w_{i} | the diagonal components of the weighting matrix |

X | vector of initial guesses for the unknowns |

X' | vector of refined guesses for the unknowns. X'=X+ΔX |

x_{i} | an unknown value. This is solved for in the least-squares process |

x'_{i} | the refined guess for an unknown value. x'_{i}=x_{i0}+Δx_{i} |

x_{i0} | the initial guess for an unknown |

ΔX | vector of adjustment values. The initial guesses are adjusted by this amount |

Δx_{i} | an adjustment value. This is added to the initial guess to improve the solution |

Γ(v) | the Gamma function |

General Technique

The general least squares process can be used to solve a set of equations for a set of unknowns. The only requirement is that there are at least as many equations as there are unknowns.

If the equations are linear, the least-squares process will produce a direct solution for the unknowns. If the equations are not linear, an initial guess of the unknowns is required, and the result is an *adjustment*
to the initial parameters. This is repeated until the results converge (the adjustments become very close to zero). The linear case is an adjustment using zero as the initial guess of all parameters.

The process requires a set of equations with the unknowns on one side and some known quantity on the other. Let *x_{i}* be the set of unknowns, and let the equations be of the form

F_{i}(x_{1},x_{2},x_{3},...)=k_{i}

where *k_{i}* is the observation (value) whose least-squares error will be minimized. Since there are more equations than unknowns, the solution of the unknowns will not be exact. Using the solution to compute the equation, F_{i}(x_{1},x_{2},x_{3},...), will yield a value that differs from the observation, *k_{i}*, by a residual error.

There is typically one equation for each observation. In photogrammetry, this might be one equation for each x pixel coordinate and one equation for each y pixel coordinate. Each equation is not required to have all of the unknowns in it.

The Jacobian matrix, **J**, is the matrix of the partial differentials of each equation with respect to each unknown. That is, the entry in row *i* and column *j* is

J_{ij}=∂F_{i}/∂x_{j}

In general, the height of the Jacobian matrix will be larger than the width, since there are more equations than unknowns.

Furthermore, let the vector **K** be the vector of the residuals. A residual is the difference between the observation and the equation calculated using the initial values. That is

K_{i}=k_{i}-F_{i}(x_{10},x_{20},x_{30},...)

One further parameter is a weighting matrix, **W**. This is a matrix which includes the expected confidence of each equation and also includes any dependence of the equations. A larger
value in the weighting matrix increases the importance of the corresponding equation (larger values indicate greater confidence). It is a square symmetric matrix with one row per equation. The main diagonal contains the
weights of the individual equations, while the off-diagonal entries are the dependencies of equations upon one another. If all of the observations are independent, this will be a diagonal matrix. The cofactor matrix,
**Q**, is the inverse of the weighting matrix (i.e., **Q**=**W**^{-1}).
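One common convention for building **W** for independent measurements is inverse-variance weighting (an assumption here; the text above only states that larger weights mean greater confidence). A minimal sketch:

```python
# Build a diagonal weighting matrix from per-observation standard deviations.
# Convention assumed: w_i = 1/sigma_i^2, so W is the inverse of the cofactor
# matrix Q = diag(sigma_i^2) when the measurements are independent.

def weight_matrix(sigmas):
    n = len(sigmas)
    W = [[0.0] * n for _ in range(n)]
    for i, s in enumerate(sigmas):
        W[i][i] = 1.0 / (s * s)
    return W

# Example: two trusted observations and one noisy one.
W = weight_matrix([0.5, 0.5, 2.0])
```

With this convention the noisy third observation receives a weight of 0.25, sixteen times smaller than the others.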

Let *x*_{i0} be an initial guess for the unknowns. The initial guesses can have any finite real value, but the system will converge faster if the guesses are close to the solution. Also, let
**X** be the vector of these initial guesses. That is

**X**=[x_{10}, x_{20}, x_{30}, ...]^{t}

It is desirable to solve for the adjustment values, Δ**X**. This is the vector of adjustments for the unknowns

Δ**X**=[Δx_{1}, Δx_{2}, Δx_{3}, ...]^{t}

where, based on the initial guess, **X**, and an adjustment, Δ**X**, a set of new values is computed

**X**'=**X**+Δ**X**

To solve for Δ**X** (see the references for the reasoning behind this solution),

Δ**X**=(**J**^{t}**WJ**)^{-1}**J**^{t}**WK**

In various texts, the normal matrix, **N**, is defined as

**N**=**J**^{t}**WJ**,

and the covariance matrix (sometimes referred to as the variance-covariance matrix), **Q**_{xx}, is defined as

**Q**_{xx}=(**J**^{t}**WJ**)^{-1}=**N**^{-1}

If the weighting matrix is diagonal, then Δ**X** can be solved for by row reduction of the augmented matrix

[ **J**^{t}**WJ** | **J**^{t}**WK** ]

where each entry of the left side is a weighted sum of the form Σ_{i} w_{i}(∂F_{i}/∂x_{p})(∂F_{i}/∂x_{q}), and *w*_{1}, *w*_{2}, *w*_{3}, ... are the diagonal elements of the weighting matrix. Note that the left side of the matrix is the normal matrix, **N**.

The initial guesses, *x*_{i0}, are updated using the solution of the adjustment matrix, Δ**X**, as follows

**X**'=**X**+Δ**X**

or

*x*'_{i}=x_{i0}+Δx_{i}

The process is repeated using the new values, *x*'* _{i}*, as the initial guesses until the adjustments are close to zero.

Practically, an adjustment is close to zero when it is small compared to the absolute magnitude of the value it is adjusting, i.e., |Δx_{i}| < e·|x_{i0}|, where e is the relative size of the adjustment compared to the initial value.
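The iteration just described can be sketched as a short routine. This is a minimal illustration with **W**=**I**; the toy model at the bottom (three observations of x², one unknown) is a hypothetical stand-in, not a system from the text.

```python
# Gauss-Newton iteration: build J and K at the current guess, solve the
# normal equations (J^t J) dX = J^t K, adjust, and repeat until the
# adjustments are near zero.

def solve(M, rhs):
    # Gauss-Jordan elimination with partial pivoting for the normal system.
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * e for a, e in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def gauss_newton(F, jac, k, x_start, tol=1e-12, max_iter=50):
    x = list(x_start)
    for _ in range(max_iter):
        K = [ki - fi for ki, fi in zip(k, F(x))]    # residual vector
        J = jac(x)
        n = len(x)
        N = [[sum(J[r][i] * J[r][j] for r in range(len(J))) for j in range(n)]
             for i in range(n)]                     # normal matrix J^t J
        b = [sum(J[r][i] * K[r] for r in range(len(J))) for i in range(n)]
        dx = solve(N, b)                            # adjustment vector dX
        x = [xi + di for xi, di in zip(x, dx)]
        if max(abs(d) for d in dx) < tol:
            break
    return x

# Toy system: three observations of x^2; least squares gives x^2 = mean(k).
k = [4.0, 4.1, 3.9]
F = lambda x: [x[0] ** 2] * len(k)
jac = lambda x: [[2.0 * x[0]]] * len(k)
x = gauss_newton(F, jac, k, [1.0])                  # initial guess 1.0
```

Starting from the guess 1.0, the adjustments shrink rapidly and the solution settles at x=2, since the least-squares condition for x² is the mean of the observations.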

Potential Problems

There are conditions where the solution will not converge or will converge to undesirable values. This process finds a local minimum. As such, there may be a better solution than the one found. A solution is dependent on the equations, F_{i}(x_{1},x_{2},x_{3},...), and on the initial guesses, x_{i0}.

In any case where the solution does not converge, a solution may still be able to be obtained if different starting values, *x*_{i0}, are used.

Linear Technique

For sets of linear equations, the least-squares process will produce a direct solution for the unknowns. The linear case is mathematically the same as the general case where an adjustment is performed using zero as the initial guess of all parameters. Only a single iteration is required for convergence.

The equations must be of the form

F_{i}(x_{1},x_{2},x_{3},...)=a_{i1}x_{1}+a_{i2}x_{2}+a_{i3}x_{3}+...=k_{i}

The Jacobian matrix, **J**, is therefore

J_{ij}=∂F_{i}/∂x_{j}=a_{ij}

where *a_{ij}* is the coefficient of the *j*th unknown in the *i*th equation.

Since the initial guesses are all zero, the vector of residuals, **K**, is simply the vector of observations

K_{i}=k_{i}

If the weighting matrix is diagonal, then Δ**X** can be solved for by row reduction of the augmented matrix [ **J**^{t}**WJ** | **J**^{t}**WK** ], as in the general case.

The final solution will be the adjustment values. That is

**X**=Δ**X**

or

*x_{i}*=Δ*x_{i}*
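The direct linear solution can be illustrated with a small script. With zero initial guesses, **K** is just the vector of observations and the adjustment Δ**X** is the final answer. The system below (fitting y = m·x + c to four points) is an illustrative example, not one taken from the text.

```python
# Direct linear least squares: rows of J hold the coefficients a_ij,
# K holds the observations k_i, and X = dX = (J^t J)^-1 J^t K.

def solve(M, rhs):
    # Gauss-Jordan elimination with partial pivoting.
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * e for a, e in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
J = [[x, 1.0] for x, _ in points]           # rows of coefficients a_ij
K = [y for _, y in points]                  # observations k_i
N = [[sum(J[r][i] * J[r][j] for r in range(len(J))) for j in range(2)]
     for i in range(2)]                     # normal matrix J^t J
b = [sum(J[r][i] * K[r] for r in range(len(J))) for i in range(2)]
m, c = solve(N, b)                          # -> m = 2.0, c = 1.0
```

Because the four points are exactly collinear, the residuals vanish and the least-squares solution reproduces the line exactly.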

Error Residuals, Ellipsoids, and Confidence

The covariance matrix, **Q**_{xx}, contains the variance of each unknown and the covariance of each pair of unknowns. The quantities in **Q**_{xx} are scaled by the reference variance

S_{0}^{2}=(**V**^{t}**WV**)/r

where **V**=**J**Δ**X**-**K** is the vector of residual errors after the adjustment, and *r* is the number of degrees of freedom (i.e., the number of equations minus the number of unknowns).

For any set of quantities, an error ellipse can be calculated. The dimensions and orientations of the ellipse are calculated from the coefficients of the covariance matrix. Only the coefficients of the covariance matrix in the
relevant rows and columns are used. This is the appropriate *n* x *n* sub-matrix, where *n* is the number of dimensions for the error ellipse. The sub-matrix is symmetric.

The ellipse matrix is composed of entries from the covariance matrix. For example, a three-dimensional error ellipsoid is computed from

| q_{xx11} q_{xx12} q_{xx13} |
| q_{xx21} q_{xx22} q_{xx23} |
| q_{xx31} q_{xx32} q_{xx33} |

where *q_{xxij}* are values from the covariance matrix

The error ellipse semi-axes are given by

S_{axis}=sqrt(S_{0}^{2} λ_{i})

where λ_{i} are the eigenvalues of **Q**'_{xx}.

The orientation of the error ellipse is the column eigenvectors of **Q**'_{xx}.

To determine the error to a specific confidence level, the length of the semi-axis is multiplied by a confidence factor based on the Fisher distribution using the formula

S_{axis%}=S_{axis} sqrt(n Fisher(1-confidence,n,r))

where the *confidence* is a number from 0 to 1, with 1 being complete confidence, *n* is the number of dimensions of the ellipse, and *r* is the number of degrees of freedom. The Fisher distribution is determined from the equation

a=∫_{F}^{∞} [Γ((v_{1}+v_{2})/2)/(Γ(v_{1}/2)Γ(v_{2}/2))] (v_{1}/v_{2})^{v_{1}/2} x^{v_{1}/2-1} (1+v_{1}x/v_{2})^{-(v_{1}+v_{2})/2} dx

where Fisher(a,v_{1},v_{2}) is the value F satisfying this equation, and the Gamma function, Γ(*v*), is given by

Γ(v)=∫_{0}^{∞} t^{v-1}e^{-t} dt
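For the two-dimensional error ellipse the first parameter is v_{1}=2, and in that special case the tail integral of the Fisher density reduces to a closed form, P(X > F) = (1 + 2F/v_{2})^{-v_{2}/2}, which can be inverted directly. A sketch (the general case needs a numerical routine, e.g. `scipy.stats.f.ppf`):

```python
import math

# Closed-form Fisher critical value for v1 = 2: invert the tail probability
# alpha = (1 + 2F/v2)**(-v2/2) for F.  math.gamma provides the Gamma
# function appearing in the general density.

def fisher_v1_2(alpha, v2):
    """Critical value F with P(X > F) = alpha for an F(2, v2) distribution."""
    return (v2 / 2.0) * (alpha ** (-2.0 / v2) - 1.0)

assert math.gamma(5) == 24.0            # Gamma(v) = (v-1)! for integer v
f95 = fisher_v1_2(1 - 0.95, 79)         # the Fisher(1-0.95,2,79) used later
```

This reproduces the critical value 3.11227 quoted in the example below.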

An Example Problem

Given a set of two-dimensional points, find the circle best represented by these points.

For this example, there are 82 points; see also the figure to the right:

3,-8 | 4,-9 | 5,-9 | 6,-9 | 7,-9 | 8,-9 | 9,-9 | 10,-9 | 11,-8 | 12,-8 |

13,-8 | 14,-8 | 15,-7 | 16,-6 | 16,-5 | 17,-4 | 17,-3 | 17,-2 | 18,-1 | 18,0 |

18,1 | 18,2 | 18,3 | 18,4 | 18,5 | 18,6 | 18,7 | 18,8 | 18,9 | 17,10 |

17,11 | 16,12 | 16,13 | 15,14 | 15,15 | 14,16 | 13,17 | 12,18 | 11,18 | 10,19 |

9,20 | 8,20 | 7,20 | 6,21 | 5,21 | 4,21 | 3,22 | 2,22 | 1,22 | 0,22 |

-1,21 | -2,21 | -3,20 | -4,19 | -5,18 | -6,17 | -7,16 | -7,15 | -8,14 | -8,13 |

-8,12 | -8,11 | -8,10 | -8,9 | -8,8 | -8,7 | -8,6 | -8,5 | -8,4 | -7,3 |

-7,2 | -6,1 | -5,0 | -4,-1 | -4,-2 | -3,-3 | -2,-4 | -2,-5 | -1,-6 | 0,-7 |

1,-7 | 2,-8 |

Let the points be denoted (*x_{i}*, *y_{i}*).

A circle can be defined by the equation

(x-x_{0})^{2}+(y-y_{0})^{2}=r^{2}

where (*x*_{0},*y*_{0}) is the center of the circle and *r* is the radius of the circle.

Ideally, it is desirable to minimize the square of the distance between the points defining the circle and the circle. That is, minimize

Σ_{i} ( sqrt((x_{i}-x_{0})^{2}+(y_{i}-y_{0})^{2}) - r )^{2}

This is equivalent to performing the least-squares process using the equations

F_{i}(x_{0},y_{0},r)=sqrt((x_{i}-x_{0})^{2}+(y_{i}-y_{0})^{2})-r=0

A Linear Solution

The equation of a circle is not linear in the unknown values *x*_{0}, *y*_{0}, and *r*. The equation of a circle can be
written in the linear form

*A*(*x*^{2}+*y*^{2})+*Bx*+*Cy*=1

where

A=1/(r^{2}-x_{0}^{2}-y_{0}^{2}), B=-2x_{0}A, C=-2y_{0}A

This can be rewritten for the original unknowns as

x_{0}=-B/(2A), y_{0}=-C/(2A), r=sqrt(x_{0}^{2}+y_{0}^{2}+1/A)

Note that by using the linear form of the circle equation, the square of the distance from each point to the circle is not the value that is being minimized.

Using the equation *A*(*x*^{2}+*y*^{2})+*Bx*+*Cy*=1 with the unknowns *A*, *B*, and *C*, the Jacobian matrix has one row per point

J_{i}=[ x_{i}^{2}+y_{i}^{2}  x_{i}  y_{i} ]
Placing the numerical values of the 82 points in the Jacobian matrix gives an 82 x 3 matrix.

Since all initial guesses are zero, the residual vector, **K**, is a vector of 82 ones (each observation is k_{i}=1).

Lacking other information, the weighting matrix, **W**, is the identity matrix. That is

**W**=**I**

The unknowns, *A*, *B*, and *C*, can be solved for using the equation

**X**=(**J**^{t}**WJ**)^{-1}**J**^{t}**WK**=(**J**^{t}**J**)^{-1}**J**^{t}**K**

This reduces to the 3 x 3 normal-equation system **NX**=**J**^{t}**K** with **N**=**J**^{t}**J**. Solving numerically for the unknowns *A*, *B*, and *C*, and then for the circle parameters, gives

*x*_{0}=4.778760172, *y*_{0}=5.875467325, *r*=14.67564038

This solution is shown superimposed on the original points in the figure to the right. Note that the circle does not fit the data points very well because the solution used the linear form of the circle equations. A much better fit can be obtained using the nonlinear equations.
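The linear solution above can be reproduced with a short script using only the standard library; the point list is transcribed from the table above, and the conversion formulas are the ones given for *A*, *B*, and *C*.

```python
import math

# Linear circle fit: least squares on A*(x^2+y^2) + B*x + C*y = 1 over the
# 82 points, then x0 = -B/(2A), y0 = -C/(2A), r = sqrt(x0^2 + y0^2 + 1/A).

POINTS = [
    (3,-8),(4,-9),(5,-9),(6,-9),(7,-9),(8,-9),(9,-9),(10,-9),(11,-8),(12,-8),
    (13,-8),(14,-8),(15,-7),(16,-6),(16,-5),(17,-4),(17,-3),(17,-2),(18,-1),(18,0),
    (18,1),(18,2),(18,3),(18,4),(18,5),(18,6),(18,7),(18,8),(18,9),(17,10),
    (17,11),(16,12),(16,13),(15,14),(15,15),(14,16),(13,17),(12,18),(11,18),(10,19),
    (9,20),(8,20),(7,20),(6,21),(5,21),(4,21),(3,22),(2,22),(1,22),(0,22),
    (-1,21),(-2,21),(-3,20),(-4,19),(-5,18),(-6,17),(-7,16),(-7,15),(-8,14),(-8,13),
    (-8,12),(-8,11),(-8,10),(-8,9),(-8,8),(-8,7),(-8,6),(-8,5),(-8,4),(-7,3),
    (-7,2),(-6,1),(-5,0),(-4,-1),(-4,-2),(-3,-3),(-2,-4),(-2,-5),(-1,-6),(0,-7),
    (1,-7),(2,-8),
]

def solve(M, rhs):
    # Gauss-Jordan elimination with partial pivoting for the normal system.
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * e for a, e in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

J = [[x * x + y * y, x, y] for x, y in POINTS]   # Jacobian rows
K = [1.0] * len(POINTS)                          # every observation is 1
N = [[sum(J[r][i] * J[r][j] for r in range(len(J))) for j in range(3)]
     for i in range(3)]
b = [sum(J[r][i] * K[r] for r in range(len(J))) for i in range(3)]
A, B, C = solve(N, b)
x0, y0 = -B / (2 * A), -C / (2 * A)
r = math.sqrt(x0 * x0 + y0 * y0 + 1 / A)
```

The recovered center and radius match the values quoted above.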

A Nonlinear Solution

Instead of solving for the linear parameters, *A*, *B*, and *C*, it is more desirable to solve for the
values *x*_{0}, *y*_{0}, and *r* using an equation that minimizes the distance from the points to the circle. The following equation is used

F_{i}(x_{0},y_{0},r)=sqrt((x_{i}-x_{0})^{2}+(y_{i}-y_{0})^{2})-r=0

The Jacobian matrix has one row per point

J_{i}=[ ∂F_{i}/∂x_{0}  ∂F_{i}/∂y_{0}  ∂F_{i}/∂r ]=[ -(x_{i}-x_{0})/R_{i}  -(y_{i}-y_{0})/R_{i}  -1 ]

where

R_{i}=sqrt((x_{i}-x_{0})^{2}+(y_{i}-y_{0})^{2})

The solution of *x*_{0}, *y*_{0}, and *r* will require multiple iterations. A starting guess for these values is required. The starting guess can be arbitrary provided the equations are valid. A starting guess of

*x*_{0}=0, *y*_{0}=0, *r*=15

will be used.

Placing the numerical values of the 82 points and the starting guess in the Jacobian matrix gives an 82 x 3 matrix.

The residual vector at the starting guess is K_{i}=k_{i}-F_{i}=r-R_{i}, the difference between the guessed radius and each point's distance from the guessed center.

Again, lacking other information, the weighting matrix, **W**, is the identity matrix.

The adjustments of the unknowns, Δ*x*_{0}, Δ*y*_{0}, and Δ*r*, can be solved for using the equation

Δ**X**=(**J**^{t}**WJ**)^{-1}**J**^{t}**WK**=(**J**^{t}**J**)^{-1}**J**^{t}**K**

Solving numerically for the adjustment values gives Δ*x*_{0}=6.134768609, Δ*y*_{0}=6.649105121, and Δ*r*=-2.364891088.

The circle parameters are then adjusted to

*x*_{0}'=*x*_{0}+Δ*x*_{0}=0+6.134768609=6.134768609
*y*_{0}'=*y*_{0}+Δ*y*_{0}=0+6.649105121=6.649105121
*r*'=*r*+Δ*r*=15-2.364891088=12.63510891

This new solution is now used to compute a new Jacobian matrix and a new residual vector, which are then used to again solve for Δ**X**. The next iteration gives Δ*x*_{0}=-1.033761936, Δ*y*_{0}=-0.4464161055, and Δ*r*=1.584613997.

The circle parameters are now adjusted to

*x*_{0}'=*x*_{0}+Δ*x*_{0}=6.134768609-1.033761936=5.101006672
*y*_{0}'=*y*_{0}+Δ*y*_{0}=6.649105121-0.4464161055=6.202689015
*r*'=*r*+Δ*r*=12.63510891+1.584613997=14.21972290

Additional iterations are performed until the adjustment values become close to zero. When performing computations with ten significant figures, after nine iterations, the values have converged to

*x*_{0}=5.155701836, *y*_{0}=6.233137797, *r*=14.24203182

This solution is shown superimposed on the original points in the figure to the right. Note that the circle fits the data points significantly better than the linear solution.
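The full iteration can likewise be reproduced in a few dozen lines of standard-library Python, starting from the guess (0, 0, 15) used above.

```python
import math

# Nonlinear (Gauss-Newton) circle fit on the 82 points: rows of J are
# [-(x-x0)/R, -(y-y0)/R, -1] and K_i = r - R_i since every observation is 0.

POINTS = [
    (3,-8),(4,-9),(5,-9),(6,-9),(7,-9),(8,-9),(9,-9),(10,-9),(11,-8),(12,-8),
    (13,-8),(14,-8),(15,-7),(16,-6),(16,-5),(17,-4),(17,-3),(17,-2),(18,-1),(18,0),
    (18,1),(18,2),(18,3),(18,4),(18,5),(18,6),(18,7),(18,8),(18,9),(17,10),
    (17,11),(16,12),(16,13),(15,14),(15,15),(14,16),(13,17),(12,18),(11,18),(10,19),
    (9,20),(8,20),(7,20),(6,21),(5,21),(4,21),(3,22),(2,22),(1,22),(0,22),
    (-1,21),(-2,21),(-3,20),(-4,19),(-5,18),(-6,17),(-7,16),(-7,15),(-8,14),(-8,13),
    (-8,12),(-8,11),(-8,10),(-8,9),(-8,8),(-8,7),(-8,6),(-8,5),(-8,4),(-7,3),
    (-7,2),(-6,1),(-5,0),(-4,-1),(-4,-2),(-3,-3),(-2,-4),(-2,-5),(-1,-6),(0,-7),
    (1,-7),(2,-8),
]

def solve(M, rhs):
    # Gauss-Jordan elimination with partial pivoting for the normal system.
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * e for a, e in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

x0, y0, rad = 0.0, 0.0, 15.0            # starting guess from the text
for _ in range(50):
    J, K = [], []
    for px, py in POINTS:
        R = math.hypot(px - x0, py - y0)
        J.append([-(px - x0) / R, -(py - y0) / R, -1.0])
        K.append(rad - R)               # k_i = 0, so K_i = -F_i = r - R_i
    N = [[sum(J[q][i] * J[q][j] for q in range(len(J))) for j in range(3)]
         for i in range(3)]
    b = [sum(J[q][i] * K[q] for q in range(len(J))) for i in range(3)]
    dx, dy, dr = solve(N, b)
    x0, y0, rad = x0 + dx, y0 + dy, rad + dr
    if max(abs(dx), abs(dy), abs(dr)) < 1e-12:
        break
```

The loop converges to the values quoted above; the first iteration reproduces the adjustments shown in the text.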

An Error Ellipse

To determine the expected error of the nonlinear solution of the circle, the error ellipse can be calculated. To illustrate this, the error ellipse of the circle’s center will be computed.

The error ellipse is computed using the covariance matrix, **Q**_{xx}=**N**^{-1}, scaled by the reference variance. Since only the error ellipse for the center of the circle, (*x*_{0},*y*_{0}), is desired, the matrix of interest is the 2 x 2 sub-matrix

| q_{xx11} q_{xx12} |
| q_{xx21} q_{xx22} |

The matrix **Q**'_{xx} is composed of the entries in the first two columns and rows of the matrix **Q**_{xx} since *x*_{0}
corresponds to the first column and *y*_{0} to the second column of the Jacobian matrix.

The eigenvalues of **Q**'_{xx} determine the relative sizes of the ellipse semi-axes, and the eigenvectors specify the orientation of the ellipse.

The reference variance is computed based on the residual matrix and on the degrees of freedom. The number of degrees of freedom is the number of observations (82 observations – the number of points used to compute the circle)
minus the number of unknowns (3 unknowns - *x*_{0}, *y*_{0}, and *r*), giving *r*=82-3=79.

The error ellipse can be computed for any confidence level. The Fisher distribution needs to be computed for the selected confidence. Although the reference variance was computed based on the total number of unknowns, the Fisher distribution is computed only for the number of unknowns in the ellipse. For 95% confidence

Fisher(1-0.95,2,79)=3.11227

Based on this information, the semi-axes of the 95% confidence error ellipse are obtained by scaling the standard semi-axes by sqrt(2 x 3.11227) ≈ 2.495.

The 95% confidence error ellipse is shown on the same graph as the circle, see figure to the right. The error ellipse has been magnified by a factor of 5 to make it more visible.

The meaning of the 95% confidence error ellipse is that if more points are added to improve the calculation of the circle, there is 95% probability that the center of the circle will remain within the area specified by the ellipse. As expected, as more points are used to compute the circle, the area of the ellipse becomes smaller. If the points lie closer to the actual circle, this will also reduce the ellipse area.
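The ellipse computation itself can be sketched compactly for the n=2 case: closed-form eigenvalues of the symmetric 2x2 sub-matrix, scaled by the reference variance and the Fisher factor. The covariance values below are hypothetical placeholders, not the document's numbers.

```python
import math

# Semi-axes of a 2-D error ellipse from a 2x2 covariance sub-matrix:
# eigenvalues of Q'_xx, scaled by S0^2 and, for a confidence region, by
# sqrt(2 * Fisher(1-confidence, 2, r)).

def ellipse_axes(q11, q12, q22, s0_sq=1.0, fisher=None):
    mean = (q11 + q22) / 2.0
    half = math.hypot((q11 - q22) / 2.0, q12)
    eigenvalues = [mean + half, mean - half]       # largest first
    scale = math.sqrt(2.0 * fisher) if fisher is not None else 1.0
    return [scale * math.sqrt(s0_sq * lam) for lam in eigenvalues]

axes = ellipse_axes(4.0, 0.0, 1.0)                 # -> [2.0, 1.0]
axes95 = ellipse_axes(4.0, 0.0, 1.0, fisher=3.11227)   # 95% level, r = 79
```

For a diagonal covariance matrix the eigenvalues are just the diagonal entries, so the unscaled semi-axes are their square roots.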

References

[1] Nash, Stephen G. and Sofer, Ariela, *Linear and Nonlinear Programming*, McGraw-Hill Companies, Inc, New York, 1996. ISBN 0-07-046065-5.

[2] Wolf, Paul R. and Ghilani, Charles D., *Adjustment Computations*, John Wiley and Sons, Inc, New York, 1997. ISBN 0-471-16833-5.