# Regression Analysis and Least Squares

March 29, 2018

In non-mathematical terminology, *regression analysis* involves fitting smooth curves to scattered data. Around 1800, determining the “best” method for regression analysis of the planets’ orbits was a major motivating factor for the development of the Normal Distribution [1], the Central Limit Theorem [2], and the method of Least Squares [5]. Regression analysis is still used today for important data such as global surface temperatures (e.g., Figure 3.17).

In its simplest form, linear regression fits a straight line to a series of data points (*x _{n}*, *y _{n}*), *n* = 1, …, *N*, as illustrated in Figure 3.17. The equation for the line is *y* = *ax* + *b*, where *a* and *b* are to be determined. The error, or *residual*, is the difference between the line and the data at each *x _{n}*: *e _{n}* = *y _{n}* – *ax _{n}* – *b*.

### The Least Squares Method

The method of *Least Squares* is used to determine the “best” values for *a* and *b*. This is achieved by summing the squares of the residuals, *e _{n}*, and finding the values of *a* and *b* that minimize the sum. Using calculus, this involves setting the derivatives of the sum with respect to *a* and *b* to zero, and then solving the two resulting equations for *a* and *b*. Equation 19 gives the resulting formulas:

$$a = \frac{N \sum_{n=1}^{N} x_n y_n - \sum_{n=1}^{N} x_n \sum_{n=1}^{N} y_n}{N \sum_{n=1}^{N} x_n^2 - \left( \sum_{n=1}^{N} x_n \right)^2} \tag{1}$$

$$b = \frac{1}{N} \left( \sum_{n=1}^{N} y_n - a \sum_{n=1}^{N} x_n \right) \tag{2}$$

Equation 19
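As a sketch of Equation 19, the slope and intercept can be computed directly from the four sums. The data values and variable names (`Sx`, `Sxy`, etc.) below are illustrative, not from the text:

```python
# Closed-form least-squares slope and intercept (Equation 19).
# Data points are illustrative: roughly y = 2x + 1 with small errors.
N = 5
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

Sx = sum(x)
Sy = sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# Setting d/da and d/db of sum(e_n^2) to zero and solving gives:
a = (N * Sxy - Sx * Sy) / (N * Sxx - Sx ** 2)
b = (Sy - a * Sx) / N
```

For this synthetic data the fit gives a slope near 2 and an intercept near 1, as expected.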

The confidence interval for the curve fit can be obtained by assuming the errors are Normally distributed. The variance of the residuals is given by Equation 20, where *N* – 2 is used in the denominator because the two equations for *a* and *b* have reduced the number of degrees of freedom by two.

$$\sigma_e^2 = \frac{1}{N-2} \sum_{n=1}^{N} e_n^2 \tag{3}$$

Equation 20

The variance of the determination of the *y _{n}* data, *σ _{yn}*^{2}, must now be added. It is often assumed that the error bound specified for a measurement is equal to 2*σ _{yn}*, based on a 95% confidence level. Then the standard deviation for the line values, *σ _{y}*, is:

$$\sigma_y = \sqrt{\sigma_e^2 + \sigma_{yn}^2} \tag{4}$$

Equation 21
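A minimal sketch of Equations 20 and 21, reusing the illustrative five-point data set and its fitted slope and intercept (the `sigma_yn` value below is an assumed measurement standard deviation, not from the text):

```python
import math

# Equation 20: residual variance with N - 2 degrees of freedom.
# Equation 21: combined standard deviation for the fitted line.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]
a, b = 1.96, 1.1          # slope and intercept from the least-squares fit
sigma_yn = 0.05           # assumed measurement std dev (error bound / 2)

e = [yi - a * xi - b for xi, yi in zip(x, y)]          # residuals e_n
var_e = sum(ei ** 2 for ei in e) / (len(x) - 2)        # Equation 20
sigma_y = math.sqrt(var_e + sigma_yn ** 2)             # Equation 21
ci95 = 2 * sigma_y                                     # ~95% confidence interval
```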

A common measure of the “goodness of fit” for the regression analysis is the *R-square* coefficient, given by Equation 22:

$$R^2 = 1 - \frac{\sum_{n=1}^{N} e_n^2}{\sum_{n=1}^{N} \left( y_n - \bar{y} \right)^2} \tag{5}$$

Equation 22
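Equation 22 can be sketched in a few lines, again using the illustrative data and fit from above:

```python
# R-square (Equation 22): one minus the residual sum of squares
# divided by the spread of y about its mean. Data are illustrative.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]
a, b = 1.96, 1.1  # least-squares fit

ybar = sum(y) / len(y)
ss_res = sum((yi - a * xi - b) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - ybar) ** 2 for yi in y)
r_square = 1.0 - ss_res / ss_tot
```

Because the synthetic data lie very close to the fitted line, R-square here is close to 1; noisier data would pull it down toward 0.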

The square root of *R*^{2} is often called the *correlation coefficient*, *R*. In the example of the global surface temperatures in Figure 3.17, *σ _{e}* = 0.133 ºC. The error bound for each temperature measurement is given as ± 0.1 ºC, so it is assumed that *σ _{yn}* = 0.050 ºC. Then *σ _{y}* = 0.14 ºC and the 95% confidence interval is ± 0.28 ºC. The calculated value for *R*-square is *R*^{2} = 0.83.
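The quoted confidence interval follows directly from Equation 21; a quick arithmetic check using the values stated in the text:

```python
import math

# Temperature example: sigma_e = 0.133 C from the fit, sigma_yn = 0.050 C
# assumed from the +/- 0.1 C error bound taken as 2*sigma.
sigma_e = 0.133
sigma_yn = 0.050

sigma_y = math.sqrt(sigma_e ** 2 + sigma_yn ** 2)  # Equation 21
ci95 = 2 * sigma_y                                 # ~95% confidence interval
```

Rounding to two decimals reproduces the text's *σ _{y}* = 0.14 ºC and ± 0.28 ºC interval.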

### Using Log Values for Linear Regression

In many cases, data do not fit a straight line well when plotted on linear scales. Rather than fitting a non-linear curve, it is often convenient to use log values in the linear regression: for example, log(*y*) vs. *x*; *y* vs. log(*x*); or log(*y*) vs. log(*x*).

Consider the following example: the damping in a structure causes the vibration to decay (theoretically) at the rate A(*t*) = A_{o} e^{–2π*fζt*} after an impact, where *f* is the frequency in Hz, *t* is the time in seconds, and *ζ* is the critical damping ratio. To determine the damping value *ζ*, the RMS acceleration level is measured versus time after the structure is impacted, as shown in Figure 3.18a for *f* = 200 Hz. The theoretical decay will be a straight line when plotted as log(A) = log(A_{o}) – [2*πfζ* log(e)]*t*, shown in Figure 3.18b. In this example, the linear regression gives a slope of –8.28, so *ζ* = 8.28 / [2*πf* log(e)] = 0.015.
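The damping extraction above can be sketched end-to-end. The decay data here are synthetic, generated with an assumed *ζ* = 0.015 and *f* = 200 Hz so that the fit should recover the input value; the measured data in Figure 3.18 are not reproduced:

```python
import math

# Synthetic exponential decay A(t) = A0 * exp(-2*pi*f*zeta*t),
# generated with an assumed zeta; the fit should recover it.
f = 200.0
zeta_true = 0.015
A0 = 10.0

t = [i * 0.01 for i in range(20)]  # 0 to 0.19 s
A = [A0 * math.exp(-2 * math.pi * f * zeta_true * ti) for ti in t]

# Plot log10(A) vs t: the decay becomes a straight line, so the
# least-squares slope formula (Equation 19) applies directly.
logA = [math.log10(Ai) for Ai in A]
N = len(t)
Sx, Sy = sum(t), sum(logA)
Sxx = sum(ti * ti for ti in t)
Sxy = sum(ti * yi for ti, yi in zip(t, logA))
slope = (N * Sxy - Sx * Sy) / (N * Sxx - Sx ** 2)

# slope = -2*pi*f*zeta*log10(e), so solve for zeta:
zeta = -slope / (2 * math.pi * f * math.log10(math.e))
```

With noise-free synthetic data the fit returns *ζ* = 0.015 exactly; real RMS measurements would scatter about the line, which is where the confidence interval of Equation 21 becomes useful.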