# Regression Analysis and Least Squares

March 29, 2018

In non-mathematical terminology, regression analysis involves fitting smooth curves to scattered data. Around 1800, determining the “best” method for regression analysis of the planets’ orbits was a major motivating factor for the development of the normal distribution [1], the central limit theorem [2], and the method of least squares [5]. Regression analysis is still used today for important data such as global surface temperatures (Figure 3.17).

In its simplest form, linear regression fits a straight line to a series of data points (*x_{n}*, *y_{n}*), *n* = 1, …, *N*, as illustrated in Figure 3.17. The equation for the line is *y* = *ax* + *b*, where *a* and *b* are to be determined. The error, or residual, is the difference between the line and the data points at each *x_{n}*:

*e_{n}* = *y_{n}* – *ax_{n}* – *b*.

### The Least Squares Method

The method of least squares is used to determine the “best” values for *a* and *b*. This is achieved by summing the squares of the residuals, *e_{n}*, and finding the values of *a* and *b* that minimize the sum. Using calculus, this involves setting the derivatives of the sum with respect to *a* and *b* to zero and then solving the two resulting equations for *a* and *b*. Equation 19 is the resulting formula:

$$
a = \frac{N\sum_{n=1}^{N} x_n y_n - \sum_{n=1}^{N} x_n \sum_{n=1}^{N} y_n}{N\sum_{n=1}^{N} x_n^2 - \left(\sum_{n=1}^{N} x_n\right)^2},
\qquad
b = \frac{1}{N}\left(\sum_{n=1}^{N} y_n - a\sum_{n=1}^{N} x_n\right)
\tag{19}
$$
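The closed-form least-squares solution can be sketched in a few lines of plain Python. The sample data here are illustrative only (roughly *y* = 2*x* + 1 with noise), not taken from the figure:

```python
# Least-squares fit of y = a*x + b using the closed-form sums (sketch).

def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # illustrative data, roughly y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)  # close to slope 2, intercept 1
```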

The confidence interval for the curve fit can be obtained by assuming the errors are normally distributed. The variance of the residuals is given in Equation 20, where *N* – 2 is used in the denominator because the two equations for *a* and *b* have reduced the number of degrees of freedom by two:

$$
\sigma_e^2 = \frac{1}{N-2}\sum_{n=1}^{N} e_n^2
\tag{20}
$$

The variance of the determination of the *y_{n}* data, *σ_{yn}*^{2}, must now be added. It is often assumed that the error bound specified for a measurement is equal to 2*σ*, based on a 95% confidence level. Then, the standard deviation for the line values, *σ_{y}*, is given by Equation 21:

$$
\sigma_y = \sqrt{\sigma_e^2 + \sigma_{y_n}^2}
\tag{21}
$$
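These two steps can be sketched numerically as follows. The combination in quadrature reproduces the temperature example discussed below (*σ_{e}* = 0.133 ºC, *σ_{yn}* = 0.050 ºC); the residual-variance helper is shown with illustrative inputs:

```python
import math

# Residual standard deviation with N - 2 degrees of freedom (Equation 20),
# and the combined line standard deviation (Equation 21). Sketch only.

def residual_sigma(xs, ys, a, b):
    errs = [y - a * x - b for x, y in zip(xs, ys)]
    n = len(xs)
    return math.sqrt(sum(e * e for e in errs) / (n - 2))

def line_sigma(sigma_e, sigma_yn):
    # Fit residual and measurement uncertainty combined in quadrature.
    return math.sqrt(sigma_e ** 2 + sigma_yn ** 2)

sigma_y = line_sigma(0.133, 0.050)   # values from the temperature example
print(round(sigma_y, 2), "95% CI: +/-", round(2 * sigma_y, 2))
```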

A common measure of the “goodness of fit” for the regression analysis is the R-squared coefficient given in Equation 22:

$$
R^2 = 1 - \frac{\sum_{n=1}^{N} e_n^2}{\sum_{n=1}^{N} \left(y_n - \bar{y}\right)^2}
\tag{22}
$$

where $\bar{y}$ is the mean of the *y_{n}* data.

The square root of R^{2} is often called the correlation coefficient. In the example of the global surface temperatures in Figure 3.17, *σ_{e}* = 0.133 ºC. The error bound for each temperature measurement is given as ±0.1 ºC, so it is assumed that *σ_{yn}* = 0.050 ºC. Then, *σ_{y}* = 0.14 ºC and the 95% confidence interval is ±0.28 ºC. The calculated value for R-squared is 0.83.
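The R-squared computation can be sketched as the ratio of residual to total variation. The data below are illustrative (roughly *y* = 2*x* + 1 with noise), not the temperature series from the figure:

```python
# R-squared goodness of fit (sketch): 1 - SS_residual / SS_total.

def r_squared(xs, ys, a, b):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]       # illustrative data
print(r_squared(xs, ys, 1.99, 1.04))  # close to 1 for a tight fit
```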

### Using Log Values for Linear Regression

In many cases, data do not fit a straight line well on linear scales. Rather than fitting a non-linear curve, it is often convenient to use logarithmic values in the linear regression: for example, log(*y*) vs. *x*; *y* vs. log(*x*); or log(*y*) vs. log(*x*).

Consider the following: the damping in a structure causes the vibration to decay after an impact at the theoretical rate A(*t*) = A_{o} e^{–2π*f ζ t*}, where *f* is the frequency in Hz, *t* is the time in seconds, and *ζ* is the critical damping ratio. To determine the damping value, *ζ*, the RMS acceleration level is measured versus time after the structure is impacted (see Figure 3.18a for *f* = 200 Hz). The theoretical decay will be a straight line when plotted as log(A) = log(A_{o}) – [2*π f ζ* log(e)]*t*, as shown in Figure 3.18b. In this example, the linear regression gives a slope of –8.28, so *ζ* = 8.28 / [2*π f* log(e)] = 0.015.
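The final step of the damping example can be sketched directly; the frequency and fitted slope are taken from the example above:

```python
import math

# Damping ratio from the slope of log10(A) vs. t (sketch of the
# worked example: f = 200 Hz, fitted slope = -8.28).

f = 200.0      # vibration frequency, Hz
slope = -8.28  # slope of log10(A) vs. t from the linear regression
zeta = -slope / (2 * math.pi * f * math.log10(math.e))
print(round(zeta, 3))  # critical damping ratio
```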