Regression Analysis and Least Squares

March 29, 2018

In non-mathematical terms, Regression Analysis involves fitting smooth curves to scattered data. Around 1800, determining the “best” method to do this for the orbits of the planets and comets was a major motivating factor for the development of the Normal Distribution [1], the Central Limit Theorem [2], and the method of Least Squares [5]. And regression analysis is still used today for important data (e.g., Fig. 18).

[Figure 18. Linear regression of global surface temperature data]

Linear regression in its simplest form fits a straight line to a series of data points (xn, yn), n = 1, …, N, as illustrated in Fig. 18. The equation for the line is y = a x + b, where a and b are to be determined. The error, or residual, is the difference between the line and the data points at each xn:  en = yn – a xn – b. The method of Least Squares determines the “best” values for a and b by summing the squares of the residuals, en, and finding the values of a and b that minimize this sum. (Using calculus, this involves setting to zero the derivatives of the sum with respect to a and b, and solving the resulting two equations for a and b.) The resulting formulas are

$$a = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x} \tag{19}$$

where $\bar{x}$ and $\bar{y}$ are the mean values of the xn and yn data.
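The closed-form slope and intercept can be sketched directly from the formulas above. This is an illustrative Python snippet (not from the source); the function name `least_squares_line` is a hypothetical label for the calculation.

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form least-squares fit of y = a*x + b.

    Illustrative sketch of the standard formulas: the slope a is the
    sum of (x - x_bar)(y - y_bar) over the sum of (x - x_bar)^2, and
    the intercept is b = y_bar - a * x_bar.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - a * x_bar
    return a, b

# Noise-free data on an exact line should return that line:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
a, b = least_squares_line(x, 2.0 * x + 1.0)
print(a, b)  # → 2.0 1.0
```

For noise-free data the fit recovers the exact line; with scattered data it returns the slope and intercept that minimize the sum of squared residuals.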

The confidence interval for this curve fit can be obtained by assuming the errors are Normally distributed. The variance of the residuals is given by

$$\sigma_e^2 = \frac{1}{N-2} \sum_{n=1}^{N} e_n^2 \tag{20}$$

where  N – 2  is used in the denominator since the two equations for a and b have reduced the number of degrees of freedom by two. To this must be added the variance of the determination of the yn data, σyn2. (It is often assumed that the error bound specified for a measurement is equal to 2 σyn, based on a 95% confidence level.) Then the standard deviation for the line values, y, is

$$\sigma_y = \sqrt{\sigma_e^2 + \sigma_{y_n}^2} \tag{21}$$

A common measure for the “goodness of fit” for the regression analysis is the R-square coefficient given by

$$R^2 = 1 - \frac{\sum_{n=1}^{N} e_n^2}{\sum_{n=1}^{N} (y_n - \bar{y})^2} \tag{22}$$

The square root of R2 is often called the correlation coefficient.  In the example of the global surface temperatures in Fig. 18, σe = 0.133 ºC. The error bound for each temperature measurement is given as ± 0.1 ºC, so it is assumed that σyn = 0.050 ºC. Then σy = 0.14 ºC and the 95% confidence interval is ± 0.28 ºC. The calculated value for R-square is R2 = 0.83.
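These fit diagnostics can be sketched in Python. This is an illustrative example on synthetic data, not the temperature data of Fig. 18; the function name `fit_diagnostics` and the test values are assumptions for the illustration.

```python
import numpy as np

def fit_diagnostics(x, y, sigma_yn=0.0):
    """Residual std (N-2 d.o.f.), line std, and R-square for a line fit.

    sigma_yn is the assumed measurement std of the y data, e.g. half of
    a quoted 2-sigma error bound.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x)
    a, b = np.polyfit(x, y, 1)               # least-squares slope and intercept
    e = y - (a * x + b)                      # residuals
    var_e = np.sum(e**2) / (N - 2)           # N-2 degrees of freedom
    sigma_y = np.sqrt(var_e + sigma_yn**2)   # add the measurement variance
    r2 = 1.0 - np.sum(e**2) / np.sum((y - y.mean())**2)
    return a, b, np.sqrt(var_e), sigma_y, r2

# Usage on noise-free synthetic data (sigma_e ≈ 0, so sigma_y ≈ sigma_yn):
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0
a, b, sigma_e, sigma_y, r2 = fit_diagnostics(x, y, sigma_yn=0.05)
```

The 95% confidence interval is then approximately ± 2 σy, as in the worked example above.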

In many cases the data do not fit a straight line very well when using linear scales. Rather than trying non-linear curve fits in this format, it is often convenient to use log-values for the linear regression: for example, log(y) vs. x;  y vs. log(x);  or log(y) vs. log(x). As an illustration, the damping in a structure causes the vibration to decay (theoretically) at the rate A(t) = Ao e-2π f ζ t after an impact, where f  is the frequency in Hz,  t  is the time in seconds, and ζ  is the critical damping ratio. So to determine the damping value ζ, the rms acceleration level is measured vs. time after the structure is impacted, as shown in Fig. 19a for f = 200 Hz. The theoretical decay will be a straight line when plotted as log(A) = log(Ao) – [2π f ζ log(e)] t, shown in Fig. 19b. In this example the linear regression gives the slope as –8.28, so ζ = 8.28 / [2π f log(e)] = 0.015.
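The log-linear damping fit can be sketched as follows. This example uses a synthetic, noise-free decay (not the measured data of Fig. 19); the frequency and damping values are taken from the example above, and `numpy.polyfit` stands in for the linear regression.

```python
import numpy as np

# Synthesize a theoretical decay A(t) = Ao * exp(-2*pi*f*zeta*t)
f = 200.0                # frequency in Hz (from the example)
zeta_true = 0.015        # damping ratio used to generate the decay
t = np.linspace(0.0, 0.5, 200)
A = 1.0 * np.exp(-2.0 * np.pi * f * zeta_true * t)

# Fit a straight line to log10(A) vs. t; the slope is -2*pi*f*zeta*log10(e)
slope, intercept = np.polyfit(t, np.log10(A), 1)
zeta = -slope / (2.0 * np.pi * f * np.log10(np.e))
print(round(zeta, 4))  # → 0.015
```

With measured (noisy) data the same regression gives the slope, and hence ζ, directly from the log-scaled decay.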


Figure 19a.


Figure 19b. Regression Analysis of Transient Vibration Decay