Regression Analysis and Least Squares

March 29, 2018

In non-mathematical terminology, regression analysis involves fitting smooth curves to scattered data. Around 1800, determining the “best” method for regression analysis of the planets’ orbits was a major motivating factor for the development of the normal distribution [1], the central limit theorem [2], and the method of least squares [5]. Regression analysis is still used today for important data such as global surface temperatures (Figure 3.17).


Figure 3.17. Linear regression of global surface temperatures.

In its simplest form, linear regression fits a straight line to a series of data points (x_n, y_n), n = 1, …, N, as illustrated in Figure 3.17. The equation for the line is y = ax + b, where a and b are to be determined. The error, or residual, is the difference between each data point and the line at x_n: ε_n = y_n − a x_n − b.

The Least Squares Method

The method of least squares is used to determine the “best” values for a and b. This is achieved by summing the squares of the residuals, ε_n, and finding the values of a and b that minimize the sum. Using calculus, this involves setting the partial derivatives of the sum with respect to a and b to zero and then solving the resulting two equations for a and b. Equation 19 is the resulting formula:

(1)   \begin{equation*} a=\frac{\sum_{n=1}^{N}(x_{n}-\bar{x})(y_{n}-\bar{y})}{\sum_{n=1}^{N}(x_{n}-\bar{x})^2}; \quad b=\bar{y}-a\bar{x} \end{equation*}

(2)   \begin{equation*} \text{where the mean values are }\bar{x}=\frac{1}{N}\sum_{n=1}^{N}x_{n}\text{ and }\bar{y}=\frac{1}{N}\sum_{n=1}^{N}y_{n} \end{equation*}

Equation 19
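For readers who want the intermediate step, the calculus behind Equation 19 can be sketched as follows. The symbol S for the sum of squared residuals is introduced here only for illustration and does not appear in the original text.

\begin{align*}
S(a,b) &= \sum_{n=1}^{N}\varepsilon_{n}^2 = \sum_{n=1}^{N}(y_{n}-ax_{n}-b)^2 \\
\frac{\partial S}{\partial b} &= -2\sum_{n=1}^{N}(y_{n}-ax_{n}-b) = 0 \;\Rightarrow\; b=\bar{y}-a\bar{x} \\
\frac{\partial S}{\partial a} &= -2\sum_{n=1}^{N}x_{n}(y_{n}-ax_{n}-b) = 0 \;\Rightarrow\; \sum_{n=1}^{N}x_{n}\left[(y_{n}-\bar{y})-a(x_{n}-\bar{x})\right] = 0
\end{align*}

Because the deviations from the mean sum to zero, x_n in the last sum can be replaced by (x_n − x̄) without changing its value, which gives the expression for a in Equation 19.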

The confidence interval for the curve fit can be obtained by assuming the errors are normally distributed. The variance of the residuals is given in Equation 20, where N − 2 is used in the denominator because determining a and b from the data has reduced the number of degrees of freedom by two.

(3)   \begin{equation*} \sigma_{\varepsilon}^2=\frac{1}{N-2}\sum_{n=1}^{N}\varepsilon_{n}^2 \end{equation*}

Equation 20
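As a minimal sketch, Equations 19 and 20 can be written in Python with NumPy. The function names `fit_line` and `residual_variance` and the sample data are made up for illustration; they are not the temperature data of Figure 3.17.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope a and intercept b for y = a*x + b (Equation 19)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - a * x_mean
    return a, b

def residual_variance(x, y, a, b):
    """Variance of the residuals with N - 2 degrees of freedom (Equation 20)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - a * x - b
    return np.sum(residuals ** 2) / (len(x) - 2)

# Hypothetical illustrative data, chosen only to exercise the formulas
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = fit_line(x, y)
var_eps = residual_variance(x, y, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, residual variance = {var_eps:.4f}")
```

The slope and intercept should agree with `np.polyfit(x, y, 1)`, which solves the same least-squares problem; writing the sums out explicitly simply keeps the code close to the notation of Equation 19.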

The variance associated with the measurement of the y_n data, σ_yn², must now be added. It is often assumed that the error bound specified for a measurement is equal to 2σ_yn, based on a 95% confidence level. Then, the standard deviation for the line values, y, is:

(4)   \begin{equation*} \sigma_{y}=\sqrt{\sigma_{\varepsilon}^2+\sigma_{y_{n}}^2} \end{equation*}

Equation 21

A common measure for the “goodness of fit” for the regression analysis is the R-squared coefficient provided in Equation 22:

(5)   \begin{equation*} R^2=1-\frac{\sum_{n=1}^{N}\varepsilon_{n}^2}{\sum_{n=1}^{N}(y_{n}-\bar{y})^2} \end{equation*}

Equation 22
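Equation 22 translates directly into a few lines of Python. The sketch below assumes NumPy; the function name `r_squared` and the sample data are hypothetical, and `np.polyfit` is used only as a convenient way to obtain a and b for the example.

```python
import numpy as np

def r_squared(x, y, a, b):
    """R-squared for the line y = a*x + b (Equation 22)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_res = np.sum((y - a * x - b) ** 2)   # sum of squared residuals
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares about the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical illustrative data
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = np.polyfit(x, y, 1)   # least-squares slope and intercept
r2 = r_squared(x, y, a, b)
print(f"R-squared = {r2:.4f}")
```

For a straight-line least-squares fit, this value should match the square of the sample correlation coefficient returned by `np.corrcoef(x, y)[0, 1]`.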

For a straight-line fit, the square root of R² equals the magnitude of the correlation coefficient, |r| = √(R²). In the example of the global surface temperatures in Figure 3.17, σ_ε = 0.133 °C. The error bound for each temperature measurement is given as ±0.1 °C, so it is assumed that σ_yn = 0.050 °C. Then, σ_y = 0.14 °C and the 95% confidence interval is ±0.28 °C. The calculated value for R-squared is 0.83.
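The arithmetic behind the combined standard deviation in the temperature example can be checked by substituting the two quoted values into Equation 21 (all values in °C):

\begin{align*}
\sigma_{y} &= \sqrt{\sigma_{\varepsilon}^2+\sigma_{y_{n}}^2} = \sqrt{0.133^2+0.050^2} = \sqrt{0.017689+0.0025} \approx 0.142 \\
2\sigma_{y} &\approx 0.28
\end{align*}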

Using Log Values for Linear Regression

In many cases, data do not fit a straight line well when plotted on linear scales. Rather than fitting a non-linear curve, it is often convenient to apply linear regression to logarithmic values: for example, log(y) vs. x, y vs. log(x), or log(y) vs. log(x).

Consider the following: the damping in a structure causes the vibration to decay at the theoretical rate of A(t) = A_0 e^(−2πfζt) after an impact, where f is the frequency in Hz, t is the time in seconds, and ζ is the damping ratio (the fraction of critical damping).

To determine the damping value, ζ, the RMS acceleration level is measured versus time after the structure is impacted (see Figure 3.18a for f = 200 Hz). Taking the logarithm of the theoretical decay gives a straight line in time, log(A) = log(A_0) − [2πfζ log(e)] t, as shown in Figure 3.18b. In this example, the linear regression gives a slope of −8.28, so ζ = 8.28 / [2πf log(e)] = 0.015.
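The conversion from regression slope to damping ratio can be sketched in Python. This assumes "log" in the text means the base-10 logarithm (suggested by the explicit log(e) factor); the function name `damping_ratio_from_slope` and the synthetic, noise-free decay used for the round-trip check are illustrative assumptions, not the measured data of Figure 3.18.

```python
import math
import numpy as np

def damping_ratio_from_slope(slope, freq_hz):
    """Damping ratio from the slope of log10(A) vs. time (slope in 1/s)."""
    return -slope / (2.0 * math.pi * freq_hz * math.log10(math.e))

# Values quoted in the text: slope of -8.28 at f = 200 Hz
zeta = damping_ratio_from_slope(-8.28, 200.0)
print(f"zeta from quoted slope = {zeta:.4f}")

# Round-trip check on a synthetic decay (assumed values, no noise)
f_hz, zeta_true, a0 = 200.0, 0.015, 1.0
t = np.linspace(0.0, 0.5, 50)
amplitude = a0 * np.exp(-2.0 * math.pi * f_hz * zeta_true * t)

# Straight-line fit of log10(A) against t; polyfit returns [slope, intercept]
slope_fit, intercept_fit = np.polyfit(t, np.log10(amplitude), 1)
zeta_fit = damping_ratio_from_slope(slope_fit, f_hz)
print(f"zeta recovered from synthetic data = {zeta_fit:.4f}")
```

With measured data, the fit would normally be restricted to the portion of the record where the decay is clearly above the noise floor, since the logarithm exaggerates scatter at low amplitudes.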


Figure 3.18a. RMS acceleration level measured vs. time after the structure is impacted for f = 200 Hz.


Figure 3.18b. Regression analysis of transient vibration decay.