Regression Analysis and Least Squares
March 29, 2018
In non-mathematical terminology Regression Analysis involves fitting smooth curves to scattered data. Around 1800, determining the “best” method to do this for the orbits of the planets and comets was a major motivating factor for the development of the Normal Distribution , the Central Limit Theorem , and the method of Least Squares . And regression analysis is still used today for important data (e.g., Fig. 18).
Linear regression in its simplest form fits a straight line to a series of data points (xn, yn), n = 1, …, N, as illustrated in Fig. 18. The equation for the line is y = a x + b, where a and b are to be determined. The error, or residual, is the difference between the line and data points at each xn: en = yn – a xn – b. The method of Least Squares is used to determine the “best” values for a and b. This is done by summing the squares of the residuals, en, and finding the values of a and b that minimize this sum. (Using calculus this involves setting to zero the derivatives of the sum, with respect to a and b, and solving these two equation for a and b.) The resulting formulas are
where are the mean values.
The confidence interval for this curve fit can be obtained by assuming the errors are Normally distributed. The variance of the residuals is given by
where N – 2 is used in the denominator since the two equations for a and b have reduced the number of degrees of freedom by two. To this must be added the variance of the determination of the yn data, σyn2. (It is often assumed that the error bound specified for a measurement is equal to 2 σy, based on a 95% confidence level.) Then the standard deviation for the line values, y, is
A common measure for the “goodness of fit” for the regression analysis is the R-square coefficient given by
The square root of R2 is often called the correlation coefficient, . In the example of the global surface temperatures in Fig. 18, σe = 0.133 ºC. The error bound for each temperature measurement is given as ± 0.1 ºC, so it is assumed that σyn= 0.050 ºC. Then σy = 0.14 ºC and the 95% confidence interval is ± 0.28 ºC. The calculated value for R-square is R2 = 0.83.
In many cases the data do not fit a straight line very well when using linear scales. Rather than trying non-linear curve fits in this format, it is often convenient to use log-values for the linear regression. For example log(y) vs. x; y vs. log(x); or log(y) vs. log (x). As an illustration, the damping in a structure causes the vibration to decay at the rate of A(t) = Ao e-2π f ζ t (theoretically) after an impact (where f is the frequency in Hz, t is the time in sec., and ζ is the critical damping ratio.). So to determine the damping value ζ, the rms acceleration level is measured vs. time after the structure is impacted, as shown in Fig. 19a for f = 200 Hz. The theoretical decay will be a straight line when plotted as log(A) = log(Ao) – [2π f ζ log(e)] t, shown in Fig. 19b. In this example the linear regression gives the slope to be -8.28, so ζ = 8.28 / [2π f log(e)] = 0.015.