Nonlinear and Multiple Regression

Presentation on theme: "Nonlinear and Multiple Regression"— Presentation transcript:

1 Nonlinear and Multiple Regression
13 Nonlinear and Multiple Regression Copyright © Cengage Learning. All rights reserved.

2 13.2 Regression with Transformed Variables
Copyright © Cengage Learning. All rights reserved.

3 Regression with Transformed Variables
The necessity for an alternative to the linear model Y = β₀ + β₁x + ε may be suggested either by a theoretical argument or else by examining diagnostic plots from a linear regression analysis. In either case, settling on a model whose parameters can be easily estimated is desirable. An important class of such models is specified by means of functions that are “intrinsically linear.”

4 Regression with Transformed Variables
Definition A function relating y to x is intrinsically linear if, by means of a transformation on x and/or y, it can be expressed as y′ = β₀ + β₁x′, where x′ is the transformed independent variable and y′ is the transformed dependent variable.

5 Regression with Transformed Variables
Four of the most useful intrinsically linear functions are given in Table 13.1. In each case, the appropriate transformation is either a log transformation—either base 10 or natural logarithm (base e)—or a reciprocal transformation. Useful Intrinsically Linear Functions Table 13.1

6 Regression with Transformed Variables
Representative graphs of the four functions appear in Figure 13.3. Graphs of the intrinsically linear functions given in Table 13.1 Figure 13.3

7 Regression with Transformed Variables
For an exponential function relationship, only y is transformed to achieve linearity, whereas for a power function relationship, both x and y are transformed. Because the variable x is in the exponent in an exponential relationship, y increases (if β > 0) or decreases (if β < 0) much more rapidly as x increases than is the case for the power function, though over a short interval of x values it can be difficult to differentiate between the two functions. Examples of functions that are not intrinsically linear are y = α + γe^(βx) and y = α + γx^β.

8 Regression with Transformed Variables
Intrinsically linear functions lead directly to probabilistic models that, though not linear in x as a function, have parameters whose values are easily estimated using ordinary least squares. Definition A probabilistic model relating Y to x is intrinsically linear if, by means of a transformation on Y and/or x, it can be reduced to a linear probabilistic model Y′ = β₀ + β₁x′ + ε′.

9 Regression with Transformed Variables
The intrinsically linear probabilistic models that correspond to the four functions of Table 13.1 are as follows: a. Y = αe^(βx) · ε, a multiplicative exponential model, from which ln(Y) = Y′ = β₀ + β₁x′ + ε′ with x′ = x, β₀ = ln(α), β₁ = β, and ε′ = ln(ε). Useful Intrinsically Linear Functions Table 13.1

10 Regression with Transformed Variables
b. Y = αx^β · ε, a multiplicative power model, so that log(Y) = Y′ = β₀ + β₁x′ + ε′ with x′ = log(x), β₀ = log(α), β₁ = β, and ε′ = log(ε). c. Y = α + β log(x) + ε, so that x′ = log(x) immediately linearizes the model. d. Y = α + β · (1/x) + ε, so that x′ = 1/x yields a linear model. The additive exponential and power models, Y = αe^(βx) + ε and Y = αx^β + ε, are not intrinsically linear.

11 Regression with Transformed Variables
Notice that both (a) and (b) require a transformation on Y and, as a result, a transformation on the error variable ε. In fact, if ε has a lognormal distribution (so that ε′ = ln(ε) is normally distributed) with mean and variance that do not depend on x, then the transformed models for both (a) and (b) will satisfy all the assumptions regarding the linear probabilistic model; this in turn implies that all inferences for the parameters of the transformed model based on these assumptions will be valid. If σ² is small, μ_Y·x ≈ αe^(βx) in (a) or αx^β in (b).

12 Regression with Transformed Variables
The major advantage of an intrinsically linear model is that the parameters β₀ and β₁ of the transformed model can be immediately estimated using the principle of least squares simply by substituting x′ and y′ into the estimating formulas:
β̂₁ = [Σx′ᵢy′ᵢ – (Σx′ᵢ)(Σy′ᵢ)/n] / [Σx′ᵢ² – (Σx′ᵢ)²/n],  β̂₀ = ȳ′ – β̂₁x̄′  (13.5)
Parameters of the original nonlinear model can then be estimated by transforming back β̂₀ and/or β̂₁ if necessary.
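To make the mechanics concrete, here is a minimal Python sketch (not part of the original text) of formula (13.5) applied to the multiplicative power model Y = αx^β · ε; the data arrays are hypothetical placeholders, not the tool-life data of Example 13.3.

import numpy as np

# hypothetical (x, y) data that roughly follow a power relationship
x = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0])
y = np.array([52.0, 20.0, 11.0, 7.5, 5.6, 4.4])

xp = np.log(x)   # x' = ln(x)
yp = np.log(y)   # y' = ln(y)
n = len(xp)

# least-squares estimates (13.5) computed on the transformed scale
b1 = (np.sum(xp * yp) - np.sum(xp) * np.sum(yp) / n) / (np.sum(xp ** 2) - np.sum(xp) ** 2 / n)
b0 = yp.mean() - b1 * xp.mean()

# transforming back gives estimates of the original power-model parameters
alpha_hat = np.exp(b0)   # alpha = e^(beta0) for the natural-log transformation
beta_hat = b1            # beta = beta1
print(alpha_hat, beta_hat)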

13 Regression with Transformed Variables
Once a prediction interval for y′ when x′ = x*′ has been calculated, reversing the transformation gives a PI for y itself. In cases (a) and (b), when σ² is small, an approximate CI for μ_Y·x* results from taking antilogs of the limits in the CI for β₀ + β₁x*′. (Strictly speaking, taking antilogs gives a CI for the median of the Y distribution, i.e., for μ̃_Y·x*. Because the lognormal distribution is positively skewed, μ_Y·x* > μ̃_Y·x*; the two are approximately equal if σ² is close to 0.)

14 Example 13.3 Taylor’s equation for tool life y as a function of cutting time x states that xy^c = k or, equivalently, that y = αx^β. The article “The Effect of Experimental Error on the Determination of Optimum Metal Cutting Conditions” (J. of Engr. for Industry, 1967: 315–322) observes that the relationship is not exact (deterministic) and that the parameters α and β must be estimated from data.

15 Example 13.3 cont’d Thus an appropriate model is the multiplicative power model Y = α · x^β · ε, which the author fit to the accompanying data consisting of 12 carbide tool life observations (Table 13.2). Data for Example 3 Table 13.2

16 Example 13.3 cont’d In addition to the x, y, x′, and y′ values, the predicted transformed values (ŷ′) and the predicted values on the original scale (ŷ, after transforming back) are given. The summary statistics for fitting a straight line to the transformed data are the sums Σx′ᵢ, Σy′ᵢ, Σx′ᵢ², Σy′ᵢ², and Σx′ᵢy′ᵢ, from which β̂₁ and β̂₀ are computed as in (13.5).

17 Example 13.3 cont’d The estimated values of α and β, the parameters of the power function model, are then obtained by transforming back: α̂ = e^β̂₀ and β̂ = β̂₁.

18 Example 13.3 cont’d Thus the estimated regression function is ŷ = α̂ · x^(–5.3996), with α̂ on the order of 10^15. To recapture Taylor’s (estimated) equation, set y = α̂ · x^(–5.3996), whence xy^.185 = 740.

19 Example 13.3 cont’d Figure 13.4(a) gives a plot of the standardized residuals from the linear regression using transformed variables (for which r² = .922); there is no apparent pattern in the plot, though one standardized residual is a bit large, and the residuals look as they should for a simple linear regression. (a) Standardized residuals versus x′ from Example 3 Figure 13.4

20 Example 13.3 cont’d Figure 13.4(b) pictures a plot of ŷ versus y, which indicates satisfactory predictions on the original scale. (b) ŷ versus y from Example 3 Figure 13.4

21 Example 13.3 cont’d To obtain a confidence interval for median tool life when cutting time is 500, we transform x* = 500 to x*′ = ln(500) = 6.2146. Then β̂₀ + β̂₁x*′ = 2.112, and a 95% CI for β₀ + β₁(6.2146) is 2.112 ± (2.228)(.0824) = (1.928, 2.296). The 95% CI for the median μ̃_Y·500 is then obtained by taking antilogs: (e^1.928, e^2.296) = (6.876, 9.930). It is easily checked that s² for the transformed data is quite small; because of this, (6.876, 9.930) is also an approximate interval for μ_Y·500.
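As an illustration of this back-transformation, a short sketch (hypothetical code, not from the article) reproduces the interval above from the quantities quoted in the text:

import numpy as np
from scipy import stats

x_star_prime = np.log(500.0)               # x*' = ln(500) = 6.2146
point = 2.112                               # estimated beta0 + beta1 * x*' (from the text)
se = 0.0824                                 # its estimated standard error (from the text)
t_crit = stats.t.ppf(0.975, df=10)          # 2.228 for n = 12 observations
lo, hi = point - t_crit * se, point + t_crit * se
print(lo, hi)                               # approx (1.928, 2.296) on the ln scale
print(np.exp(lo), np.exp(hi))               # approx (6.876, 9.930): CI for median tool life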

22 Regression with Transformed Variables
In analyzing transformed data, one should keep in mind the following points: 1. Estimating β₁ and β₀ as in (13.5) and then transforming back to obtain estimates of the original parameters is not equivalent to using the principle of least squares directly on the original model. Thus, for the exponential model, we could estimate α and β by minimizing Σ(yᵢ – αe^(βxᵢ))². Iterative computation would be necessary. In general, the resulting α̂ ≠ e^β̂₀ and β̂ ≠ β̂₁.

23 Regression with Transformed Variables
2. If the chosen model is not intrinsically linear, the approach summarized in (13.5) cannot be used. Instead, least squares (or some other fitting procedure) would have to be applied to the untransformed model. Thus, for the additive exponential model Y = αe^(βx) + ε, least squares would involve minimizing Σ(yᵢ – αe^(βxᵢ))². Taking partial derivatives with respect to α and β results in two nonlinear normal equations in α and β; these equations must then be solved using an iterative procedure.
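A minimal sketch of this direct approach, using scipy.optimize.curve_fit as the iterative least-squares routine; the data and starting values are hypothetical, not taken from the text.

import numpy as np
from scipy.optimize import curve_fit

def additive_exponential(x, alpha, beta):
    # mean function of the additive exponential model Y = alpha * exp(beta * x) + eps
    return alpha * np.exp(beta * x)

# hypothetical data showing roughly exponential growth
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([5.1, 7.2, 10.4, 14.8, 21.9, 31.0])

# curve_fit iteratively minimizes sum((y_i - alpha * exp(beta * x_i))**2)
(alpha_hat, beta_hat), cov = curve_fit(additive_exponential, x, y, p0=(1.0, 0.5))
print(alpha_hat, beta_hat)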

24 Regression with Transformed Variables
3. When the transformed linear model satisfies all the assumptions described in Chapter 12, the method of least squares yields best estimates of the transformed parameters. However, estimates of the original parameters may not be best in any sense, though they will be reasonable. For example, in the exponential model, the estimator α̂ = e^β̂₀ will not be unbiased, though it will be the maximum likelihood estimator of α if the error variable ε′ is normally distributed. Using least squares directly (without transforming) could yield better estimates.

25 Regression with Transformed Variables
4. If a transformation on y has been made and one wishes to use the standard formulas to test hypotheses or construct CIs, ε′ should be at least approximately normally distributed. To check this, a normal probability plot of the standardized residuals from the transformed regression should be examined.

26 Regression with Transformed Variables
5. When y is transformed, the r² value from the resulting regression refers to variation in the y′ᵢ’s explained by the transformed regression model. Although a high value of r² here indicates a good fit of the estimated original nonlinear model to the observed yᵢ’s, r² does not refer to these original observations. Perhaps the best way to assess the quality of the fit is to compute the predicted values ŷ′ᵢ using the transformed model, transform them back to the original y scale to obtain ŷᵢ, and then plot ŷ versus y.

27 Regression with Transformed Variables
A good fit is then evidenced by points close to the 45° line. One could compute SSE = Σ(yᵢ – ŷᵢ)² as a numerical measure of the goodness of fit. When the model was linear, we compared this to SST = Σ(yᵢ – ȳ)², the total variation about the horizontal line at height ȳ; this led to r². In the nonlinear case, though, it is not necessarily informative to measure total variation in this way, so an r² value is not as useful as in the linear case.
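A short sketch of this check for a multiplicative power fit; the data and the fitted values alpha_hat and beta_hat are hypothetical stand-ins, not results from the text.

import numpy as np
import matplotlib.pyplot as plt

# hypothetical data and hypothetical power-model estimates from a log-log fit
x = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0])
y = np.array([52.0, 20.0, 11.0, 7.5, 5.6, 4.4])
alpha_hat, beta_hat = 3.0e4, -1.38

y_hat = alpha_hat * x ** beta_hat        # predictions transformed back to the original y scale
sse = np.sum((y - y_hat) ** 2)           # numerical measure of goodness of fit
print(sse)

plt.scatter(y_hat, y)                    # points near the 45-degree line indicate a good fit
plt.axline((0.0, 0.0), slope=1.0)        # 45-degree reference line
plt.xlabel("predicted y")
plt.ylabel("observed y")
plt.show()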

28 More General Regression Methods

29 More General Regression Methods
Thus far we have assumed that either Y = f(x) + ε (an additive model) or that Y = f(x) · ε (a multiplicative model). In the case of an additive model, μ_Y·x = f(x), so estimating the regression function f(x) amounts to estimating the curve of mean y values. On occasion, a scatter plot of the data suggests that there is no simple mathematical expression for f(x). Statisticians have recently developed some more flexible methods that permit a wide variety of patterns to be modeled using the same fitting procedure.

30 More General Regression Methods
One such method is LOWESS (or LOESS), short for locally weighted scatter plot smoother. Let (x*, y*) denote a particular one of the n (x, y) pairs in the sample. The ŷ value corresponding to (x*, y*) is obtained by fitting a straight line using only a specified percentage of the data (e.g., 25%) whose x values are closest to x*. Furthermore, rather than use “ordinary” least squares, which gives equal weight to all points, those with x values closer to x* are more heavily weighted than those whose x values are farther away.

31 More General Regression Methods
The height of the resulting line above x* is the fitted value ŷ*. This process is repeated for each of the n points, so n different lines are fit (you surely wouldn’t want to do all this by hand). Finally, the fitted points are connected to produce a LOWESS curve.
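A minimal sketch of producing such a curve with the lowess smoother in the Python statsmodels package; the data here are simulated, not the bear data of Example 13.5.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)     # simulated data with a nonlinear pattern

# frac=0.5 fits each local line to the 50% of points whose x values are closest,
# with nearer points weighted more heavily than distant ones
smoothed = lowess(y, x, frac=0.5)                   # array of (x, fitted value) pairs

plt.scatter(x, y, s=10)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")   # the LOWESS curve
plt.show()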

32 Example 13.5 Weighing large deceased animals found in wilderness areas is usually not feasible, so it is desirable to have a method for estimating weight from various characteristics of an animal that can be easily determined. Minitab has a stored data set consisting of various characteristics for a sample of n = 143 wild bears.

33 Example 13.5 cont’d Figure 13.7(a) displays a scatter plot of y = weight versus x = distance around the chest (chest girth). A Minitab scatter plot for the bear weight data Figure 13.7 (a)

34 Example 13.5 cont’d At first glance, it looks as though a single line obtained from ordinary least squares would effectively summarize the pattern. Figure 13.7(b) shows the LOWESS curve produced by Minitab using a span of 50% [the fit at (x, y) is determined by the closest 50% of the sample]. A Minitab LOWESS curve for the bear weight data Figure 13.7 (b)

35 Example 13.5 cont’d The curve appears to consist of two straight line segments joined together above approximately x = 38. The steeper line is to the right of 38, indicating that weight tends to increase more rapidly as girth does for girths exceeding 38 in.

36 More General Regression Methods
It is complicated to make other inferences (e.g., obtain a CI for a mean y value) based on this general type of regression model. The bootstrap technique mentioned earlier can be used for this purpose.

37 Logistic Regression

38 Logistic Regression The simple linear regression model is appropriate for relating a quantitative response variable to a quantitative predictor x. Consider now a dichotomous response variable with possible values 1 and 0 corresponding to success and failure. Let p = P(S) = P(Y = 1). Frequently, the value of p will depend on the value of some quantitative variable x.

39 Logistic Regression For example, the probability that a car needs warranty service of a certain kind might well depend on the car’s mileage, or the probability of avoiding an infection of a certain type might depend on the dosage in an inoculation. Instead of using just the symbol p for the success probability, we now use p(x) to emphasize the dependence of this probability on the value of x. The simple linear regression equation Y = β₀ + β₁x + ε is no longer appropriate, for taking the mean value on each side of the equation gives p(x) = β₀ + β₁x.

40 Logistic Regression Whereas p(x) is a probability and therefore must be between 0 and 1, β₀ + β₁x need not be in this range. Instead of letting the mean value of Y be a linear function of x, we now consider a model in which some function of the mean value of Y is a linear function of x. In other words, we allow p(x) to be a function of β₀ + β₁x rather than β₀ + β₁x itself. A function that has been found quite useful in many applications is the logit function p(x) = e^(β₀ + β₁x) / (1 + e^(β₀ + β₁x)).
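A small Python sketch of this function (the parameter values are illustrative only), confirming that it always returns a value between 0 and 1:

import numpy as np

def logit_p(x, beta0, beta1):
    # logistic (logit) model for the success probability p(x)
    return np.exp(beta0 + beta1 * x) / (1.0 + np.exp(beta0 + beta1 * x))

x = np.linspace(-10.0, 10.0, 5)
print(logit_p(x, beta0=0.5, beta1=1.2))   # every value lies strictly between 0 and 1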

41 Logistic Regression Figure 13.8 shows a graph of p(x) for particular values of β₀ and β₁ with β₁ > 0. A graph of a logit function Figure 13.8

42 Logistic Regression As x increases, the probability of success increases. For β₁ negative, the success probability would be a decreasing function of x. Logistic regression means assuming that p(x) is related to x by the logit function. Straightforward algebra shows that p(x) / [1 − p(x)] = e^(β₀ + β₁x). The expression on the left-hand side is called the odds. If, for example, p(60) = .75, then when x = 60 a success is three times as likely as a failure.

43 Logistic Regression We now see that the logarithm of the odds is a linear function of the predictor. In particular, the slope parameter β₁ is the change in the log odds associated with a one-unit increase in x. This implies that the odds itself changes by the multiplicative factor e^β₁ when x increases by 1 unit.

44 Logistic Regression Fitting the logistic regression to sample data requires that the parameters β₀ and β₁ be estimated. This is usually done using the maximum likelihood technique described in Chapter 6. The details are quite involved, but fortunately the most popular statistical computer packages will do this on request and provide quantitative and pictorial indications of how well the model fits.
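A minimal sketch of such a fit in Python with statsmodels, which carries out the maximum likelihood computation; the temperature/failure data below are hypothetical placeholders, not the shuttle data of Example 13.6.

import numpy as np
import statsmodels.api as sm

# hypothetical dichotomous data: y = 1 for failure, 0 for no failure
x = np.array([53.0, 57.0, 60.0, 63.0, 66.0, 69.0, 72.0, 75.0, 78.0, 81.0])
y = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])

X = sm.add_constant(x)              # adds the intercept column for beta0
fit = sm.Logit(y, X).fit(disp=0)    # maximum likelihood estimation of beta0 and beta1
print(fit.summary())                # estimates, standard errors, z statistics, P-values

beta0_hat, beta1_hat = fit.params
print(np.exp(beta1_hat))            # estimated odds ratio for a one-unit increase in x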

45 Example 13.6 Here is data, in the form of a comparative stem-and-leaf display, on launch temperature and the incidence of failure of O-rings in 23 space shuttle launches prior to the Challenger disaster of 1986 (Y = yes, failed; N = no, did not fail). Observations on the left side of the display tend to be smaller than those on the right side. Stem: Tens digit Leaf : Ones digit

46 Example 13.6 cont’d Figure 13.9 shows Minitab output for a logistic regression analysis and a graph of the estimated logit function from the R software. (a) Logistic regression output from Minitab (b) Graph of estimated logistic function and classification probabilities from R Figure 13.9

47 Example 13.6 cont’d We have chosen to let p denote the probability of failure. The graph of p̂(x) decreases as temperature increases because failures tended to occur at lower temperatures than did successes. The estimate of β₁ and its estimated standard deviation are β̂₁ = –.232 and s_β̂₁ = .1082, respectively. We assume that the sample size n is large enough here so that β̂₁ has approximately a normal distribution.

48 Example 13.6 cont’d If β₁ = 0 (i.e., temperature does not affect the likelihood of O-ring failure), the test statistic z = β̂₁ / s_β̂₁ has approximately a standard normal distribution. The reported value of this ratio is z = –2.14, with a corresponding two-tailed P-value of .032 (some packages report a chi-square value, which is just z², with the same P-value). At significance level .05, we reject the null hypothesis of no temperature effect.

49 Example 13.6 cont’d The estimated odds of failure for any particular temperature value x is p̂(x) / [1 − p̂(x)] = e^(β̂₀ + β̂₁x). This implies that the odds ratio—the odds of failure at a temperature of x + 1 divided by the odds of failure at a temperature of x—is e^β̂₁ = e^(–.232) ≈ .79.

50 Example 13.6 cont’d The interpretation is that for each additional degree of temperature, we estimate that the odds of failure will decrease by a factor of .79 (21%). A 95% CI for the true odds ratio also appears on output. In addition, Minitab provides three different ways of assessing model lack-of-fit: the Pearson, deviance, and Hosmer-Lemeshow tests. Large P-values are consistent with a good model.

51 Example 13.6 cont’d These tests are useful in multiple logistic regression, where there is more than one predictor in the model relationship so there is no single graph like that of Figure 13.9(b). (b) graph of estimated logistic function and classification probabilities from R Figure 13.9

52 Example 13.6 cont’d Various diagnostic plots are also available. The R output provides information based on classifying an observation as a failure if the estimated p(x) is at least .5 and as a non-failure otherwise. Since p̂(x) = .5 when x = 64.80, three of the seven failures (Ys in the graph) would be misclassified as non-failures (a misclassification proportion of .429), whereas none of the non-failure observations would be misclassified.

53 Example 13.6 cont’d A better way to assess the likelihood of misclassification is to use cross-validation: Remove the first observation from the sample, estimate the relationship, then classify the first observation based on this estimated relationship, and repeat this process with each of the other sample observations (so a sample observation does not affect its own classification). The launch temperature for the Challenger mission was only 31°F. This temperature is much smaller than any value in the sample, so it is dangerous to extrapolate the estimated relationship. Nevertheless, it appears that O-ring failure is virtually a sure thing for a temperature this small.
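A minimal sketch of that leave-one-out scheme, reusing the hypothetical data and statsmodels fit from the earlier sketch (not the authors' code):

import numpy as np
import statsmodels.api as sm

x = np.array([53.0, 57.0, 60.0, 63.0, 66.0, 69.0, 72.0, 75.0, 78.0, 81.0])
y = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])

misclassified = 0
for i in range(len(y)):
    keep = np.arange(len(y)) != i                          # leave observation i out
    fit = sm.Logit(y[keep], sm.add_constant(x[keep])).fit(disp=0)
    p_i = fit.predict(np.array([[1.0, x[i]]]))[0]          # estimated p(x_i) from the remaining data
    if (p_i >= 0.5) != (y[i] == 1):                        # classify as a failure when p >= .5
        misclassified += 1

print(misclassified / len(y))                              # cross-validated misclassification proportion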

