Lecture 19 Transformations, Predictions after Transformations Other diagnostic tools: Residual plot for nonconstant variance, histogram to check normality.


1 Lecture 19 Transformations, Predictions after Transformations Other diagnostic tools: Residual plot for nonconstant variance, histogram to check normality assumption and residual plots for special situations.

2 Regression Diagnostics The conditions required for inference from simple linear regression must be checked: –Linearity. Diagnostic: Residual plot. –Constant variance. Diagnostic: Residual plot. –Normality. Diagnostic: Histogram of residuals. –Independence. Diagnostic: Residual plot. –Outliers and influential points. Diagnostic: Scatterplot.

3 Data Set: display.JMP A large chain of liquor stores (such as the one supervised by the Pennsylvania Liquor Control Board) would like to know how display space is related to sales (e.g., do wines with more display space sell more?). The chain collected sales and display space data from 47 of its stores that do comparable business.

4

5 Transformations in JMP 1. Use Tukey’s Bulging rule (see handout) to determine transformations which might help. 2. After Fit Y by X, click the red triangle next to Bivariate Fit and click Fit Special. Experiment with transformations suggested by Tukey’s Bulging rule. 3. Make residual plots of the residuals for the transformed model vs. the original X by clicking the red triangle next to Transformed Fit to … and clicking Plot Residuals. Choose the transformation that makes the residual plot look most like random scatter with no pattern in the mean of the residuals vs. X. If no transformation works, polynomial regression (Ch. 9) will be needed.
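Outside JMP, the same experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (generated from a log curve, not taken from display.JMP), and the helper names `fit_line` and `mean_residual_by_x` are hypothetical. For each candidate transformation g(x), the sketch fits y on g(x) and reports the mean residual at each original x value, which is a numerical stand-in for looking for a pattern in the residual plot.

```python
import math
from collections import defaultdict

def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mean_residual_by_x(xs, ys, g):
    """Fit y on g(x), then average the residuals at each original x value."""
    gx = [g(x) for x in xs]
    b0, b1 = fit_line(gx, ys)
    groups = defaultdict(list)
    for x, t, y in zip(xs, gx, ys):
        groups[x].append(y - (b0 + b1 * t))
    return {x: sum(r) / len(r) for x, r in sorted(groups.items())}

# Synthetic data (an assumption for illustration, NOT the display.JMP data):
# y follows a log curve in x, with a +5 / -5 disturbance at each x value.
xs = [1, 2, 3, 4, 5, 6, 7] * 2
noise = [5 if i % 2 == 0 else -5 for i in range(len(xs))]
ys = [100 + 80 * math.log(x) + e for x, e in zip(xs, noise)]

candidates = {"x": float, "sqrt(x)": math.sqrt, "log(x)": math.log}

# Largest absolute mean residual at any x: small means "no pattern in the mean".
worst = {}
for name, g in candidates.items():
    means = mean_residual_by_x(xs, ys, g)
    worst[name] = max(abs(m) for m in means.values())
    print(f"{name:8s} largest |mean residual| at any x = {worst[name]:.2f}")
```

On these made-up data the log transformation leaves essentially no pattern in the mean of the residuals, while the untransformed fit leaves the largest one. With real data the comparison should still be made by eye on the residual plots, as the slide describes.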

6 Residual plot for transformation to log x looks a little bit better – plot for shows systematic overestimation for DisplayFeet=1 and systematic underestimation for DisplayFeet=2.

7 Case Study 8.1.1 Biologists are interested in the relationship between the area of islands (X) and the number of animal and plant species (Y) living on them. –Estimates of this relationship are useful in conservation biology for predicting species extinction rates due to diminishing habitat. Data in Display 8.1 are number of reptile and amphibian species and the island areas for seven islands in the West Indies.

8 Scatterplots for Species Data Regression function does not appear to be linear.

9 Transformations for Species Data

10 Prediction After Transformation To predict y given x (or to estimate $E(y \mid x)$) when y has been transformed to f(y) and x to g(x), fit the line to the transformed variables and back-transform the fitted value: $\hat{y} = f^{-1}(\hat{\beta}_0 + \hat{\beta}_1 g(x))$. Species data, log-log transformation: Y is transformed to log Y and X to log X (natural logs). Predicted number of species given area = 30000: –Predicted log number of species given log area = log(30000) = 10.31 equals 1.94 + 0.25*10.31 = 4.52. –Predicted number of species given area = 30000 equals exp(predicted log number of species) = exp(4.52) = 91.84.
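The back-transformation step can be checked with a short Python sketch. The coefficients 1.94 and 0.25 are the ones quoted on the slide; the function name `predict_after_transform` and its argument names are hypothetical, chosen only for this illustration.

```python
import math

def predict_after_transform(x, b0, b1, g, f_inverse):
    """Predict y at x when the line was fit as f(y) = b0 + b1 * g(x)."""
    return f_inverse(b0 + b1 * g(x))

# Species data: log-log fit with the coefficients quoted on the slide.
b0, b1 = 1.94, 0.25
area = 30000

log_area = math.log(area)                # natural log, about 10.31
pred_log_species = b0 + b1 * log_area    # prediction on the log scale
pred_species = predict_after_transform(area, b0, b1, math.log, math.exp)

print(f"log(area)              = {log_area:.2f}")
print(f"predicted log species  = {pred_log_species:.2f}")
print(f"predicted species      = {pred_species:.2f}")
```

Carrying full precision through gives a prediction of about 91.6 species. The slide's 91.84 comes from rounding the log-scale prediction to 4.52 before exponentiating, so the two agree up to rounding.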

11 Second Prediction Example For display data, the square root transformation of y gives the fitted line $\widehat{\sqrt{\text{Sales}}} = 9.62 + 1.44 \cdot \text{DisplayFeet}$. Predicted Sales for DisplayFeet = 5: –Predicted Square Root of Sales for DisplayFeet = 5 equals 9.62 + 1.44*5 = 16.82. –Predicted Sales for DisplayFeet = 5 equals $16.82^2 = 282.91$.
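The same arithmetic in Python, as a minimal sketch. It assumes the fitted line implied by the slide's calculation, square root of Sales = 9.62 + 1.44 * DisplayFeet; the function name `predict_sales` is hypothetical.

```python
# Coefficients implied by the slide's arithmetic for the square-root fit.
B0, B1 = 9.62, 1.44

def predict_sales(display_feet):
    """Predict sqrt(Sales) from the line, then square to undo the square root."""
    pred_sqrt_sales = B0 + B1 * display_feet
    return pred_sqrt_sales ** 2

pred_sqrt = B0 + B1 * 5
pred = predict_sales(5)
print(f"predicted sqrt(Sales) at 5 feet = {pred_sqrt:.2f}")
print(f"predicted Sales at 5 feet       = {pred:.2f}")
```

The inverse of the square root is squaring, so the only step beyond the fitted line is `** 2`.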

12 Nonconstant variance Ideal simple linear regression model assumes that the spread of y about the regression line is the same at every value of x (constant variance). Detection: Look at the residual plot and see if the spread of the residuals increases as x increases (fan pattern) or the spread of the residuals decreases as x increases (inverse fan pattern). Consequences: If there is nonconstant variance, least squares estimates are still unbiased but tests and confidence intervals are misleading. Correction: Transformations of the y variable can be tried.
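A rough numerical companion to the visual check is to compare the spread of the residuals for small x and for large x. The sketch below is an illustration only: the data are synthetic, built so that the disturbance grows in proportion to x, and splitting at the median x is a simplification assumed here rather than a method from the lecture.

```python
import statistics

def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Synthetic data (an assumption): the disturbance alternates in sign and
# its size is proportional to x, which produces a fan pattern.
xs = list(range(1, 21))
ys = [10 + 2 * x + (0.5 * x if i % 2 == 0 else -0.5 * x)
      for i, x in enumerate(xs)]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Compare residual spread below and above the median x.
median_x = statistics.median(xs)
low = [r for x, r in zip(xs, residuals) if x < median_x]
high = [r for x, r in zip(xs, residuals) if x > median_x]
spread_low = statistics.pstdev(low)
spread_high = statistics.pstdev(high)

print(f"residual SD for small x: {spread_low:.2f}")
print(f"residual SD for large x: {spread_high:.2f}")
```

A clearly larger spread in the upper half points to a fan pattern, and a clearly smaller one to an inverse fan. The residual plot itself remains the primary diagnostic.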

13 Heteroscedasticity When the requirement of a constant variance is violated we have a condition of heteroscedasticity. Diagnose heteroscedasticity by plotting the residuals against the predicted y ($\hat{y}$). [Figure: residuals plotted against $\hat{y}$; the spread increases with $\hat{y}$.]

14 Normality Assumption Normality assumption: The subpopulation of responses at the different values of X all have a normal distribution. Diagnostic: Plot a histogram of the residuals (by using Save Residuals after Fit Line). The histogram should look bell shaped if the normality assumption holds. Consequences of violation of normality: For large sample sizes (n>30), the confidence intervals and tests for coefficients are robust to violation of normality. However, if prediction intervals are used, departures from normality become important (for any sample size).
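For readers working outside JMP, saving the residuals and drawing a histogram can be sketched as follows. Everything here is an assumption made for illustration: the data are simulated with normal errors from a fixed seed, and `fit_line` and `histogram_counts` are hypothetical helper names. The histogram is printed as text so the sketch needs only the standard library.

```python
import random

def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min to max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # max value goes in last bin
        counts[k] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts

# Simulated data (an assumption): a straight line plus normal errors.
rng = random.Random(0)
xs = [i / 2 for i in range(1, 61)]
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]

# "Save residuals", then bin them.
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
edges, counts = histogram_counts(residuals, 7)

for k, c in enumerate(counts):
    print(f"{edges[k]:6.2f} to {edges[k + 1]:6.2f} | {'*' * c}")
```

If the normality assumption holds, the rows of stars should be longest near zero and taper off roughly symmetrically on both sides.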

15 Diagnostics for Display data

16 Residual plots versus time A lurking variable is a variable that is not included among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables. Plots of the residuals versus the time order of observation or spatial order of observation can reveal –Serial correlation (violation of independence) –Lurking variables associated with time and space (multiple regression can be used to account for these variables).

17 Residual vs. Time Example The mathematics department at a large state university must plan the number of instructors required for large elementary courses and wants to predict enrollment in elementary math courses (y) based on the number of first-year students (x). Data are in mathenroll.JMP. Residual plot vs. time in JMP: After Fit Y by X and Fit Line, click the red triangle next to Linear Fit and click Save Residuals. Then use Fit Y by X with y = residuals and x = year.
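The same two steps (save the residuals, then look at them against year) can be sketched in Python. The numbers below are made up for illustration and are NOT the mathenroll.JMP data: years are labeled 1 to 12, and an upward shift in enrollment is built in from year 8 onward so that there is something for the residuals to reveal. The names `fit_line` and `shift_year` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data (an assumption for illustration only).
years = list(range(1, 13))
first_year_students = [4000, 4200, 4100, 4300, 4150, 4250,
                       4050, 4200, 4100, 4300, 4000, 4250]
shift_year = 8  # a lurking change built into the fake data
enrollment = [500 + 0.9 * x + (300 if yr >= shift_year else 0)
              for yr, x in zip(years, first_year_students)]

# Step 1: regress enrollment on first-year students and save the residuals.
b0, b1 = fit_line(first_year_students, enrollment)
residuals = [y - (b0 + b1 * x)
             for x, y in zip(first_year_students, enrollment)]

# Step 2: list the residuals in time order (the "plot" of residual vs. year).
for yr, r in zip(years, residuals):
    print(f"year {yr:2d}: residual {r:8.1f}")

# Summarize the pattern: residuals before and after the built-in change.
early = [r for yr, r in zip(years, residuals) if yr < shift_year]
late = [r for yr, r in zip(years, residuals) if yr >= shift_year]
mean_early = sum(early) / len(early)
mean_late = sum(late) / len(late)
print(f"mean residual, years before {shift_year}: {mean_early:.1f}")
print(f"mean residual, years from {shift_year} on: {mean_late:.1f}")
```

With real data the year of any change is not known in advance. It is read off the residual-versus-time plot, where a run of negative residuals followed by a run of positive ones suggests a lurking variable tied to time.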

18 Residual Plots

19 Analysis of Math Enrollment Residual plot versus time order indicates that there must be a lurking variable associated with time; in particular, there is a change in the relationship between y and x between 1997 and 1998. In fact, one of the schools in the university changed its program to require that entering students take another mathematics course beginning in 1998, increasing enrollment. Implication: Data from before 1998 should not be used to predict future math enrollment.

