Chapter 11: Inferential methods in Regression and Correlation


1 Chapter 11: Inferential methods in Regression and Correlation http://jonfwilkins.blogspot.com/2011_08_01_archive.html

2 Example: distribution of y
The relationship between age (x) and the change in systolic blood pressure (y, BP in mm Hg) 24 hours after a particular treatment is described by the linear regression model y = 20.11 – 0.526x + e with σ = 6.52.
a) What is the mean value of y when x = 30? x = 50? x = 70?
b) What is the standard deviation of y when x = 30? x = 50? x = 70?
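As a quick check on parts (a) and (b), the sketch below evaluates the mean of y at each age from the stated model. The helper name `mean_bp_change` is an illustrative choice, not something from the slides. Under the simple linear regression model the standard deviation of y equals σ at every x, so the answer to (b) is 6.52 for all three ages.

```python
# Model from the slide: y = 20.11 - 0.526 x + e, with sd(e) = sigma = 6.52.
ALPHA = 20.11   # intercept
BETA = -0.526   # slope
SIGMA = 6.52    # error standard deviation, the same at every x

def mean_bp_change(age):
    """Mean of y at a given x (hypothetical helper name)."""
    return ALPHA + BETA * age

for age in (30, 50, 70):
    print(f"x = {age}: mean y = {mean_bp_change(age):.2f}, sd y = {SIGMA}")
```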

3 Example: Estimating α and β
The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determining this number for a biodiesel fuel is expensive and time-consuming, so a way of predicting it is wanted. The data on the next slide are x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a 100 g sample of oil.
a) What are the point estimates of α and β?
b) What is a point estimate of the true average cetane number for a biofuel whose iodine value is 100?

4 Example: Estimating α and β (cont)
x: 132.0  129.0  120.0  113.2  105.0   92.0   84.0
y:  46.0   48.0   51.0   52.1   54.0   52.0   59.0
x:  83.2   88.4   59.0   80.0   81.5   71.0   69.2
y:  58.7   61.6   64.0   61.4   54.6   58.8   58.0
a) What are the point estimates of α and β?

5 Example: Estimating α and β (cont)


7 Example: Estimating α and β (cont)
b) What is a point estimate of the true average cetane number for a biofuel whose iodine value is 100?
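Parts (a) and (b) can be checked with a short least-squares sketch. It uses the 14 (x, y) pairs from slide 4 and the usual formulas b = S_xy / S_xx and a = ȳ - b x̄. The function name `least_squares` is illustrative. The results should agree with the Parameter Estimates in the SAS output on slide 10.

```python
# Cetane data from slide 4: x = iodine value, y = cetane number (n = 14).
x = [132.0, 129.0, 120.0, 113.2, 105.0, 92.0, 84.0,
     83.2, 88.4, 59.0, 80.0, 81.5, 71.0, 69.2]
y = [46.0, 48.0, 51.0, 52.1, 54.0, 52.0, 59.0,
     58.7, 61.6, 64.0, 61.4, 54.6, 58.8, 58.0]

def least_squares(x, y):
    """Return (a, b) for the fitted line y-hat = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b

a, b = least_squares(x, y)
y_hat_100 = a + b * 100      # point estimate of mean cetane number at x = 100
print(f"a = {a:.5f}, b = {b:.5f}, y-hat(100) = {y_hat_100:.2f}")
```

The point estimate at an iodine value of 100 comes out near 54.3.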

8 Example: Estimating α and β (cont)
For the cetane data introduced on slide 3 (x = iodine value, y = cetane number, n = 14 biofuels):
c) Find the point estimate of the error standard deviation, σ.
d) What proportion of the observed variation in y can be attributed to the simple linear regression relationship between x and y?
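Parts (c) and (d) follow directly from the sums of squares reported in the SAS Analysis of Variance table on slide 10 (Error SS = 78.91986, Corrected Total SS = 377.17429, n = 14). The notation s_e, SSE, and SST below is one common convention and may differ from the course text.

```latex
\begin{align*}
s_e &= \sqrt{\frac{\mathrm{SSE}}{n-2}}
     = \sqrt{\frac{78.91986}{12}}
     = \sqrt{6.57665} \approx 2.5645, \\
r^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
     = 1 - \frac{78.91986}{377.17429} \approx 0.7908 .
\end{align*}
```

These match the Root MSE and R-Square values in the output: about 79% of the observed variation in cetane number is attributed to the linear relationship with iodine value.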

10 Example: Estimating α and β (SAS)
The REG Procedure
Model: MODEL1
Dependent Variable: cetane
Number of Observations Read                  15
Number of Observations Used                  14
Number of Observations with Missing Values    1
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        298.25443     298.25443     45.35   <.0001
Error             12         78.91986       6.57665
Corrected Total   13        377.17429
Root MSE          2.56450   R-Square   0.7908
Dependent Mean   55.65714   Adj R-Sq   0.7733
Coeff Var         4.60767
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             75.21243          2.98363     25.21     <.0001
iodine       1             -0.20939          0.03109     -6.73     <.0001

11 Example: CI
For the cetane data introduced on slide 3 (x = iodine value, y = cetane number, n = 14 biofuels):
e) What is the 95% CI for the true slope?

12 Example: Output (SAS)
The SAS System, 09:20 Thursday, November 10, 2011
The REG Procedure, Model: MODEL1, Dependent Variable: cetane
(Observation counts, Analysis of Variance, and fit statistics are the same as on slide 10.)
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             75.21243          2.98363     25.21     <.0001   68.71165    81.71321
iodine       1             -0.20939          0.03109     -6.73     <.0001   -0.27713    -0.14164
S_xx = 6802.7693
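The confidence limits for the slope can be reproduced by hand from the reported quantities. The sketch below assumes the standard interval b ± t·s_b with n - 2 = 12 degrees of freedom and the table value t_{0.025,12} ≈ 2.179; the symbol s_b for the slope's standard error is a notational choice.

```latex
\begin{align*}
s_b &= \frac{s_e}{\sqrt{S_{xx}}}
     = \frac{2.56450}{\sqrt{6802.7693}} \approx 0.03109, \\
b \pm t_{0.025,\,12}\, s_b
    &= -0.20939 \pm (2.179)(0.03109)
     = -0.20939 \pm 0.06775 \\
    &\approx (-0.2771,\ -0.1416).
\end{align*}
```

This agrees with the 95% Confidence Limits that SAS prints for iodine, up to rounding.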

13 Example: Hypothesis test
For the cetane data introduced on slide 3 (x = iodine value, y = cetane number, n = 14 biofuels):
f) Is the model useful (that is, is there a useful linear relationship between x and y)?

14 Example: Hypothesis test (SAS)
Same PROC REG output for the cetane data as on slide 10. The rows relevant to model utility are the Analysis of Variance F test (F Value 45.35, Pr > F <.0001) and the slope t test for iodine (t Value -6.73, Pr > |t| <.0001).
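The model utility test can be written out from the Parameter Estimates row for iodine. The hypotheses are stated here in terms of the slope β, assuming the usual two-sided test.

```latex
\begin{align*}
H_0 &: \beta = 0 \quad \text{versus} \quad H_a : \beta \neq 0, \\
t &= \frac{b}{s_b} = \frac{-0.20939}{0.03109} \approx -6.73,
   \qquad \mathrm{df} = n - 2 = 12, \\
t^2 &\approx 45.35 = F .
\end{align*}
```

With a P-value below 0.0001, H_0 is rejected at any conventional significance level, so the data indicate a useful linear relationship. In simple linear regression the F test and the slope t test are equivalent, which is why t² equals the F value.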

15 Summary Slide
Source               df       SS                      MS
Model (Regression)   1        SSR                     SSR / 1
Error                n - 2    SSE = SST – b S_xy      SSE / (n - 2)
Total                n - 1    SST = S_yy

16 Example: ANOVA (SAS)
Same PROC REG output for the cetane data as on slide 10. The Analysis of Variance portion is:
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        298.25443     298.25443     45.35   <.0001
Error             12         78.91986       6.57665
Corrected Total   13        377.17429

17 Example: Hypothesis test for ρ
For the cetane data introduced on slide 3 (x = iodine value, y = cetane number, n = 14 biofuels):
g) Is the model useful (that is, is there a useful linear relationship between x and y) using the population correlation coefficient?

18 Example: ANOVA (SAS)
Same PROC REG output for the cetane data as on slide 10. The values needed for the test of ρ are R-Square = 0.7908, n = 14, and the negative sign of the iodine slope estimate (-0.20939).
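The test of ρ can be worked from R-Square alone. The sketch assumes the standard t statistic for H_0: ρ = 0 and takes the sign of r from the negative slope estimate.

```latex
\begin{align*}
r &= -\sqrt{0.7908} \approx -0.8893, \\
t &= \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
   = \frac{-0.8893\,\sqrt{12}}{\sqrt{1-0.7908}}
   \approx -6.7, \qquad \mathrm{df} = 12 .
\end{align*}
```

Up to rounding this is the same statistic as the slope t value (-6.73), so the test of ρ = 0 leads to the same conclusion as the test of β = 0.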

19 Example: Hypothesis test for ρ (2)
In some locations, there is a strong association between the concentrations of two different pollutants. The following data consist of the concentrations of x = ozone (ppm) and y = secondary carbon (μg/m³).
x: 0.066  0.088  0.120  0.050  0.162  0.186  0.057  0.100
y:   4.6   11.6    9.5    6.3   13.8   15.4    2.5   11.8
x: 0.112  0.055  0.154  0.074  0.111  0.140  0.071  0.110
y:   8.0    7.0   20.6   16.6    9.2   17.9    2.8   13.0

20 Example: Hypothesis test for ρ (2)
[Figure: plot of y against x]


22 Using the population correlation coefficient, is this model useful? The summary statistics are:

23 Example: Hypothesis test for ρ (2)
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        222.47934     222.47934     14.69   0.0018
Error             14        212.05816      15.14701
Corrected Total   15        434.53750
Root MSE          3.89192   R-Square   0.5120
Dependent Mean   10.66250   Adj R-Sq   0.4771
Coeff Var        36.50097
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.99801          2.70292      0.37     0.7175
x            1             93.37670         24.36448      3.83     0.0018
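The sample correlation and its t statistic can be computed directly from the 16 ozone and secondary carbon pairs on slide 19. This is a minimal sketch using only the standard library; the function name `correlation_t_test` is illustrative, and the P-value is left to a t table or to the SAS output above.

```python
import math

# Data from slide 19: x = ozone (ppm), y = secondary carbon (ug/m^3).
x = [0.066, 0.088, 0.120, 0.050, 0.162, 0.186, 0.057, 0.100,
     0.112, 0.055, 0.154, 0.074, 0.111, 0.140, 0.071, 0.110]
y = [4.6, 11.6, 9.5, 6.3, 13.8, 15.4, 2.5, 11.8,
     8.0, 7.0, 20.6, 16.6, 9.2, 17.9, 2.8, 13.0]

def correlation_t_test(x, y):
    """Return (r, t) where t tests H0: rho = 0 with n - 2 df."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = s_xy / math.sqrt(s_xx * s_yy)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return r, t

r, t = correlation_t_test(x, y)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, t = {t:.2f} on {len(x) - 2} df")
```

The squared correlation should match R-Square (0.5120) and t should match the slope t value (3.83), whose two-sided P-value SAS reports as 0.0018.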

24 Example: Hypothesis test for ρ (2)
The transcript shows the cetane PROC REG output here, identical to slide 10 (Dependent Variable: cetane, R-Square 0.7908, iodine slope -0.20939 with t Value -6.73).

25 Example: Hypothesis test for ρ (2)


27 Multiple Linear Regression http://www.vias.org/science_cartoons/multilin_regression.html

28 Example: Multiple Linear Regression
It is important to know how long a tool will last (min) in an industrial setting. The cutting tool in this study is used to cut a particular type and size of cold-rolled steel. The predictors of interest are x₁ = cutting speed (feet/min), x₂ = feed rate (in/revolution) and x₃ = depth of cut (in). The model is y = 101.765 – 0.0958 x₁ – 667.972 x₂ – 472.304 x₃ + e
a) What is the mean life of a tool used to cut to a depth of 0.03 in at a cutting speed of 450 feet/min and a feed rate of 0.01 in/revolution?
b) What is the interpretation of β₁ = -0.0958? Of β₂ = -667.972? Of β₃ = -472.304?
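Part (a) is a direct substitution into the model, and part (b) can be illustrated by changing one predictor while holding the others fixed. The sketch below uses the coefficients from the slide; the function name `mean_tool_life` is hypothetical.

```python
# Coefficients from the slide's model for tool life (min).
INTERCEPT = 101.765
B_SPEED = -0.0958    # per foot/min of cutting speed
B_FEED = -667.972    # per in/revolution of feed rate
B_DEPTH = -472.304   # per inch of depth of cut

def mean_tool_life(speed, feed, depth):
    """Mean tool life at the given predictor values (hypothetical helper)."""
    return INTERCEPT + B_SPEED * speed + B_FEED * feed + B_DEPTH * depth

# Part (a): speed 450 feet/min, feed 0.01 in/rev, depth 0.03 in.
life = mean_tool_life(speed=450, feed=0.01, depth=0.03)
print(f"mean life = {life:.2f} min")

# Part (b): raising speed by 1 foot/min with feed and depth held fixed
# changes the mean life by the speed coefficient.
delta = mean_tool_life(451, 0.01, 0.03) - mean_tool_life(450, 0.01, 0.03)
print(f"change per extra foot/min = {delta:.4f} min")
```

The mean life at these settings is about 37.8 min. The same one-at-a-time reading applies to the feed and depth coefficients, although a one-unit change in feed (1 in/revolution) or depth (1 in) is far outside the range of practical settings, so those slopes are better read per 0.01 unit.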

29 Example: Polynomial Regression
Suppose the mean daily peak load (MW) for a power plant and the maximum outdoor temperature (°F) for a sample of 10 days are given below.
a) What is the estimated regression line using a quadratic regression model? Besides the equation of the line, include the values of adj. r² and s_e.
b) Using the line, predict the required peak power if the temperature is 98 °F.
x_i (°F):   95   82   90   81   99  100   93   95   93   97
y_i (MW):  214  152  156  129  254  266  210  204  213  150

30 Example: Polynomial Regression (SAS)
data newpower;
  set power;
  temp2 = temp*temp;
proc reg data=newpower;
  model load=temp temp2;
  output out=fit r=res;
run;
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2            18089    9044.26725     53.88   <.0001
Error              7       1175.06549     167.86650
Corrected Total    9            19264
Root MSE          12.95633   R-Square   0.9390
Dependent Mean   194.80000   Adj R-Sq   0.9216
Coeff Var          6.65109
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           1784.18833        944.12303      1.89     0.1007
temp         1            -42.38624         21.00079     -2.02     0.0833
temp2        1              0.27216          0.11634      2.34     0.0519

31 Example: Polynomial Regression (cont)
b) Using the line, predict the required peak power if the temperature is 98 °F.
model       line                                      adj r²   s_e
linear      ŷ = -419.8 + 6.7 temp                     0.88     16.18
quadratic   ŷ = 1784 – 42.4 temp + 0.27 temp²         0.92     12.96
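For part (b), the fitted equations can be evaluated at 98 °F. The sketch below uses the full-precision quadratic coefficients from the SAS Parameter Estimates on slide 30, because the rounded coefficients in the table above lose too much accuracy once temp² is near 9600. The linear coefficients are available only in rounded form, so that prediction is approximate. Both function names are illustrative.

```python
def quadratic_load(temp):
    """Quadratic fit, coefficients as printed by SAS on slide 30."""
    return 1784.18833 - 42.38624 * temp + 0.27216 * temp ** 2

def linear_load(temp):
    """Linear fit, using the rounded coefficients shown on slide 31."""
    return -419.8 + 6.7 * temp

temp = 98
print(f"quadratic: {quadratic_load(temp):.1f} MW")
print(f"linear:    {linear_load(temp):.1f} MW")
```

The quadratic model predicts a peak load of roughly 244 MW at 98 °F, and the linear model roughly 237 MW. Since the quadratic fit has the higher adj r² and the smaller s_e, its prediction is the one the comparison table favors.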

32 Residual Plots

33 Interaction Effect

34 I love statistics! Thank you for not eating me!

35 Example: Multiple Regression Qualitative Predictors
A study is conducted to determine the effects of x₁ = company size and x₂ = the presence (1) or absence (0) of a safety program on y = the number of work hours lost due to work-related accidents (thousands). Twenty companies with no active safety program and 20 companies with an active safety program were randomly chosen. The SAS file (qualpred.txt) is on the class notes web site. The estimated regression line is
ŷ = 31.6244 + 0.01428 x₁ – 58.0779 x₂
What are the interpretations of β₁ = 0.01428 and β₂ = -58.0779?
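One way to read the indicator coefficient is to write the fitted line separately for each value of x₂. This is a sketch of that substitution, using only the coefficients given on the slide.

```latex
\begin{align*}
x_2 = 0 \ (\text{no safety program}): \quad
  \hat{y} &= 31.6244 + 0.01428\,x_1, \\
x_2 = 1 \ (\text{safety program}): \quad
  \hat{y} &= (31.6244 - 58.0779) + 0.01428\,x_1
           = -26.4535 + 0.01428\,x_1 .
\end{align*}
```

The two lines are parallel. For companies of the same size, the estimated mean number of lost work hours is 58.0779 thousand lower with a safety program, and in either group each additional unit of company size adds an estimated 0.01428 thousand lost hours.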

36 Conceptual Understanding
[Diagram labels: X1, X2, X3, Total Variation of Y]

37 ANOVA table - MLR
Source               df           SS                 MS
Model (Regression)   k            SSM (from data)    SSM / k
Error                n – k - 1    SSE (from data)    SSE / (n – k - 1)
Total                n - 1        SST (from data)

38 Example: Multiple Linear Regression
It is important to know how long a tool will last (min) in an industrial setting. The cutting tool in this study is used to cut a particular type and size of cold-rolled steel. The predictors of interest are x₁ = cutting speed (feet/min), x₂ = feed rate (in/revolution) and x₃ = depth of cut (in).
a) Is there a useful linear relationship between the cutting tool lifetime and the predictors?

39 Example: MLR (cont)
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3       2743.82814     914.60938     20.93   <.0001
Error             20        874.13019      43.70651
Corrected Total   23       3617.95833
Root MSE          6.61109   R-Square   0.7584
Dependent Mean   38.54167   Adj R-Sq   0.7222
Coeff Var        17.15310
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            101.76536          8.33310     12.21     <.0001
speed        1             -0.09578          0.01426     -6.72     <.0001
feed         1           -667.97241        386.23081     -1.73     0.0991
depth        1           -472.30426        161.81434     -2.92     0.0085
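The model utility F test for the three-predictor model can be reproduced from the Analysis of Variance rows. The sketch assumes the usual hypotheses for the overall test, with k = 3 predictors and n = 24 observations (Corrected Total DF = 23).

```latex
\begin{align*}
H_0 &: \beta_1 = \beta_2 = \beta_3 = 0
  \quad \text{versus} \quad
  H_a : \text{at least one } \beta_i \neq 0, \\
F &= \frac{\mathrm{MSM}}{\mathrm{MSE}}
   = \frac{2743.82814 / 3}{874.13019 / 20}
   = \frac{914.60938}{43.70651} \approx 20.93,
   \qquad \mathrm{df} = (3,\ 20), \\
R^2 &= \frac{\mathrm{SSM}}{\mathrm{SST}}
     = \frac{2743.82814}{3617.95833} \approx 0.7584 .
\end{align*}
```

With Pr > F below 0.0001, H_0 is rejected: at least one of speed, feed, and depth has a useful linear relationship with tool life.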

40 Example: MLR (cont)
Same PROC REG output for the full tool-life model as on slide 39 (F Value 20.93, Pr > F <.0001, R-Square 0.7584, Adj R-Sq 0.7222).

41 Conceptual Understanding
[Diagram labels: X1, X2, X3, Total Variation of Y]

42 Example: MLR (backwards elimination)
Same PROC REG output for the full tool-life model as on slide 39. Among the predictors, feed has the largest P-value (t Value -1.73, Pr > |t| 0.0991), so it is the first candidate for removal.

43 Example: MLR (backwards elimination) (cont)
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       2613.09992    1306.54996     27.30   <.0001
Error             21       1004.85841      47.85040
Corrected Total   23       3617.95833
Root MSE          6.91740   R-Square   0.7223
Dependent Mean   38.54167   Adj R-Sq   0.6958
Coeff Var        17.94784
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             95.88869          7.96137     12.04     <.0001
speed        1             -0.09543          0.01492     -6.40     <.0001
depth        1           -500.32482        168.46077     -2.97     0.0073

44 Example: MLR (backwards elimination) (cont)
           full                                                     w/o feed
line       ŷ = 101.77 – 0.096 speed – 667.97 feed – 472.30 depth    ŷ = 95.89 – 0.095 speed – 500.32 depth
R²         0.7584                                                   0.7223
adj R²     0.7222                                                   0.6958
ANOVA table (Source, DF, SS, MS)
full:       Model 3  2743.83   914.61    Error 20   874.13  43.71    Total 23  3617.96
w/o feed:   Model 2  2613.10  1306.55    Error 21  1004.86  47.85    Total 23  3617.96

45 Example: MLR (backwards elimination) (cont)
                  full      full P      w/o feed   w/o feed P
F test            20.93     <0.0001     27.30      <0.0001
t test: speed     -6.72     <0.0001     -6.40      <0.0001
t test: feed      -1.73     0.0991
t test: depth     -2.92     0.0085      -2.97      0.0073
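The comparison between the full model and the model without feed can be checked from the two ANOVA tables. The sketch below recomputes adjusted R² for each model and the partial F statistic for dropping feed. The partial F test is a standard companion to backwards elimination rather than something shown on the slides, and the function names are illustrative. n = 24 is taken from the Corrected Total DF of 23.

```python
N = 24                 # observations (Corrected Total DF = 23)
SST = 3617.95833       # Corrected Total sum of squares
SSE_FULL = 874.13019   # Error SS, model with speed, feed, depth (k = 3)
SSE_RED = 1004.85841   # Error SS, model with speed, depth (k = 2)

def adjusted_r2(sse, sst, n, k):
    """Adjusted R^2 for a model with k predictors."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

def partial_f(sse_reduced, sse_full, df_dropped, df_error_full):
    """F statistic for dropping df_dropped predictors from the full model."""
    mse_full = sse_full / df_error_full
    return ((sse_reduced - sse_full) / df_dropped) / mse_full

adj_full = adjusted_r2(SSE_FULL, SST, N, k=3)
adj_red = adjusted_r2(SSE_RED, SST, N, k=2)
f_feed = partial_f(SSE_RED, SSE_FULL, df_dropped=1, df_error_full=20)

print(f"adj R^2 full = {adj_full:.4f}, without feed = {adj_red:.4f}")
print(f"partial F for feed = {f_feed:.2f} on (1, 20) df")
```

The partial F for a single dropped predictor equals the square of that predictor's t statistic in the full model, here about (-1.73)² ≈ 2.99, so it carries the same P-value of 0.0991.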

