1 Experimental Statistics - week 14 Multiple Regression – miscellaneous topics.

Presentation transcript:

1 Experimental Statistics - week 14 Multiple Regression – miscellaneous topics

2 Polynomial Regression:
- we looked at this briefly in Lab
- basically a multiple regression where the independent variables are powers of a single independent variable
- use SAS to compute the independent variables $x^2, x^3, \ldots, x^p$
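A minimal sketch of computing the powers in a DATA step and fitting them as an ordinary multiple regression. The data set name `mydata` and the variable names `y` and `x` are assumptions for illustration, not names from the lab.

```sas
/* Hypothetical input data set `mydata` with response y and predictor x */
data poly;
  set mydata;
  x2 = x**2;   /* squared term */
  x3 = x**3;   /* cubic term   */
run;

proc reg data=poly;
  model y = x x2 x3;   /* cubic polynomial fitted as a multiple regression */
run;
quit;
```

Higher powers follow the same pattern; each added power is just one more independent variable in the MODEL statement.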

3 Outlier Detection
- there are tests for outliers
- throwing away outliers should technically be done only when there is evidence that the values “do not belong”
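One common way to flag candidate outliers in SAS is to save studentized deleted residuals from PROC REG and list the large ones. This is a sketch, not necessarily the test the lecture had in mind; the data set `mydata`, the variables `y` and `x`, and the cutoff of 3 are all assumptions.

```sas
/* Hypothetical input data set `mydata` with response y and predictor x */
proc reg data=mydata;
  model y = x / r influence;                /* residual and influence diagnostics */
  output out=diag rstudent=rstud cookd=cd;  /* studentized deleted residuals, Cook's D */
run;
quit;

proc print data=diag;
  where abs(rstud) > 3;                     /* rule-of-thumb cutoff (an assumption) */
run;
```

A flagged observation is only a candidate for closer inspection; as the slide says, removal still needs evidence that the value does not belong.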

4 Use of Dummy Variables in Regression

5 Example 6.1 (text): Does a drug retain its potency after 1 year of storage?
2 groups: 1) fresh product 2) product stored for 1 year
n = 10 observations from each group (independent samples)
Data columns: Fresh, Stored (values not shown)
Variable measured is the potency reading
Question: How would you compare the groups?

6 1-Factor ANOVA Model: $y_{ij} = \mu_i + \epsilon_{ij}$, where $\mu_1$ = mean of fresh product and $\mu_2$ = mean of 1-year-old product. We want to test: $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$

7
data ott269;
  input type$ y;
  datalines;
F 10.2
F 10.5
F 10.3
F 10.8
(rows not shown)
S 9.6
S 9.8
S 9.9
;
proc glm;
  class type;
  model y=type;
  means type/lsd;
  title 'ANOVA -- Potency Data - page 269 (t-test)';
run;
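The title mentions a t-test: with only two groups, the 1-factor ANOVA F-test is equivalent to the pooled two-sample t-test (F equals t squared). A sketch of running that t-test directly, assuming the `ott269` data set above has been read in with all of its rows:

```sas
proc ttest data=ott269;
  class type;   /* F = fresh, S = stored one year */
  var y;        /* potency reading */
run;
```

The "Pooled" line of the PROC TTEST output is the one that corresponds to the ANOVA result.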

8 [SAS output: ANOVA -- Potency Data - page 269 (t-test). The GLM Procedure. Class Level Information: class type, 2 levels, values F S. Dependent Variable: y. ANOVA table (Model, Error, Corrected Total), fit statistics (R-Square, Coeff Var, Root MSE, mean), and Type I and Type III SS for type; numeric values not shown.]

9 Since p = .0005 we reject $H_0$ and conclude that storage time does make a difference.
t Tests (LSD) for y. NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.
Alpha = 0.05; Error Degrees of Freedom = 18 (Error Mean Square, Critical Value of t, and Least Significant Difference not shown)
Means with the same letter are not significantly different: type F has grouping letter A, type S has grouping letter B.
Fresh product has higher potency on average. Also – estimated difference in means = (fresh mean) – 9.83 = .54

10 Regression analysis – requires the independent variables to be quantitative
Let’s consider recoding the group membership variable (i.e. F and S) into the numeric scores 0 = fresh, 1 = stored one year, and running a regression analysis with this new “dummy” variable as a “quantitative” independent variable - let’s call the “dummy” variable x.
Regression Model: $y = \beta_0 + \beta_1 x + \epsilon$

11
data ott269;
  input x y;
  datalines;
(data lines not shown)
;
proc reg;
  model y=x;
  title 'Regression Analysis -- Potency Data - page 269';
run;
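Instead of retyping the data with 0/1 codes, the dummy variable could be derived from the character variable used in the ANOVA run. A sketch, assuming the earlier version of `ott269` (with `type` equal to F or S) is the one in memory; the output data set name `ott269reg` is made up for illustration.

```sas
data ott269reg;
  set ott269;          /* assumes the ANOVA version with character variable type */
  x = (type = 'S');    /* logical expression gives 0 = fresh, 1 = stored one year */
run;

proc reg data=ott269reg;
  model y = x;
run;
quit;
```

In SAS a comparison such as `(type = 'S')` evaluates to 1 when true and 0 when false, which is exactly the dummy coding described on the previous slide.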

12 [SAS output: The REG Procedure. Dependent Variable: y. Number of Observations Read 20, Number of Observations Used 20. Analysis of Variance table, fit statistics, and Parameter Estimates for Intercept (Pr > |t| <.0001) and x; other numeric values not shown.] Regression Equation:

13 Note: the regression model is $y = \beta_0 + \beta_1 x + \epsilon$, where x = 0 for fresh and x = 1 for stored one year. On the basis of this model:
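The equations from this slide did not survive in the transcript. Under the 0/1 coding above, the interpretation it points to can be worked out as follows (a reconstruction, using $\mu_1$ for the fresh mean and $\mu_2$ for the stored mean):

```latex
\begin{align*}
x = 0 \ (\text{fresh}):  \quad & E(y) = \beta_0 = \mu_1 \\
x = 1 \ (\text{stored}): \quad & E(y) = \beta_0 + \beta_1 = \mu_2 \\
\Rightarrow \quad & \beta_1 = \mu_2 - \mu_1
\end{align*}
```

So testing $H_0: \beta_1 = 0$ in the regression is the same as testing $H_0: \mu_1 = \mu_2$ in the ANOVA, and the estimated slope is the stored mean minus the fresh mean.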

14 Dummy Variables with More than 2 Groups Example: Balloon Data - 4 groups

Balloon Data
Col observation number; Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue); Col inflation time in seconds
“Research Question”: Is the average time required to inflate the balloons the same for each color?
Recall:

16 Analysis using 1-factor ANOVA Model with 4 Groups
[SAS output: GLM Procedure, ANOVA --- Balloon Data, Dependent Variable: time. ANOVA table, fit statistics, and Type I SS for color; numeric values not shown.]
LSD Results (listed in output order): yellow and orange share grouping letter A; pink and blue share grouping letter B.

17 Dummy Variables For 4 groups -- 3 dummy variables needed.

Balloon Data Set with Dummy Variables:
Col observation number; Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue); Col. 5 X1; Col. 6 X2; Col. 7 X3; Col inflation time in seconds

19 [SAS output: ANOVA --- Balloon Data using Dummy Variables. The REG Procedure. Analysis of Variance table, fit statistics, and Parameter Estimates for Intercept (Pr > |t| <.0001), x1, x2, and x3; other numeric values not shown.]

20 According to the Model:
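The equations for this slide are missing from the transcript. Assuming the coding the deck uses elsewhere (pink is the baseline with all dummies 0, and $x_1$, $x_2$, $x_3$ flag yellow, orange, and blue), the model and the group means it implies would be:

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \\
\mu_{\text{pink}}   &= \beta_0 \\
\mu_{\text{yellow}} &= \beta_0 + \beta_1 \\
\mu_{\text{orange}} &= \beta_0 + \beta_2 \\
\mu_{\text{blue}}   &= \beta_0 + \beta_3
\end{align*}
```

Each slope is therefore a difference from the pink mean, so the t-test on $\beta_1$ compares yellow with pink, and similarly for the other two.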

21 Problem 1

22 Recall: Mixed Model
Multiple Comparisons for Fixed Effect (Inspection Level) -- Use MSAB in place of MSE, where
▪ N denotes the # of observations involved in the computation of a marginal mean
▪ v denotes the df associated with AB interaction
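The formula itself is not in the transcript. The usual form of the least significant difference with MSAB substituted for MSE, using the slide's N and v, would be (a reconstruction under that assumption):

```latex
\[
\mathrm{LSD} = t_{\alpha/2,\,v}\,\sqrt{\frac{2\,MS_{AB}}{N}}
\]
```

Two marginal means of the fixed effect are declared different when their absolute difference exceeds this value.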

23 General Rule: When comparing means using a multiple comparison procedure (e.g. LSD, Bonferroni, etc.), use the MS used in the denominator of the associated F-test.
Note: By default, SAS gives multiple comparison results using MSE.

24 Ewe Data – problem 1
PROC GLM;
  class group ewe week;
  TITLE 'Ewe Study';
  model milk=group ewe(group) week group*week;
  random ewe(group)/test;
  means group week/lsd;
  output out=newe r=resmilk;
RUN;
[SAS output: The GLM Procedure, Dependent Variable: milk. ANOVA table with Model Pr > F <.0001; Type I SS with group <.0001, ewe(group) <.0001, and rows for week and group*week; other numeric values not shown.]

25 Ewe Study, The GLM Procedure
Source: Type III Expected Mean Square
group: Var(Error) + 6 Var(ewe(group)) + Q(group,group*week)
ewe(group): Var(Error) + 6 Var(ewe(group))
week: Var(Error) + Q(week,group*week)
group*week: Var(Error) + Q(group*week)
Tests of Hypotheses for Mixed Model Analysis of Variance, Dependent Variable: milk
- group * is tested with Error: MS(ewe(group))
- ewe(group) (Pr > F <.0001), week *, and group*week are tested with Error: MS(Error)
* This test assumes one or more other fixed effects are zero.
(DF, Type III SS, Mean Square, and remaining F and p values not shown)

27 t Tests (LSD) for Group Differences
NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.
Alpha = 0.05; Error Degrees of Freedom = 30 (Error Mean Square, Critical Value of t, and Least Significant Difference not shown)
Means with the same letter are not significantly different. t Grouping for group: A B C (means and N not shown)

28 t Tests (LSD) for Group Differences (default output: Alpha = 0.05, Error Degrees of Freedom = 30)
Corrected: “Error” Degrees of Freedom = , “Error” Mean Square = , Critical Value of t = , Least Significant Difference = (values not shown)
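The corrected values can be computed by hand, but PROC GLM can also be told which mean square to use through the E= option of the MEANS statement. A sketch based on the ewe model above; splitting the MEANS statement in two is a choice made here, not something shown in the lecture.

```sas
proc glm;
  class group ewe week;
  model milk = group ewe(group) week group*week;
  random ewe(group) / test;
  means group / lsd e=ewe(group);  /* group comparisons use MS(ewe(group)) */
  means week / lsd;                /* week is tested against MS(Error), the default */
run;
quit;
```

This follows the general rule: group is tested against MS(ewe(group)) in the mixed-model F-test, so the same mean square and its degrees of freedom go into the LSD for group.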

29 Ewe Data [Figures: residuals for milk; interaction plot – Milk Production by Week] Why non-normal?

30 Ewe Data – Box Plots

31 Problem 2

32 Kidney Data [Plot panels: Original Model; Log Model; Outliers Removed]

33 Kidney Data: Original Model, $R^2 = .855$; Log Model, $R^2 = .866$

34 Kidney Data: Original Model, $R^2 = .855$; Outliers Removed, $R^2 = .871$

35 Kidney Data: Log Model, $R^2 = .866$; Log Model – Outliers Removed, $R^2 = .901$

36 Problem 3

37 Survival Data [Plot panels: Original Variables; Log Survival vs Other Original Variables]

38 Survival Data [Plot panels: Original Variables; Survival vs Square of Independent Variables]

39 Adjusted R-Square selection output (Number in Model, Adjusted R-Square, and R-Square values not shown; models listed in output order)
Dependent Variable: Log(Survival)
- clot prog enzyme liver
- clot prog enzyme
- clot prog enzyme liver age
- clot prog enzyme age
- prog enzyme liver
- prog enzyme liver age
- prog enzyme
- prog enzyme age
- clot enzyme liver age
Dependent Variable: Survival
- clot prog enzyme liver
- clot prog enzyme liver age
- clot prog enzyme
- clot prog enzyme age
- prog enzyme liver
- prog enzyme liver age
- prog enzyme
- prog enzyme age

40 Survival Data – Log(Survival) Model without Age

41 Grades
“Conditional” – under assumption of good performance on next Thursday’s lab
Final Exam
-- optional (scheduled for 8:00 AM – 11:00 AM Friday, May 6)
-- “in class” exam
-- will be averaged in equally with the other 2 exams to comprise 75% of grade
-- can raise or lower final grade
From Syllabus: GRADE COMPUTATION: Exam Grades (75%), Daily Assignments (25%)

42 Dummy Variables
We showed that 1-factor ANOVA can be run using regression analysis with dummy variables.
Question: What’s the benefit?
Answer: Dummy variables can be mixed in with regular quantitative variables to give a combination of regression and ANOVA analyses.

43 Dummy Variables for 4 Groups: for 4 groups -- 3 dummy variables needed.
0, 0, 0 → group 1
1, 0, 0 → group 2
0, 1, 0 → group 3
0, 0, 1 → group 4
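The coding above can be generated in a DATA step rather than typed into the data file. A sketch for the balloon data, assuming a data set named `balloon` with a numeric `color` (1=pink, 2=yellow, 3=orange, 4=blue) and the response `time`; both data set names are hypothetical.

```sas
data balloon_dummy;
  set balloon;          /* hypothetical data set with color (1-4) and time */
  x1 = (color = 2);     /* 1 for yellow, else 0 */
  x2 = (color = 3);     /* 1 for orange, else 0 */
  x3 = (color = 4);     /* 1 for blue,   else 0 */
run;                    /* pink (color = 1) is the baseline: x1 = x2 = x3 = 0 */

proc reg data=balloon_dummy;
  model time = x1 x2 x3;
run;
quit;
```

With pink as group 1, each coefficient estimates the difference between that color's mean inflation time and the pink mean.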

44 Dummy Variables for 4 Groups:

49 Parameter Estimates (Estimate, Standard Error, t Value, and most Pr > |t| values not shown)
Intercept: Pr > |t| <.0001
x1: conclude “pink” ≠ “yellow”
x2: conclude “pink” ≠ “orange”
x3: we cannot conclude “pink” and “blue” are different

50 Dummy Variables for 4 Groups:

51 LSD Results: yellow and orange share grouping letter A; pink and blue share grouping letter B (means and N not shown)

52 Recall: There was an issue with the order in which balloons were inflated
- lab assistant “improved”
- we tried to account for this by randomizing run order
[Plot: inflation time vs. run order]

53 Another Strategy: Put “run order” in the model.
PROC REG;
  MODEL y=x1 x2 x3 id;
  TITLE 'ANOVA --- Balloon Data using Dummy Variables and Run Order';
RUN;
Recall: t-tests in a MLR model test the effects of individual independent variables while all other independent variables stay constant
- in this example, we can test for color effects while “adjusting for” or taking out the effect of run order
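The same adjustment can be sketched in PROC GLM, which builds the dummy variables itself when color is named in a CLASS statement. The data set name `balloon` and the variable names `color`, `time`, and `id` (run order) are assumptions.

```sas
proc glm data=balloon;
  class color;
  model time = color id / solution;  /* id = run order, treated as quantitative */
  lsmeans color / pdiff;             /* color means adjusted for run order */
run;
quit;
```

Note that PROC GLM uses the last class level (blue) as its reference level, so the individual coefficients from SOLUTION will not match the pink-baseline estimates from PROC REG, although the overall test for color is the same.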

54 PROC REG ANOVA Table – Balloon Data
[SAS output: Analysis of Variance with Model Pr > F <.0001; Parameter Estimates for Intercept (Pr > |t| <.0001), x1, x2, x3, and id; other numeric values not shown.]
We can see that:
- pink is still significantly different from yellow and orange and not significantly different from blue
- there is a significant “run order” effect
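The individual t-tests compare each color with pink. For an overall test of color after adjusting for run order, or for a comparison that does not involve pink, the TEST statement in PROC REG can be used. A sketch, assuming a data set (here called `balloon_dummy`, a hypothetical name) that already holds y, x1, x2, x3, and id:

```sas
proc reg data=balloon_dummy;
  model y = x1 x2 x3 id;
  color:  test x1 = 0, x2 = 0, x3 = 0;  /* overall color effect, adjusted for run order */
  yel_or: test x1 - x2 = 0;             /* yellow vs. orange */
run;
quit;
```

The first test has 3 numerator degrees of freedom and plays the role of the ANOVA F-test for color with run order held constant.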

55 [SAS output: ANOVA --- Balloon Data with No Randomization. The GLM Procedure, Dependent Variable: time. ANOVA table, fit statistics, and Type I SS for color. t Tests (LSD) for time: Alpha = 0.05, Error Degrees of Freedom = 28; grouping letters A, B, and C appear, with some means carrying two letters; other numeric values not shown.]

56 Non-randomized Balloon Data [SAS output: Analysis of Variance table, fit statistics, and Parameter Estimates for Intercept (Pr > |t| <.0001), x1, x2, and x3; other numeric values not shown.]

57 Non-randomized Balloon Data [SAS output: Analysis of Variance table, fit statistics, and Parameter Estimates for Intercept (Pr > |t| <.0001), x1, x2, x3, and id; other numeric values not shown.]

58 Survival Data
DATA survival;
  INPUT clot prog enzyme liver age gender alch1 alch2 survival;
  DATALINES;
(data lines not shown)
;
PROC reg;
  MODEL survival=clot prog enzyme liver age/selection=adjrsq;
  output out=new r=ressurv p=predsurv;
RUN;
PROC reg;
  MODEL lgsurv=clot prog enzyme liver age/selection=adjrsq;
  output out=new r=ressvlg p=predsvlg;
RUN;
Gender: 0=male, 1=female
Alcohol Use: alch1 alch2
None: 0 0
Moderate: 1 0
Heavy: 0 1
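The second PROC REG uses `lgsurv`, which the DATA step shown does not create. A sketch of how it might be computed; the base of the logarithm is not stated in the slides, so the natural log here is an assumption, and the data set name `survival2` is made up.

```sas
data survival2;
  set survival;
  lgsurv = log(survival);   /* natural log; log10(survival) if base 10 was intended */
run;
```

The log-model PROC REG would then need to read this data set, for example with `PROC reg data=survival2;`.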

59 Adjusted R-Square Selection Method (Number in Model, Adjusted R-Square, and R-Square values not shown; models listed in output order, variable names as they appear in the transcript)
Dependent Variable: survival
- clot prog enzyme liver alch1 alch
- clot prog enzyme liver alch
- clot prog enzyme liver age alch1 alch
- clot prog enzyme liver gender alch1 alch
- clot prog enzyme liver age alch
- clot prog enzyme liver gender alch
- clot prog enzyme liver age gender alch1 alch
- clot prog enzyme liver age gender alch
- clot prog enzyme alch1 alch2
Dependent Variable: log(survival)
- clot prog enzyme liver gender alch
- clot prog enzyme liver gender alch1 alch
- clot prog enzyme liver alch
- clot prog enzyme liver age gender alch
- clot prog enzyme liver alch1 alch
- clot prog enzyme liver age gender alch1 alch
- clot prog enzyme liver age alch
- clot prog enzyme liver age alch1 alch2

60 [SAS output: Dependent Variable: lgsurv. Number of Observations Read 108, Number of Observations Used 108. Analysis of Variance with Model Pr > F <.0001. Parameter Estimates: Intercept <.0001, prog <.0001, enzyme <.0001, liver <.0001, gender (p-value not shown), alch <.0001; other numeric values not shown.]

61 Alcohol Use
None: (0,0) mean survival =
Moderate: (1,0) mean survival =
Severe: (0,1) mean survival =