Experimental Statistics - week 13: Multiple Regression Miscellaneous Topics

1 Experimental Statistics - week 13 Multiple Regression Miscellaneous Topics

2 Setting: We have a dependent variable Y and several candidate independent variables. Question: Should we use all of them?

3 Why do we run Multiple Regression?
1. Obtain estimates of individual coefficients in a model (+ or -, etc.)
2. Screen variables to determine which have a significant effect on the model
3. Arrive at the most effective (and efficient) prediction model

4 The problem: Collinearity among the independent variables
-- high correlation between 2 independent variables
-- one independent variable nearly a linear combination of the other independent variables
-- etc.

5 Effects of Collinearity
- parameter estimates are highly variable and unreliable
- parameter estimates may even have the opposite sign from what is reasonable
- the overall F-test may be significant even though none of the individual t-tests are

Variable Selection Techniques: techniques for "being careful" about which variables are put into the model

6 Variable Selection Procedures
- Forward Selection
- Backward Elimination
- Stepwise
- Best Subset
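To make the first of these concrete, here is a hedged Python sketch of the greedy idea behind forward selection: at each step, enter the candidate variable that most reduces the residual sum of squares. A real forward selection also applies an F-to-enter stopping rule, which this toy version omits, and the data are simulated.

```python
import random

def ols_sse(rows, y):
    """Least squares via the normal equations (Gaussian elimination); returns the SSE."""
    n, k = len(rows), len(rows[0])
    aug = [[sum(rows[i][p] * rows[i][q] for i in range(n)) for q in range(k)]
           + [sum(rows[i][p] * y[i] for i in range(n))]
           for p in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, k):
            f = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= f * aug[c][j]
    beta = [0.0] * k
    for c in reversed(range(k)):
        beta[c] = (aug[c][k] - sum(aug[c][j] * beta[j] for j in range(c + 1, k))) / aug[c][c]
    return sum((y[i] - sum(rows[i][j] * beta[j] for j in range(k))) ** 2 for i in range(n))

def forward_order(cols, y, names):
    """Greedy forward ordering: at each step add the variable giving the smallest SSE."""
    chosen, remaining = [], list(range(len(cols)))
    while remaining:
        sse = {j: ols_sse([[1.0] + [cols[c][i] for c in chosen + [j]]
                           for i in range(len(y))], y)
               for j in remaining}
        best = min(sse, key=sse.get)
        chosen.append(best)
        remaining.remove(best)
    return [names[j] for j in chosen]

# simulated data: y depends strongly on x1, weakly on x2, not at all on x3
random.seed(2)
n = 40
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [3 * x1[i] + x2[i] + random.gauss(0, 0.5) for i in range(n)]

order = forward_order([x1, x2, x3], y, ["x1", "x2", "x3"])
print(order)  # x1, the strongest predictor, should enter first
```

Backward elimination runs the same machinery in reverse: start with all candidates and repeatedly drop the least useful one.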

7 Multiple Regression - Analysis Suggestions
1. Include only variables that make sense
2. Force important variables into a model
3. Be wary of variable selection results - especially forward selection
4. Examine pairwise correlations among variables
5. Examine pairwise scatterplots among variables
   - identify nonlinearity
   - identify unequal variance problems
   - identify possible outliers
6. Try transformations of variables for
   - correcting nonlinearity
   - stabilizing the variances
   - inducing normality of residuals

8 SPSS Output from INFANT Data Set

9 SPSS Output from CAR Data Set

10 Examples of Nonlinear Data “Shapes” and Linearizing Transformations

11 Exponential Transformation (Log-Linear)
Original model: y = β0 e^(β1 x) ε  (increasing when β1 > 0, decreasing when β1 < 0)
Transformed into: ln(y) = ln(β0) + β1 x + ln(ε)

12 Multiplicative Model (Log-Log)
Original model: y = β0 x^β1 ε
Transformed into: ln(y) = ln(β0) + β1 ln(x) + ln(ε)

13 Square Root Transformation
Model: y = β0 + β1 √x + ε  (shapes shown for β1 > 0 and β1 < 0)

14 Note:
- transforming Y using the log or square root transformation can help with unequal variance problems
- these transformations may also help induce normality
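A small simulated Python check of the first point (not from the slides): when the spread of Y grows with its mean, taking logs roughly equalizes the group standard deviations.

```python
import math
import random
import statistics

random.seed(3)
# two groups whose spread grows with their mean -- a classic unequal-variance pattern
low  = [random.gauss(10, 1) for _ in range(200)]
high = [random.gauss(100, 10) for _ in range(200)]

ratio_raw = statistics.stdev(high) / statistics.stdev(low)
ratio_log = (statistics.stdev([math.log(v) for v in high])
             / statistics.stdev([math.log(v) for v in low]))
print(f"sd ratio before log: {ratio_raw:.1f}   after log: {ratio_log:.2f}")
```

Before the transformation the high-mean group is roughly ten times as variable; after logging, the two spreads are close to equal.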

15 Scatterplot panels: hmpg vs hp; hmpg vs sqrt(hp); log(hmpg) vs hp; log(hmpg) vs log(hp)

16 Polynomial Regression:
- basically a multiple regression where the independent variables are powers of a single independent variable
- use SAS to compute the independent variables x^2, x^3, …, x^p
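The same idea can be sketched in plain Python (an illustrative sketch, not the course's SAS workflow): build the power columns x, x^2, …, x^p yourself and fit by ordinary least squares. The data here are exactly quadratic, so the fit recovers the generating coefficients.

```python
def poly_design(x, p):
    """Columns 1, x, x^2, ..., x^p -- the 'computed independent variables'."""
    return [[xi ** d for d in range(p + 1)] for xi in x]

def solve(A, rhs):
    """Solve A * b = rhs by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for j in range(c, m + 1):
                M[r][j] -= f * M[c][j]
    b = [0.0] * m
    for c in reversed(range(m)):
        b[c] = (M[c][m] - sum(M[c][j] * b[j] for j in range(c + 1, m))) / M[c][c]
    return b

def polyfit(x, y, p):
    """Least-squares polynomial fit via the normal equations X'X b = X'y."""
    X = poly_design(x, p)
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(p + 1)] for a in range(p + 1)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(y))) for a in range(p + 1)]
    return solve(XtX, Xty)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 + 3 * xi - xi ** 2 for xi in xs]        # exactly quadratic data
coefs = polyfit(xs, ys, 2)
print([round(c, 6) for c in coefs])             # recovers [2.0, 3.0, -1.0]
```

In practice higher-degree fits make the power columns highly collinear, which ties this topic back to the collinearity warnings earlier in the lecture.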

17 Outlier Detection
- there are tests for outliers
- throwing away outliers should technically be done only when there is evidence that the values "do not belong"

18 Use of Dummy Variables in Regression

19 Example 6.1 (text, page 269): Does a drug retain its potency after 1 year of storage?
2 groups: 1) fresh product; 2) product stored for 1 year
(n = 10 observations from each group -- independent samples)
Variable measured: potency reading
Question: How would you compare the groups?

20 1-Factor ANOVA Model
y_ij = μ_i + ε_ij, where
μ1 = mean of fresh product
μ2 = mean of 1-year old product
We want to test H0: μ1 = μ2 vs Ha: μ1 ≠ μ2.
We could use:
- independent groups t-test
- 1-factor ANOVA (with 2 levels of the factor)
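The equivalence of the two approaches can be checked numerically. The sketch below uses only the handful of potency readings that survived in this transcript (so it is illustrative, not the text's full n = 10 analysis): with 2 groups, the one-way ANOVA F statistic equals the square of the pooled two-sample t statistic.

```python
import math
import statistics

# the few potency readings that survived in the transcript (fresh vs stored)
fresh  = [10.2, 10.5, 10.3]
stored = [9.6, 9.8, 9.9]
na, nb = len(fresh), len(stored)
ma, mb = statistics.mean(fresh), statistics.mean(stored)

# pooled two-sample t statistic
sp2 = ((na - 1) * statistics.variance(fresh)
       + (nb - 1) * statistics.variance(stored)) / (na + nb - 2)
t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

# one-way ANOVA F statistic for the same two groups
grand = statistics.mean(fresh + stored)
ss_between = na * (ma - grand) ** 2 + nb * (mb - grand) ** 2
ss_within = sum((v - ma) ** 2 for v in fresh) + sum((v - mb) ** 2 for v in stored)
F = (ss_between / 1) / (ss_within / (na + nb - 2))

print(round(t ** 2, 6), round(F, 6))   # identical: with 2 groups, F = t^2
```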

21
data ott269;
   input type$ y;
   datalines;
F 10.2
F 10.5
F 10.3
F 10.8
...
S 9.6
S 9.8
S 9.9
;
proc glm;
   class type;
   model y = type;
   means type / lsd;
   title 'ANOVA -- Potency Data - page 269 (t-test)';
run;

22 ANOVA -- Potency Data - page 269 (t-test)

The GLM Procedure
Class Level Information
  Class   Levels   Values
  type    2        F S

Dependent Variable: y
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model
  Error
  Corrected Total

  R-Square   Coeff Var   Root MSE   potency Mean

  Source   DF   Type I SS     Mean Square   F Value   Pr > F
  type
  Source   DF   Type III SS   Mean Square   F Value   Pr > F
  type

23 Since p = .0005 we reject H0 and conclude that storage time does make a difference.

t Tests (LSD) for y
NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.
  Alpha                         0.05
  Error Degrees of Freedom      18
  Error Mean Square
  Critical Value of t
  Least Significant Difference

Means with the same letter are not significantly different.
  t Grouping   Mean    N    type
  A            10.37   10   F
  B            9.83    10   S

Fresh product has higher potency on average.
Also, estimated difference in means = 10.37 - 9.83 = .54

24 Regression analysis requires the independent variables to be quantitative.
Let's consider recoding the group membership variable (i.e. F and S) into the numeric scores:
0 = fresh
1 = stored one year
and running a regression analysis with this new "dummy" variable as a "quantitative" independent variable - let's call the "dummy" variable x.
Regression Model: y = β0 + β1 x + ε

25
data ott269;
   input x y;
   datalines;
...
;
proc reg;
   model y = x;
   title 'Regression Analysis -- Potency Data - page 269';
run;

26 The REG Procedure
Dependent Variable: y
Number of Observations Read   20
Number of Observations Used   20

Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model
  Error
  Corrected Total

  Root MSE         R-Square
  Dependent Mean   Adj R-Sq
  Coeff Var

Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept                                                        <.0001
  x

Regression Equation: ŷ = 10.37 - 0.54x

27 Note: for the regression model y = β0 + β1 x + ε with the dummy variable x:
  x = 0 (fresh) gives E(y) = β0, the fresh-group mean
  x = 1 (stored) gives E(y) = β0 + β1, the stored-group mean
On the basis of this model, the fitted values ŷ = 10.37 (at x = 0) and ŷ = 10.37 - 0.54 = 9.83 (at x = 1) match the two group means from the ANOVA.

28 Dummy Variables with More than 2 Groups Example: Balloon Data - 4 groups

Balloon Data
Col. - observation number
Col. 3 - color (1 = pink, 2 = yellow, 3 = orange, 4 = blue)
Col. - inflation time in seconds
"Research Question": Is the average time required to inflate the balloons the same for each color?
Recall:

30 Analysis using 1-factor ANOVA Model with 4 Groups

ANOVA --- Balloon Data (The GLM Procedure)
Dependent Variable: time
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model
  Error
  Corrected Total

  R-Square   Coeff Var   Root MSE   time Mean

  Source   DF   Type I SS   Mean Square   F Value   Pr > F
  color

LSD Results
  Grouping   Mean   N   color
  A                     (yellow)
  A
  A                     (orange)
  B                     (pink)
  B
  B                     (blue)

31 Dummy Variables
For 4 groups -- 3 dummy variables needed:
0, 0, 0 → group 1
1, 0, 0 → group 2
0, 1, 0 → group 3
0, 0, 1 → group 4

32 Dummy Variables for 4 Groups:
Model: y = β0 + β1 x1 + β2 x2 + β3 x3 + ε
The model says:
The mean for color 1 (i.e. x1 = 0, x2 = 0, x3 = 0) is β0 - notation: μ1 = β0
The mean for color 2 (i.e. x1 = 1, x2 = 0, x3 = 0) is β0 + β1 - notation: μ2 = β0 + β1
The mean for color 3 (i.e. x1 = 0, x2 = 1, x3 = 0) is β0 + β2 - notation: μ3 = β0 + β2
The mean for color 4 (i.e. x1 = 0, x2 = 0, x3 = 1) is β0 + β3 - notation: μ4 = β0 + β3
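A tiny Python helper (illustrative only; the function name is an assumption, not course material) that generates exactly this coding, with the first level serving as the all-zeros reference group:

```python
def dummy_codes(group, levels):
    """0/1 indicator coding: the first level is the reference group (all zeros)."""
    return [1 if group == lv else 0 for lv in levels[1:]]

levels = ["pink", "yellow", "orange", "blue"]   # colors 1-4 in the balloon data
for g in levels:
    print(g, dummy_codes(g, levels))
# pink [0, 0, 0] / yellow [1, 0, 0] / orange [0, 1, 0] / blue [0, 0, 1]
```

With this coding, each non-reference coefficient is that group's mean minus the reference-group mean, which is what the slide's μ2 = β0 + β1 (etc.) expresses.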

33

34 Dummy Variables for 4 Groups:

Balloon Data Set with Dummy Variables:
Col. - observation number
Col. 3 - color (1 = pink, 2 = yellow, 3 = orange, 4 = blue)
Col. 5 - x1
Col. 6 - x2
Col. 7 - x3
Col. - inflation time in seconds

36 ANOVA --- Balloon Data using Dummy Variables

The REG Procedure
Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model
  Error
  Corrected Total

  Root MSE   R-Square   Dependent Mean   Adj R-Sq   Coeff Var

Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept                                                        <.0001
  x1
  x2
  x3

37 Parameter Estimates (interpretation)
  x1 significant (i.e. "pink" ≠ "yellow")
  x2 significant (i.e. conclude "pink" ≠ "orange")
  x3 not significant (i.e. we cannot conclude "pink" and "blue" are different)

Recall LSD Results:
  Grouping   Mean   N   color
  A                     (yellow)
  A
  A                     (orange)
  B                     (pink)
  B
  B                     (blue)

38 Dummy Variables
We showed that 1-factor ANOVA can be run using regression analysis with dummy variables.
Question: What's the real benefit of dummy variables?
Answer: Dummy variables can be mixed in with quantitative independent variables to give a combination of regression and ANOVA analyses.

39 Survival Data
A study using 108 patients in a surgical unit; the researchers were interested in predicting the survival time (in days) of patients undergoing a type of liver operation.
Independent Variables:
clot = blood clotting score
prog = prognostic index
enzyme = enzyme function test score
liver = liver function test score
age = age in years
gender (0 = male, 1 = female)
alch1, alch2 = indicators of alcohol usage:
  None:     alch1 = 0, alch2 = 0
  Moderate: alch1 = 1, alch2 = 0
  Heavy:    alch1 = 0, alch2 = 1

40 Survival Data

DATA survival;
   INPUT clot prog enzyme liver age gender alch1 alch2 survival;
   lgsurv = log(survival);   /* log-survival, used as the response in the second model */
   DATALINES;
...
;
PROC reg;
   MODEL survival = clot prog enzyme liver age gender alch1 alch2 / selection=adjrsq;
   output out=new r=ressurv p=predsurv;
RUN;
PROC reg;
   MODEL lgsurv = clot prog enzyme liver age gender alch1 alch2 / selection=adjrsq;
   output out=new2 r=ressvlg p=predsvlg;
RUN;

Gender: 0 = male, 1 = female
Alcohol Use:
           alch1   alch2
  None       0       0
  Moderate   1       0
  Heavy      0       1

41 Adjusted R-Square Selection Method

Dependent Variable: survival
  Variables in Model (top subsets, ordered by adjusted R-Square):
  clot prog enzyme liver alch1 alch2
  clot prog enzyme liver alch2
  clot prog enzyme liver age alch1 alch2
  clot prog enzyme liver gender alch1 alch2
  clot prog enzyme liver age alch2
  clot prog enzyme liver gender alch2
  clot prog enzyme liver age gender alch1 alch2
  clot prog enzyme liver age gender alch2
  clot prog enzyme alch1 alch2

Dependent Variable: log(survival)
  Variables in Model (top subsets, ordered by adjusted R-Square):
  clot prog enzyme liver gender alch2
  clot prog enzyme liver gender alch1 alch2
  clot prog enzyme liver alch2
  clot prog enzyme liver age gender alch2
  clot prog enzyme liver alch1 alch2
  clot prog enzyme liver age gender alch1 alch2
  clot prog enzyme liver age alch2
  clot prog enzyme liver age alch1 alch2

42 Dependent Variable: lgsurv
Number of Observations Read   108
Number of Observations Used   108

Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model                                                           <.0001
  Error
  Corrected Total

  Root MSE   R-Square   Dependent Mean   Adj R-Sq   Coeff Var

Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept                                                        <.0001
  clot
  prog                                                             <.0001
  enzyme                                                           <.0001
  liver
  gender
  alch2                                                            <.0001

6-variable model for log(survival) selected by adjusted R^2

43 Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model                                                           <.0001
  Error
  Corrected Total

  Root MSE   R-Square   Dependent Mean   Adj R-Sq   Coeff Var

Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept                                                        <.0001
  clot
  prog                                                             <.0001
  enzyme                                                           <.0001
  liver
  alch2                                                            <.0001

5-variable model for log(survival) selected by Backward Elimination

44 None: (alch1, alch2) = (0, 0), mean survival =
Moderate: (1, 0), mean survival =
Heavy: (0, 1), mean survival =
What is the role of the variable "alch2" in the model?
alch2 = 1 implies heavy alcohol use; alch2 = 0 implies none or moderate.