Entering Multidimensional Space: Multiple Regression Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research

Objectives of session
Recognise the need for multiple regression
Understand methods of selecting variables
Understand strengths and weaknesses of selection methods
Carry out multiple regression in SPSS and interpret the output

Why do we need multiple regression? Research questions are rarely as simple as the effect of one variable on one outcome, especially with observational data. We need to assess many factors simultaneously, giving more realistic models.

Consider the fitted model y = a + b₁x₁ + b₂x₂, with dependent variable y and explanatory variables x₁ and x₂.
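This plane can be fitted outside SPSS as well. Below is a minimal sketch in Python with statsmodels, using simulated data; the names x1, x2 and y are placeholders, not variables from the course dataset:

```python
# Minimal sketch: fit y = a + b1*x1 + b2*x2 on simulated data
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(scale=0.8, size=n)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)     # a (Intercept), b1 and b2
print(fit.rsquared)   # proportion of variance in y explained by x1 and x2
```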

3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age

When to use multiple regression modelling (1)
Assess the relationship between two variables while adjusting or allowing for another variable. Sometimes the second variable is considered a ‘nuisance’ factor. Example: physical activity allowing for age and medications.

When to use multiple regression modelling (2)
In an RCT, whenever there is baseline imbalance between arms of the trial in the characteristics of subjects, e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage and co-morbidity.

When to use multiple regression modelling (2), continued
A special case of this is adjusting for the baseline level of the primary outcome in an RCT. The baseline level is added as a factor in the regression model, as in the sketch below. This will be covered in the Trials part of the course.
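As a hedged illustration of this baseline adjustment (an ANCOVA-style model, not code from the course), here is a sketch on a simulated two-arm trial; the column names treatment, baseline and outcome are hypothetical:

```python
# Sketch: RCT analysis adjusting for the baseline level of the outcome
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
trial = pd.DataFrame({
    "treatment": rng.integers(0, 2, size=n),            # 0 = control, 1 = active
    "baseline": rng.normal(loc=4.0, scale=1.0, size=n), # baseline outcome level
})
# Follow-up outcome tracks baseline; the active arm lowers it by 0.5 on average
trial["outcome"] = (0.8 * trial["baseline"] - 0.5 * trial["treatment"]
                    + rng.normal(scale=0.6, size=n))

# Baseline enters the model as a covariate alongside the randomised treatment
fit = smf.ols("outcome ~ baseline + C(treatment)", data=trial).fit()
print(fit.params)   # C(treatment)[T.1] is the baseline-adjusted treatment effect
```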

When to use multiple regression modelling (3)
With observational data, in order to produce a prognostic equation for future prediction of risk of mortality, e.g. predicting future risk of CHD using 10-year data from the Framingham cohort.

When to use multiple regression modelling (4)
With observational data, in order to adjust for possible confounders, e.g. survival in colorectal cancer in those with hypertension, adjusted for age, gender, social deprivation and co-morbidity.

Definition of Confounding A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway

Example of Confounding
Deprivation → Lung Cancer, with Smoking related to both the exposure (deprivation) and the outcome (lung cancer).

But it is also worth adjusting for factors related only to the outcome:
Deprivation → Lung Cancer, with Exercise related only to the outcome (lung cancer).

Not worth adjusting for an intermediate factor in a causal pathway:
Exercise → Blood viscosity → Stroke
In a causal pathway each factor is merely a marker of the other factors, i.e. correlated: collinearity.
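A small simulation makes the confounding triangle concrete. In this hedged sketch (illustrative names only), a confounder z, playing the role of smoking, drives both the exposure x and the outcome y; the crude x-y association largely vanishes once z is adjusted for:

```python
# Sketch: a confounder z creates a spurious crude association between x and y
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1000
z = rng.normal(size=n)             # confounder (e.g. smoking)
x = 0.8 * z + rng.normal(size=n)   # exposure, related to the confounder
y = 1.2 * z + rng.normal(size=n)   # outcome, caused by z but not by x
df = pd.DataFrame({"x": x, "y": y, "z": z})

crude = smf.ols("y ~ x", data=df).fit()
adjusted = smf.ols("y ~ x + z", data=df).fit()
print(crude.params["x"])      # roughly 0.6: spurious association
print(adjusted.params["x"])   # roughly 0: confounding removed by adjustment
```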

SPSS: Add both baseline LDL and age in the independent box in linear regression

Output from SPSS linear regression on Age at baseline

Output from SPSS linear regression on Baseline LDL

Output: multiple regression. R² now improved to 13%. Both variables are still significant independently of each other.

How do you select which variables to enter the model?
Usually consider what hypotheses you are testing
If there is a main ‘exposure’ variable, enter it first and assess confounders one at a time
For derivation of a clinical prediction rule (CPR) you want powerful predictors
Also include clinically important factors, e.g. cholesterol in CHD prediction
Significance is important, but it is acceptable to keep an ‘important’ variable without statistical significance

How do you decide which variables to enter in the model? Correlations? With great difficulty!

3-dimensional scatterplot from SPSS of Time from Surgery in relation to Dukes' staging and age

Approaches to model building
1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above

1) Let Science or Clinical factors guide selection
Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first. Next allow for age and gender. Add adherence as important? Add BMI and smoking?

1) Let Science or Clinical factors guide selection
Results in a model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a ‘good’ model?

1) Let Science or Clinical factors guide selection: Final Model
Note that three of the variables entered are not statistically significant.

1) Let Science or Clinical factors guide selection
Is this the ‘best’ model? Should I leave out the non-significant factors (Model 2)?
Comparing the models on adjusted R², F from ANOVA and number of parameters p: adjusted R² is lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?
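For readers working outside SPSS, a hedged sketch of the same kind of comparison: fit full and reduced models on simulated data and compare adjusted R², an F-test for the dropped terms, and the parameter counts (variable names are placeholders):

```python
# Sketch: compare a full model with a reduced model dropping two terms
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 0.7 * df["x1"] + 0.4 * df["x2"] + rng.normal(size=n)  # x3, x4 are noise

full = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()
reduced = smf.ols("y ~ x1 + x2", data=df).fit()

print(full.rsquared_adj, reduced.rsquared_adj)   # adjusted R-squared of each model
print(anova_lm(reduced, full))                   # F-test: do x3 and x4 add anything?
```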

Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning of ‘information’, related to Fisher’s ‘sufficient statistics’. Basically we have reality f, and a model g to approximate f; the K-L information is I(f, g).

Kullback-Leibler Information
We want to minimise I(f, g) to obtain the best model over other models. I(f, g) is the information lost, or ‘distance’, between reality and a model, so we need to minimise:
I(f, g) = ∫ f(x) log[ f(x) / g(x) ] dx

Akaike’s Information Criterion
It turns out that the function I(f, g) is related to a very simple measure of goodness-of-fit: Akaike’s Information Criterion, or AIC.

Selection Criteria
With a large number of factors the type 1 error is large, so we are likely to end up with a model containing many variables.
Two standard criteria:
1) Akaike’s Information Criterion (AIC)
2) Schwarz’s Bayesian Information Criterion (BIC)
Both penalise models with a large number of variables; the BIC penalty also grows with sample size.

Akaike’s Information Criterion
AIC = −2 log likelihood + 2p, where p = number of parameters; the −2 log likelihood is given in the output. Hence AIC penalises models with a large number of variables. Select the model that minimises (−2LL + 2p).
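The AIC arithmetic is easy to verify from any fitted model. A minimal sketch, assuming Python/statsmodels, whose llf attribute holds the log likelihood (simulated data, illustrative only):

```python
# Sketch: AIC = -2*log likelihood + 2p, computed by hand and checked
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 100
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 2.0 + 0.5 * df["x"] + rng.normal(size=n)

fit = smf.ols("y ~ x", data=df).fit()
p = fit.df_model + 1            # number of regression parameters (slopes + intercept)
aic_by_hand = -2 * fit.llf + 2 * p
print(aic_by_hand, fit.aic)     # the two values should agree
```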

Generalized linear models
Unfortunately the standard REGRESSION procedure in SPSS does not give these statistics. Need to use Analyze → Generalized Linear Models…

Generalized linear models: default is linear
Add Min LDL achieved as the dependent variable, as in REGRESSION in SPSS. Next go to predictors…

Generalized linear models: Predictors
WARNING! Make sure you add the predictors in the correct box:
Categorical variables in the FACTORS box
Continuous variables in the COVARIATES box

Generalized linear models: Model
Add all factors and covariates in the model as main effects.

Generalized Linear Models Parameter Estimates Note identical to REGRESSION output

Generalized Linear Models Goodness-of-fit
Note the output gives the log likelihood and AIC = 2835 (AIC = −2 × (−1410.5) + 2 × 7 = 2835). The footnote explains that a smaller AIC is ‘better’.

Let Science or Clinical factors guide selection: ‘Optimal’ model
The log likelihood is a measure of GOODNESS-OF-FIT. Seek the ‘optimal’ model that maximises the log likelihood or minimises the AIC.
Comparing Model 1 (full model) and Model 2 (non-significant variables removed) on −2LL, p and AIC: the change in AIC is 1.6.

1) Let Science or Clinical factors guide selection
Key points:
1. Results demonstrate a significant association with baseline LDL, age and adherence
2. Difficult choices with gender, smoking and BMI
3. AIC only changes by 1.6 when they are removed
4. Generally changes of 4 or more in AIC are considered important

1) Let Science or Clinical factors guide selection
Key points:
1. Conclude there is little to choose between the models
2. AIC is actually lower with the larger model, and gender and BMI are considered important factors, so keep the larger model, but this has to be justified
3. Model building by hand is manual, logical, transparent and under your control

2) Use automatic selection procedures
These are based on automatic, mechanical algorithms, usually related to statistical significance. Common ones are stepwise, forward and backward elimination. They can be selected in SPSS using ‘Method’ in the dialogue box; a sketch of the underlying idea follows.
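A hedged sketch of what a forward/stepwise procedure does under the hood, using a p-value entry criterion of 0.05 on simulated data (simplified: the removal step of full stepwise selection is omitted, and all names are placeholders):

```python
# Sketch: forward selection, adding the most significant candidate at each step
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 0.8 * df["x1"] + 0.3 * df["x2"] + rng.normal(size=n)

remaining = ["x1", "x2", "x3", "x4"]
selected = []
while remaining:
    # p-value of each candidate when added to the current model
    pvals = {}
    for var in remaining:
        fit = smf.ols("y ~ " + " + ".join(selected + [var]), data=df).fit()
        pvals[var] = fit.pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:     # entry criterion: stop if nothing qualifies
        break
    selected.append(best)
    remaining.remove(best)

print("Selected:", selected)    # expect x1 and x2 to enter; x3, x4 stay out
```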

2) Use automatic selection procedures (e.g. Stepwise)
Select Method = Stepwise

2) Use automatic selection procedures (e.g. Stepwise)
[SPSS output: 1st step, 2nd step, and the final model]

2) Change in AIC with stepwise selection (Note: only available from Generalized Linear Models)
Table columns: Step; Model; Log likelihood; AIC; Change in AIC; No. of parameters p
Step 1: Baseline LDL; Step 2: + Adherence; Step 3: + Age

2) Advantages and disadvantages of stepwise
Advantages:
Simple to implement
Gives a parsimonious model
Selection is certainly objective
Disadvantages:
Unstable selection: stepwise considers many models that are very similar
The p-value on entry no longer holds once the procedure is finished, so p-values are exaggerated
Predictions in an external dataset are usually worse for stepwise procedures

2) Automatic procedures: Backward elimination
Backward elimination starts by removing the least significant factor from the full model and has a few advantages over forward selection:
The modeller has to consider the ‘full’ model and sees results for all factors simultaneously
Correlated factors can remain in the model (in forward methods they may not even enter)
Criteria for removal tend to be more lax in backward elimination, so you end up with more parameters
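By contrast, a hedged sketch of backward elimination: start from the full model and repeatedly drop the least significant term, with a deliberately more lax removal threshold of 0.10 (simulated data, illustrative only):

```python
# Sketch: backward elimination, removing the least significant term each step
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 0.8 * df["x1"] + 0.3 * df["x2"] + rng.normal(size=n)

candidates = ["x1", "x2", "x3", "x4"]
while candidates:
    fit = smf.ols("y ~ " + " + ".join(candidates), data=df).fit()
    pvals = fit.pvalues.drop("Intercept")   # ignore the intercept's p-value
    worst = pvals.idxmax()
    if pvals[worst] < 0.10:                 # all remaining terms pass: stop
        break
    candidates.remove(worst)

print("Final model terms:", candidates)
```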

2) Use automatic selection procedures (e.g. Backward)
Select Method = Backward

2) Backward elimination in SPSS
[SPSS output: 1st step, Gender removed; 2nd step, BMI removed; final model]

Summary of automatic selection
Automatic selection may not give the ‘optimal’ model (it may leave out important factors)
Different methods may give different results (forward vs. backward elimination)
Backward elimination is preferred as it is less stringent
Too easily fitted in SPSS!
Model assessment still requires some thought

3) A mixture of automatic procedures and self selection
Use automatic procedures as a guide
Think about which factors are important
Add ‘important’ factors
Do not blindly follow statistical significance
Consider AIC

Summary of model selection
Selection of factors for multiple linear regression models requires some judgement
Automatic procedures are available, but treat their results with caution
They are easily fitted in SPSS
Check the AIC or log likelihood for fit

Summary
Multiple regression models are the most used analytical tool in quantitative research
They are easily fitted in SPSS
Model assessment requires some thought
Parsimony is better: Occam’s Razor

Remember Occam’s Razor
‘Entia non sunt multiplicanda praeter necessitatem’
‘Entities must not be multiplied beyond necessity’
William of Ockham, 14th-century friar and logician

Summary
After fitting any model, check the assumptions:
Functional form: linearity or not
Check residuals for normality
Check residuals for outliers
All accomplished within SPSS
See the publication below for further information:
Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A Go-DARTS study. Pharmacogenetics and Genomics 2008; 18.
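The checks above can also be scripted. A minimal sketch of residual diagnostics, assuming Python with statsmodels and scipy (simulated data; the outlier threshold of 3 is a common rule of thumb, not from the slides):

```python
# Sketch: residual checks after fitting a linear regression
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1.0 + 0.6 * df["x"] + rng.normal(size=n)
fit = smf.ols("y ~ x", data=df).fit()

resid = fit.resid
print(stats.shapiro(resid))    # Shapiro-Wilk test of residual normality
student = fit.get_influence().resid_studentized_internal
print("Potential outliers:", np.where(np.abs(student) > 3)[0])
# For functional form, plot resid against fit.fittedvalues and look for curvature
```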

Practical on Multiple Regression
Read in ‘LDL Data.sav’
1) Try fitting a multiple regression model for Min LDL obtained, using forward and backward elimination. Are the results the same? Add other factors beyond those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?