Entering Multidimensional Space: Multiple Regression
Peter T. Donnan, Professor of Epidemiology and Biostatistics
Statistics for Health Research
Objectives of session
- Recognise the need for multiple regression
- Understand methods of selecting variables
- Understand the strengths and weaknesses of selection methods
- Carry out multiple regression in SPSS and interpret the output
Why do we need multiple regression? Research is rarely as simple as the effect of one variable on one outcome, especially with observational data. We need to assess many factors simultaneously, giving more realistic models.
Consider the fitted plane y = a + b₁x₁ + b₂x₂, with dependent variable y and explanatory variables x₁ and x₂.
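A fitted surface like this can be estimated by ordinary least squares. Below is a minimal sketch in plain Python (simulated data; the variable names and true coefficients are invented, not taken from the slides' LDL dataset), solving the normal equations (X'X)b = X'y directly:

```python
import random

# Hedged sketch: ordinary least squares for y = a + b1*x1 + b2*x2,
# solving the 3x3 normal equations (X'X) b = X'y in plain Python.
# Simulated data; true coefficients (2.0, 0.5, -0.3) are invented.

def fit_two_predictors(x1, x2, y):
    n = len(y)
    X = [[1.0, u, v] for u, v in zip(x1, x2)]      # columns: intercept, x1, x2
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)]
           for r in range(3)]
    Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system
    M = [XtX[r] + [Xty[r]] for r in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[r][3] / M[r][r] for r in range(3)]   # [a, b1, b2]

random.seed(1)
x1 = [random.uniform(0, 10) for _ in range(200)]
x2 = [random.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * u - 0.3 * v + random.gauss(0, 0.1) for u, v in zip(x1, x2)]
a, b1, b2 = fit_two_predictors(x1, x2, y)
print(a, b1, b2)   # estimates should be close to 2.0, 0.5, -0.3
```

In practice a statistics package (here, SPSS) does this for you; the point of the sketch is only that the two slopes are estimated jointly, each adjusted for the other.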
3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age
When to use multiple regression modelling (1)
Assess the relationship between two variables while adjusting, or allowing, for another variable. Sometimes the second variable is considered a 'nuisance' factor. Example: physical activity, allowing for age and medications.
When to use multiple regression modelling (2)
In an RCT, whenever there is imbalance between the arms of the trial in subjects' baseline characteristics, e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage and co-morbidity at baseline.
When to use multiple regression modelling (2, continued)
A special case of this is adjusting for the baseline level of the primary outcome in an RCT: the baseline level is added as a factor in the regression model. This will be covered in the Trials part of the course.
When to use multiple regression modelling (3)
With observational data, to produce a prognostic equation for future prediction of risk of mortality, e.g. prediction of future risk of CHD used 10-year data from the Framingham cohort.
When to use multiple regression modelling (4)
With observational designs, to adjust for possible confounders, e.g. survival in colorectal cancer in those with hypertension, adjusted for the confounders age, gender, social deprivation and co-morbidity.
Definition of confounding: a confounder is a factor that is related to both the variable of interest (the explanatory variable) and the outcome, but is not an intermediary in a causal pathway.
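A small simulation makes this definition concrete. In this hedged plain-Python sketch (all names and coefficients invented), z plays the role of a confounder: it drives both the exposure x and the outcome y, while x has no direct effect on y. The crude slope of y on x is clearly non-zero, but removing z's contribution from both variables (the residual trick, equivalent to adding z to the regression) makes the spurious association vanish:

```python
import random

# Hypothetical confounding simulation: z -> x and z -> y, but no x -> y.
# Coefficients (1.0, 2.0) and sample size are invented for illustration.

def slope(u, v):
    """Least-squares slope of v regressed on u."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return (sum((a - mu) * (b - mv) for a, b in zip(u, v))
            / sum((a - mu) ** 2 for a in u))

random.seed(7)
z = [random.gauss(0, 1) for _ in range(5000)]     # confounder
x = [1.0 * c + random.gauss(0, 1) for c in z]     # exposure, driven by z
y = [2.0 * c + random.gauss(0, 1) for c in z]     # outcome, driven by z only

crude = slope(x, y)                               # biased: picks up z's effect
rx = [a - slope(z, x) * c for a, c in zip(x, z)]  # x with z's part removed
ry = [b - slope(z, y) * c for b, c in zip(y, z)]  # y with z's part removed
adjusted = slope(rx, ry)                          # close to 0 after adjustment
print(crude, adjusted)
```

The adjusted slope is what a multiple regression of y on x and z would report for x: this is exactly what "adjusting for a confounder" means.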
Example of confounding: smoking is related to both deprivation (the exposure) and lung cancer (the outcome), so it confounds the association between deprivation and lung cancer.
But it is also worth adjusting for factors related only to the outcome: here exercise is related to lung cancer but not to deprivation.
It is not worth adjusting for an intermediate factor in a causal pathway, e.g. exercise → blood viscosity → stroke. In a causal pathway each factor is merely a marker of the others, i.e. they are correlated (collinearity).
SPSS: add both baseline LDL and age to the Independent(s) box in linear regression
Output from SPSS linear regression on ONLY Age at baseline
Output from SPSS linear regression on ONLY Baseline LDL
Output: multiple regression. R² now improved to 13%. Both variables remain significant, independently of each other.
How do you select which variables to enter into the model?
- Usually consider what hypotheses you are testing
- If there is a main 'exposure' variable, enter it first and assess confounders one at a time
- For derivation of a CPR you want powerful predictors
- Also include clinically important factors, e.g. cholesterol in CHD prediction
- Significance is important, but it is acceptable to have an 'important' variable without statistical significance
How do you decide which variables to enter into the model? Correlations? With great difficulty!
3-dimensional scatterplot from SPSS of time from surgery in relation to Dukes' staging and age
Approaches to model building:
1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above
1) Let science or clinical factors guide selection
Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first. Next allow for age and gender. Add adherence as important? Add BMI and smoking?
1) Let science or clinical factors guide selection
This results in a model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a 'good' model?
1) Let Science or Clinical factors guide selection: Final Model Note three variables entered but not statistically significant
1) Let science or clinical factors guide selection
Is this the 'best' model? Should I leave out the non-significant factors (Model 2)?

Model   Adj R²   F from ANOVA   No. of parameters (p)
1       0.137    37.48          7
2       0.134    72.02          4

Adj R² is lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?
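The adjusted R² used here penalises model size: Adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where p counts the predictors (excluding the intercept). A quick plain-Python check with invented values (the slide does not report n or the raw R², so the numbers below are hypothetical) shows how a bigger model can win on raw R² yet lose on adjusted R²:

```python
# Adjusted R-squared: adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1).
# Sample size and raw R2 values are invented for illustration only.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A larger model with a marginally higher raw R2 can still lose on adj R2:
big = adjusted_r2(0.145, 200, 6)    # 6 predictors
small = adjusted_r2(0.140, 200, 3)  # 3 predictors
print(round(big, 3), round(small, 3))
```

This is the same trade-off the table illustrates: dropping weak predictors costs little raw fit but is rewarded by the size penalty.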
Kullback-Leibler information: Kullback and Leibler (1951) quantified the meaning of 'information', related to Fisher's 'sufficient statistics'. Basically, we have reality f and a model g to approximate f; the K-L information I(f, g) measures how well g approximates f.
Kullback-Leibler information: we want to minimise I(f, g) over candidate models to obtain the best model. I(f, g) is the information lost, or the 'distance', between reality and a model, so we need to minimise:

I(f, g) = ∫ f(x) log[ f(x) / g(x) ] dx
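For discrete distributions the integral becomes a sum, which is easy to compute directly. A hedged sketch (the probabilities below are made up for illustration):

```python
import math

# Kullback-Leibler information for discrete distributions:
#   I(f, g) = sum_x f(x) * log(f(x) / g(x))
# f plays the role of "reality", g of an approximating model.
# All probabilities below are invented for illustration.

def kl_divergence(f, g):
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

f = [0.5, 0.3, 0.2]           # "true" distribution
g_close = [0.45, 0.35, 0.20]  # model close to f: little information lost
g_far = [0.10, 0.10, 0.80]    # model far from f: much information lost

print(kl_divergence(f, f))        # 0: a perfect model loses no information
print(kl_divergence(f, g_close))
print(kl_divergence(f, g_far))
```

The model closer to f yields the smaller I(f, g), which is exactly the sense in which minimising I(f, g) picks the best model.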
Akaike’s Information Criterion: it turns out that I(f, g) is related to a very simple measure of goodness-of-fit: Akaike’s Information Criterion, or AIC.
Selection criteria
- With a large number of factors the type 1 error is large, so we are likely to end up with a model containing many variables
- Two standard criteria:
  1) Akaike’s Information Criterion (AIC)
  2) Schwarz’s Bayesian Information Criterion (BIC)
- Both penalise models with a large number of parameters; BIC’s penalty also grows with sample size
Akaike’s Information Criterion
AIC = -2 log L + 2p,
where p is the number of parameters and the -2 log-likelihood (-2LL) appears in the output. Hence AIC penalises models with a large number of variables. Select the model that minimises (-2LL + 2p).
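Given the -2 log-likelihood from the output and the parameter count p, comparing models by AIC is a one-line calculation. A hedged sketch (the -2LL values below are invented for illustration, not taken from the slides' SPSS output):

```python
# AIC = -2*log-likelihood + 2p: smaller is better.
# The -2LL values and parameter counts below are hypothetical.

def aic(minus_2ll, p):
    return minus_2ll + 2 * p

model_full = aic(510.0, 7)    # better fit (lower -2LL) but more parameters
model_small = aic(512.0, 4)   # slightly worse fit, fewer parameters
best = min(("full", model_full), ("small", model_small), key=lambda t: t[1])
print(model_full, model_small, best[0])
```

Here the smaller model wins: its loss of fit (2 units of -2LL) is outweighed by the penalty saved on three extra parameters (6 units), illustrating how AIC trades fit against complexity.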