Entering Multidimensional Space: Multiple Regression Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research.

Objectives of session
Recognise the need for multiple regression
Understand methods of selecting variables
Understand the strengths and weaknesses of selection methods
Carry out multiple regression in SPSS and interpret the output

Why do we need multiple regression? Research is rarely as simple as the effect of one variable on one outcome, especially with observational data. We need to assess many factors simultaneously, giving more realistic models.

Consider the fitted line y = a + b₁x₁ + b₂x₂, relating the dependent variable (y) to two explanatory variables (x₁ and x₂).

3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age
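The scatterplot on the slide comes from SPSS. For readers working outside SPSS, here is a minimal sketch of the same plot in Python; the file name and column names (baseline_ldl, age, min_ldl) are assumptions, not taken from the course data.

```python
# Minimal sketch (hypothetical file and column names) of the 3-D scatterplot
# of Min LDL against baseline LDL and age.
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3-D projection on older matplotlib)

df = pd.read_csv("ldl_study.csv")        # hypothetical file name

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(df["baseline_ldl"], df["age"], df["min_ldl"])
ax.set_xlabel("Baseline LDL")
ax.set_ylabel("Age")
ax.set_zlabel("Min LDL")
plt.show()
```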

When to use multiple regression modelling (1): Assess the relationship between two variables while adjusting or allowing for another variable. Sometimes the second variable is considered a ‘nuisance’ factor. Example: physical activity, allowing for age and medications.

When to use multiple regression modelling (2): In an RCT, whenever there is baseline imbalance between the arms of the trial in the characteristics of subjects, e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage and co-morbidity at baseline.

When to use multiple regression modelling (2): A special case of this is adjusting for the baseline level of the primary outcome in an RCT; the baseline level is added as a factor in the regression model. This will be covered in the Trials part of the course.

When to use multiple regression modelling (3): With observational data, in order to produce a prognostic equation for future prediction of risk of mortality, e.g. predicting future risk of CHD using 10-year data from the Framingham cohort.

When to use multiple regression modelling (4): With observational designs, in order to adjust for possible confounders, e.g. survival in colorectal cancer in those with hypertension, adjusted for the confounders age, gender, social deprivation and co-morbidity.

Definition of Confounding A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway

Example of Confounding: the association between deprivation and lung cancer is confounded by smoking, which is related to both.

But it is also worth adjusting for factors related only to the outcome: in the deprivation and lung cancer example, exercise is related to lung cancer but not to deprivation.

It is not worth adjusting for an intermediate factor in a causal pathway, e.g. blood viscosity on the pathway from exercise to stroke. In a causal pathway each factor is merely a marker of the other factors, i.e. they are correlated (collinearity).

SPSS: add both baseline LDL and age to the Independent(s) box in Linear Regression.
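The slides fit this model through the SPSS menus. As a rough equivalent outside SPSS, here is a minimal sketch using Python’s statsmodels; the file name and column names (min_ldl, baseline_ldl, age) are hypothetical.

```python
# Sketch: multiple linear regression of Min LDL on baseline LDL and age,
# mirroring adding both predictors to the Independent(s) box in SPSS.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_study.csv")                      # hypothetical file name

model = smf.ols("min_ldl ~ baseline_ldl + age", data=df).fit()
print(model.summary())       # coefficients, p-values, R-squared, adjusted R-squared
print(model.rsquared)        # proportion of variation in Min LDL explained
```

In the summary output, each coefficient’s p-value tests that predictor adjusted for the other, which is what the following slides mean by the two variables being significant independently of each other.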

Output from SPSS linear regression on ONLY Age at baseline

Output from SPSS linear regression on ONLY Baseline LDL

Output: multiple regression. R² is now improved to 13%. Both variables are still significant INDEPENDENTLY of each other.

How do you select which variables to enter the model?
Usually consider what hypotheses you are testing
If there is a main ‘exposure’ variable, enter it first and assess confounders one at a time
For derivation of a clinical prediction rule (CPR) you want powerful predictors
Also include clinically important factors, e.g. cholesterol in CHD prediction
Significance is important, but it is acceptable to have an ‘important’ variable without statistical significance
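As a sketch of the ‘main exposure first, confounders one at a time’ strategy, the following Python fragment (not from the slides; file and variable names are hypothetical) refits the model with each candidate confounder in turn and reports the change in the exposure coefficient.

```python
# Sketch: enter the main exposure first, then assess candidate confounders
# one at a time by the change they cause in the exposure coefficient.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_study.csv")                          # hypothetical file name
exposure = "baseline_ldl"                                  # main 'exposure' variable
candidates = ["age", "gender", "adherence", "bmi", "smoking"]   # hypothetical confounders

base = smf.ols(f"min_ldl ~ {exposure}", data=df).fit()
print(f"Unadjusted coefficient for {exposure}: {base.params[exposure]:.3f}")

for c in candidates:
    adj = smf.ols(f"min_ldl ~ {exposure} + {c}", data=df).fit()
    pct = 100 * (adj.params[exposure] - base.params[exposure]) / base.params[exposure]
    print(f"Adjusted for {c:9s}: {adj.params[exposure]:.3f} ({pct:+.1f}% change)")
```

A common rule of thumb treats a change of around 10% or more in the exposure coefficient as evidence of meaningful confounding.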

How do you decide which variables to enter in the model? Correlations? With great difficulty!

3-dimensional scatterplot from SPSS of Time from Surgery in relation to Dukes’ staging and age

Approaches to model building
1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above

1) Let scientific or clinical factors guide selection. Baseline LDL cholesterol is an important factor determining the LDL outcome, so enter it first. Next allow for age and gender. Add adherence as important? Add BMI and smoking?

1) Let scientific or clinical factors guide selection. This results in a model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a ‘good’ model?

1) Let scientific or clinical factors guide selection: final model. Note that three variables are entered but not statistically significant.

1) Let scientific or clinical factors guide selection. Is this the ‘best’ model? Should I leave out the non-significant factors (Model 2)? [Table comparing the two models on: Adj R², F from ANOVA, number of parameters, p.] The adjusted R² is lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?
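A minimal sketch of this comparison (not from the slides; file and variable names are hypothetical, as is the choice of which factors to drop): adjusted R² for the full and reduced models, plus an F-test for the nested models.

```python
# Sketch: compare the full model (Model 1) with a reduced model (Model 2)
# that drops the non-significant factors, using adjusted R-squared and an
# F-test for nested models.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("ldl_study.csv")   # hypothetical file name

full = smf.ols("min_ldl ~ baseline_ldl + age + gender + adherence + bmi + smoking",
               data=df).fit()
reduced = smf.ols("min_ldl ~ baseline_ldl + age + gender", data=df).fit()

print("Adjusted R2, full:   ", round(full.rsquared_adj, 3))
print("Adjusted R2, reduced:", round(reduced.rsquared_adj, 3))
print(anova_lm(reduced, full))      # F-test: do the extra terms improve the fit?
```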

Kullback-Leibler Information. Kullback and Leibler (1951) quantified the meaning of ‘information’, related to Fisher’s ‘sufficient statistics’. Basically, we have reality f and a model g to approximate f; the K-L information is I(f,g).

Kullback-Leibler Information. We want to minimise I(f,g) to obtain the best model among candidate models. I(f,g) is the information lost, or the ‘distance’, between reality and a model, so we need to minimise I(f,g) = ∫ f(x) log[f(x) / g(x)] dx.

Akaike’s Information Criterion. It turns out that the function I(f,g) is related to a very simple measure of goodness-of-fit: Akaike’s Information Criterion or AIC.

Selection Criteria
With a large number of factors the type 1 error is large, so we are likely to end up with a model containing many variables.
Two standard criteria:
1) Akaike’s Information Criterion (AIC)
2) Schwarz’s Bayesian Information Criterion (BIC)
Both penalise models with a large number of variables; the BIC penalty also increases with sample size.

Akaike’s Information Criterion
AIC = -2*log likelihood + 2p, where p = the number of parameters and the -2*log likelihood (-2LL) is given in the output.
Hence AIC penalises models with a large number of variables.
Select the model that minimises (-2LL + 2p).
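A minimal sketch of the calculation, assuming the hypothetical models and variable names used above: AIC and BIC can be read directly from statsmodels, or AIC computed by hand as -2LL + 2p.

```python
# Sketch: AIC = -2*log-likelihood + 2p for two candidate models.
# Packages differ in exactly how they count p (e.g. whether the residual
# variance is included), so only compare AIC values from the same package.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_study.csv")   # hypothetical file name

models = {
    "full": smf.ols("min_ldl ~ baseline_ldl + age + gender + adherence + bmi + smoking",
                    data=df).fit(),
    "reduced": smf.ols("min_ldl ~ baseline_ldl + age + gender", data=df).fit(),
}

for name, m in models.items():
    p = m.df_model + 1                   # number of slopes plus the intercept
    aic_manual = -2 * m.llf + 2 * p      # -2LL + 2p, as on the slide
    print(f"{name:8s} AIC = {m.aic:.1f} (manual {aic_manual:.1f}), BIC = {m.bic:.1f}")

# Prefer the model with the smaller AIC (or BIC).
```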