Tests for Continuous Outcomes II


Overview of common statistical tests

Outcome variable: Continuous (e.g. pain scale, cognitive function)
  Independent observations: t-test; ANOVA; linear correlation; linear regression
  Correlated observations: paired t-test; repeated-measures ANOVA; mixed models/GEE modeling
  Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship

Outcome variable: Binary or categorical (e.g. fracture yes/no)
  Independent observations: relative risks; chi-square test; logistic regression
  Correlated observations: McNemar’s test; conditional logistic regression; GEE modeling
  Assumptions: sufficient numbers in each cell (>=5)

Outcome variable: Time-to-event (e.g. time to fracture)
  Independent observations: Kaplan-Meier statistics; Cox regression
  Correlated observations: n/a
  Assumptions: Cox regression assumes proportional hazards between groups


Continuous outcome (means)

Outcome variable: Continuous (e.g. pain scale, cognitive function)

Independent observations:
  t-test: compares means between two independent groups
  ANOVA: compares means between more than two independent groups
  Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
  Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:
  Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
  Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
  Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
  Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
  Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
  Kruskal-Wallis test: non-parametric alternative to ANOVA
  Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

Divalproex vs. placebo for treating bipolar depression Davis et al. “Divalproex in the treatment of bipolar depression: A placebo controlled study.” J Affective Disorders 85 (2005) 259-266.

Repeated-measures ANOVA
Statistical question: Do subjects in the treatment group have greater reductions in depression scores over time than those in the control group?
What is the outcome variable? Depression score
What type of variable is it? Continuous
Is it normally distributed? Yes
Are the observations correlated? Yes, there are multiple measurements on each person
How many time points are being compared? >2
→ repeated-measures ANOVA

Repeated-measures ANOVA
For before-and-after studies, a paired t-test will suffice. For more than two time periods, you need repeated-measures ANOVA. Running serial paired t-tests is incorrect, because each additional test increases your overall type I error.
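To see roughly how fast the type I error grows with serial testing, here is a minimal sketch. It assumes the tests are independent and each uses alpha = 0.05; paired tests on the same subjects are not truly independent, so treat the number as an illustration of the direction of the problem rather than an exact rate. The function name `familywise_error` is made up for this example.

```python
def familywise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Four time points allow 4*3/2 = 6 distinct pairwise paired t-tests.
n_time_points = 4
n_pairs = n_time_points * (n_time_points - 1) // 2

print(n_pairs, round(familywise_error(n_pairs), 3))
```

With six comparisons the chance of at least one spurious "significant" result is already above 25% under these assumptions.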

Repeated-measures ANOVA
Answers the following questions, taking into account the correlation within subjects:
Are there significant differences across time periods?
Are there significant differences between groups (= your categorical predictor)?
Are there significant differences between groups in their changes over time?

Two groups (e.g., treatment, placebo)

id   group   time1   time2   time3   time4
1    A       31      29      15      26
2    A       24      28      20      32
3    A       14      20      28      30
4    B       38      34      30      34
5    B       25      29      25      29
6    B       30      28      16      34

Hypothetical data: measurements of depression scores over time in treatment (A) and placebo (B).

Profile plots by group [Figure: profile plots for groups A and B]

Mean plots by group [Figure: mean plots for groups A and B] Repeated-measures ANOVA tells you if and how these two profile plots differ…
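The mean plot can be reproduced numerically from the hypothetical table of depression scores. The sketch below only computes the per-group mean at each time point; it does not run the repeated-measures ANOVA itself. The helper name `mean_profile` is an illustrative choice.

```python
from statistics import mean

# Hypothetical depression scores: (id, group, time1, time2, time3, time4)
rows = [
    (1, "A", 31, 29, 15, 26),
    (2, "A", 24, 28, 20, 32),
    (3, "A", 14, 20, 28, 30),
    (4, "B", 38, 34, 30, 34),
    (5, "B", 25, 29, 25, 29),
    (6, "B", 30, 28, 16, 34),
]


def mean_profile(rows, group):
    """Mean score at each time point for one group."""
    scores = [r[2:] for r in rows if r[1] == group]
    # zip(*scores) turns the per-subject rows into per-time-point columns.
    return [mean(col) for col in zip(*scores)]


profile_a = mean_profile(rows, "A")
profile_b = mean_profile(rows, "B")
print([round(m, 1) for m in profile_a])
print([round(m, 1) for m in profile_b])
```

The two lists are the points that the mean plot connects for groups A and B.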

Possible questions…
Overall, are there significant differences between time points? From plots: looks like some differences (time3 and 4 look different)
Do the two groups differ at any time points? From plots: certainly at baseline; some difference everywhere
Do the two groups differ in their responses over time? From plots: their response profile looks similar over time, though A and B are closer by the end.

repeated-measures ANOVA…
Overall, are there significant differences between time points? Time factor
Do the two groups differ at any time points? Group factor
Do the two groups differ in their responses over time? Group x time factor

From rANOVA analysis…
Overall, are there significant differences between time points? No, Time not statistically significant (p=.1743)
Do the two groups differ at any time points? No, Group not statistically significant (p=.1408)
Do the two groups differ in their responses over time? No, not even close; Group*Time (p-value>.60)

rANOVA Time is significant. Group*time is significant. Group is not significant.

rANOVA Time is not significant. Group*time is not significant. Group IS significant.

rANOVA Time is significant. Group is not significant. Time*group is not significant.

Homeopathy vs. placebo in treating pain after surgery Day of surgery Mean pain assessments by visual analogue scales (VAS) p>.05; rANOVA (Group x Time) Days 1-7 after surgery (morning and evening) Copyright ©1995 BMJ Publishing Group Ltd. Lokken, P. et al. BMJ 1995;310:1439-1442

Pint of milk vs. control on bone acquisition in adolescent females Mean (SE) percentage increases in total body bone mineral and bone density over 18 months. P values are for the differences between groups by repeated measures analysis of variance Cadogan, J. et al. BMJ 1997;315:1255-1260 Copyright ©1997 BMJ Publishing Group Ltd.

Counseling vs. control on smoking in pregnancy P<.05; rANOVA Copyright ©2000 BMJ Publishing Group Ltd. Hovell, M. F et al. BMJ 2000;321:337-342

Review Question 1
In a study of depression, I measured depression score (a continuous, normally distributed variable) at baseline; 1 month; 6 months; and 12 months. What statistical test will best tell me whether or not depression improved between baseline and the end of the study?
Repeated-measures ANOVA.
One-way ANOVA.
Two-sample t-test.
Paired t-test.
Wilcoxon rank-sum test.

Review Question 2
In the same depression study, what statistical test will best tell me whether or not two treatments for depression had different effects over time?
Repeated-measures ANOVA.
One-way ANOVA.
Two-sample t-test.
Paired t-test.
Wilcoxon rank-sum test.


Example: class data
Political Leanings and Rating of Obama: r = 0.39148, p = .07

Example: class data
Political Leanings and Rating of Health Care Law: r = -0.00768, p = .97

Example 2: pain and injection pressure r=.75, p<.0001

Correlation coefficient
Statistical question: Is injection pressure related to pain?
What is the outcome variable? VAS pain score
What type of variable is it? Continuous
Is it normally distributed? Yes
Are the observations correlated? No
Are groups being compared? No; the independent variable is also continuous
→ correlation coefficient

New concept: Covariance

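Interpreting Covariance
Covariance between two random variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not guarantee independence)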
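A small numeric illustration of the three cases, using made-up data and the sample covariance with an n-1 denominator (an assumption; the slides do not state which denominator they use). The last case shows that a covariance of zero only rules out a linear relationship: there y is completely determined by x.

```python
def covariance(xs, ys):
    """Sample covariance with the n-1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [-2, -1, 0, 1, 2]

same_direction = covariance(xs, [1, 2, 3, 4, 5])   # y rises with x
opposite = covariance(xs, [5, 4, 3, 2, 1])         # y falls as x rises
curved = covariance(xs, [x * x for x in xs])       # y = x^2: dependent, but not linearly

print(same_direction, opposite, curved)
```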

Correlation coefficient Pearson’s Correlation Coefficient is standardized covariance (unitless):
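In standard notation, and assuming sample covariance and sample variances computed with the same denominator, "standardized covariance" can be written as follows. This is the textbook definition rather than a formula taken from the slide itself; it agrees with the worked example in the deck, where r = 163/(10*33).

```latex
r \;=\; \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)}\,\sqrt{\operatorname{var}(y)}}
  \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Dividing by both standard deviations cancels the units, which is why r is unitless.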

Correlation
Measures the relative strength of the linear relationship between two variables
Unit-less
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship

Scatter Plots of Data with Various Correlation Coefficients [Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0] ** Next 4 slides from “Statistics for Managers” 4th Edition, Prentice-Hall 2004

Linear Correlation [Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]

Linear Correlation [Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]

Linear Correlation [Figure: scatter plots of Y against X showing no relationship]

Recall: correlation coefficient (large n) Hypothesis test: Confidence Interval

Correlation coefficient (small n) Hypothesis test: Confidence Interval

Review Problem 3
What’s a good guess for the Pearson’s correlation coefficient (r) for this scatter plot?
–1.0
+1.0
-.5
-.1


Example: class data
Political Leanings and Rating of Obama: Expected Obama Rating = 50.5 + 0.28*politics.

Example 2: pain and injection pressure R-squared = correlation coefficient squared. Meaning: the percent of variance in Y that is “explained by” X.

Simple linear regression
Statistical question: Does injection pressure “predict” pain?
What is the outcome variable? VAS pain score
What type of variable is it? Continuous
Is it normally distributed? Yes
Are the observations correlated? No
Are groups being compared? No; the independent variable is also continuous
→ simple linear regression

Linear regression
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y).

What is “Linear”?
Remember this: Y = mX + B? (m is the slope; B is the intercept.)

What’s Slope? A slope of 0.28 means that every 1-unit change in X yields a .28-unit change in Y.

Simple linear regression
The linear regression model: Ratings of Obama = 50.5 + 0.28*(political bent)
50.5 is the intercept (the expected rating at x=0, not shown on graph); 0.28 is the slope.
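As a sketch of how the fitted equation is used, the function below plugs a value of the predictor into the reported intercept and slope. The scale of the "political bent" variable is not given in the slides, so the inputs 10 and 11 are arbitrary illustrations, and the function name is made up.

```python
def expected_obama_rating(politics: float) -> float:
    """Expected rating from the fitted line reported on the slide."""
    intercept = 50.5   # expected rating when politics = 0
    slope = 0.28       # change in expected rating per 1-unit change in politics
    return intercept + slope * politics


# A 1-unit increase in the predictor moves the expected rating by the slope.
difference = expected_obama_rating(11) - expected_obama_rating(10)
print(expected_obama_rating(0), round(difference, 2))
```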

Simple linear regression Wake-up Time versus Exercise Expected Wake-up Time = 8:06 - 0:11*Hours of exercise/week Every additional hour of weekly exercise costs you about 11 minutes of sleep in the morning (p=.0015).

The linear regression model…
y_i = α + β*x_i + random error_i
The part α + β*x_i is fixed (exactly on the line); the random error follows a normal distribution.

Assumptions (or the fine print)
Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent

The standard error of Y given X (Sy/x) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

Recall example: cognitive function and vitamin D Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST). 1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

Distribution of vitamin D Mean= 63 nmol/L Standard deviation = 33 nmol/L

Distribution of DSST Normally distributed Mean = 28 points Standard deviation = 10 points

Four hypothetical datasets
I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
0 points per 10 nmol/L
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L

Dataset 1: no relationship

Dataset 2: weak relationship

Dataset 3: weak to moderate relationship

Dataset 4: moderate relationship

The “Best fit” line Regression equation: E(Yi) = 28 + 0*vit Di (in 10 nmol/L)

The “Best fit” line Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is! Regression equation: E(Yi) = 26 + 0.5*vit Di (in 10 nmol/L)

The “Best fit” line Regression equation: E(Yi) = 22 + 1.0*vit Di (in 10 nmol/L)

The “Best fit” line Regression equation: E(Yi) = 20 + 1.5*vit Di (in 10 nmol/L) Note: all the lines go through the point (63, 28)!

Estimating the intercept and slope: least squares estimation
A little calculus….
What are we trying to estimate? β, the slope, from $y_i = \alpha + \beta x_i + \text{error}_i$.
What’s the constraint? We are trying to minimize the squared distance (hence the “least squares”) between the observations themselves and the predicted values, $y_i - \hat{y}_i$ (also called the “residuals”, or left-over unexplained variability).
$\text{Difference}_i = y_i - (\beta x_i + \alpha)$
$\text{Difference}_i^2 = (y_i - (\beta x_i + \alpha))^2$
Find the β that gives the minimum sum of the squared differences. How do you minimize a function? Take the derivative; set it equal to zero; and solve. Typical max/min problem from calculus….
From here it takes a little math trickery to solve for β…

Resulting formulas…
Slope (beta coefficient) = $\hat{\beta} = \dfrac{\text{Cov}(x, y)}{\text{Var}(x)}$
Intercept = $\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$
Regression line always goes through the point: $(\bar{x}, \bar{y})$
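The closed-form least squares estimates (slope = Cov(x, y)/Var(x), intercept = mean of y minus slope times mean of x) can be sketched in a few lines. The five data points are made up for illustration, and `least_squares` is a hypothetical helper name.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # The n-1 denominators of Cov(x, y) and Var(x) cancel, so sums suffice.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = least_squares(xs, ys)

# The fitted line passes through the point of means (x_bar, y_bar) = (3, 4).
value_at_mean_x = intercept + slope * 3
print(round(intercept, 2), round(slope, 2), round(value_at_mean_x, 2))
```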

Relationship with correlation
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y). The slope and the correlation coefficient are linked by $r = \hat{\beta} \cdot SD_x / SD_y$.

Example: dataset 4
SDx = 33 nmol/L
SDy = 10 points
Cov(X,Y) = 163 points*nmol/L
Beta = 163/33^2 = 0.15 points per nmol/L = 1.5 points per 10 nmol/L
r = 163/(10*33) = 0.49
Or r = 0.15 * (33/10) ≈ 0.49 (using the unrounded slope, 0.1497)
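The arithmetic for dataset 4 can be checked directly from the three summary statistics on the slide. This sketch uses only those reported numbers; the variable names are illustrative.

```python
sd_x = 33.0     # standard deviation of vitamin D, nmol/L
sd_y = 10.0     # standard deviation of DSST, points
cov_xy = 163.0  # covariance, points*nmol/L

beta = cov_xy / sd_x ** 2         # slope: covariance divided by the variance of X
r = cov_xy / (sd_x * sd_y)        # correlation: covariance divided by both SDs
r_from_beta = beta * sd_x / sd_y  # the same r, recovered from the slope

print(round(beta, 4), round(r, 4), round(r_from_beta, 4))
```

Both routes give the same r, because the slope is just the correlation rescaled by the ratio of the standard deviations.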

Significance testing… Slope
Distribution of slope ~ $T_{n-2}(\beta, \text{s.e.}(\hat{\beta}))$
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
$T_{n-2} = \dfrac{\hat{\beta} - 0}{\text{s.e.}(\hat{\beta})}$

Formula for the standard error of beta (you will not have to calculate by hand!):
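For reference, the usual textbook expression for the standard error of the slope in simple linear regression is shown below. It is the standard formula, not necessarily typeset the way the original slide showed it; here $\hat{y}_i$ is the predicted value for observation $i$ and $s_{y/x}$ is the standard error of Y given X.

```latex
\text{s.e.}(\hat{\beta})
  \;=\; \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \,/\, (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
  \;=\; \frac{s_{y/x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The standard error shrinks when the residual scatter is small and when the X values are widely spread.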

Example: dataset 4
Standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p<.0001
95% Confidence interval = 0.09 to 0.21
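The test statistic and confidence interval for dataset 4 follow from the slope and its standard error. In this sketch the critical value 1.984 is an approximate two-sided 95% t value for 98 degrees of freedom, supplied here as an assumption; the slide's interval is consistent with rounding it to about 2.

```python
beta_hat = 0.15   # estimated slope, points per nmol/L
se_beta = 0.03    # standard error of the slope
t_crit = 1.984    # assumed: approximate 97.5th percentile of t with 98 df

t_stat = beta_hat / se_beta          # tests H0: slope = 0
ci_low = beta_hat - t_crit * se_beta
ci_high = beta_hat + t_crit * se_beta

print(round(t_stat, 2), round(ci_low, 2), round(ci_high, 2))
```

Because the interval excludes 0, it leads to the same conclusion as the significance test.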

Review Problem 4
Researchers fit a regression equation to predict baby weights from weeks of gestation:
μY/X = 100 grams/week * X weeks (the expected value of Y at a given X)
What is the expected weight of a baby born at 22 weeks?
2000g
2100g
2200g
2300g
2400g

Review Problem 5
The model predicts that:
All babies born at 22 weeks will weigh 2200 grams.
Babies born at 22 weeks will have a mean weight of 2200 grams with some variation.
Both of the above.
None of the above.

Residual Analysis: check assumptions
The residual for observation i, e_i, is the difference between its observed and predicted value
Check the assumptions of regression by examining the residuals:
Examine for linearity assumption
Examine for constant variance for all levels of X (homoscedasticity)
Evaluate normal distribution assumption
Evaluate independence assumption
Graphical Analysis of Residuals: can plot residuals vs. X

Predicted values…
For Vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L): $\hat{y} = 20 + 1.5 \times 9.5 = 34.25 \approx 34$

Residual = observed - predicted [Figure: scatter plot marking the predicted value of 34 at X = 95 nmol/L]

Residual Analysis for Linearity [Figure: residuals plotted against x for a relationship that is not linear and for one that is linear] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual Analysis for Homoscedasticity [Figure: residuals plotted against x with non-constant variance and with constant variance] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual Analysis for Independence [Figure: residuals plotted against X for observations that are not independent and for observations that are independent] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual plot, dataset 4

Review Problem 6
A medical journal article reported the following linear regression equation:
Cholesterol = 150 + 2*(age past 40)
Based on this model, what is the expected cholesterol for a 60 year old?
150
370
230
190
200

Review Problem 7
If a particular 60 year old in your study sample had a cholesterol of 250, what is his/her residual?
+50
-50
+60
-60
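The two review problems above can be worked through with the definition residual = observed - predicted. The sketch simply evaluates the published equation; the function name is made up.

```python
def expected_cholesterol(age: int) -> int:
    """Expected cholesterol from the reported equation: 150 + 2*(age past 40)."""
    return 150 + 2 * (age - 40)


predicted = expected_cholesterol(60)   # model's expected value for a 60 year old
observed = 250                         # this particular subject's measured value
residual = observed - predicted        # positive: the subject lies above the line

print(predicted, residual)
```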

Multiple linear regression…
What if age is a confounder here?
Older men have lower vitamin D
Older men have poorer cognition
“Adjust” for age by putting age in the model:
DSST score = intercept + slope1*vitamin D + slope2*age

2 predictors: age and vit D…

Different 3D view…

Fit a plane rather than a line… On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Equation of the “Best fit” plane…
DSST score = 53 + 0.0039*vitamin D (in 10 nmol/L) - 0.46*age (in years)
P-value for vitamin D >> .05
P-value for age < .0001
Thus, relationship with vitamin D was due to confounding by age!

Multiple Linear Regression
More than one predictor…
E(y) = α + β1*X + β2*W + β3*Z…
Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.

Review Problem 8
A medical journal article reported the following linear regression equation:
Cholesterol = 150 + 2*(age past 40) + 10*(gender: 1=male, 0=female)
Based on this model, what is the expected cholesterol for a 60 year-old man?
150
370
230
190
200
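The "held constant" interpretation can be checked with the equation from Review Problem 8: at any fixed age the male coefficient shifts the prediction by the same amount, and at fixed gender each extra year adds the same amount. A minimal sketch, with a made-up function name:

```python
def expected_cholesterol(age: int, male: int) -> int:
    """Reported model: 150 + 2*(age past 40) + 10*(1 if male, 0 if female)."""
    return 150 + 2 * (age - 40) + 10 * male


man_60 = expected_cholesterol(60, male=1)

# Gender effect with age held constant (same at every age).
gender_gap_at_50 = expected_cholesterol(50, 1) - expected_cholesterol(50, 0)
gender_gap_at_70 = expected_cholesterol(70, 1) - expected_cholesterol(70, 0)

# Age effect with gender held constant.
per_year_for_women = expected_cholesterol(61, 0) - expected_cholesterol(60, 0)

print(man_60, gender_gap_at_50, gender_gap_at_70, per_year_for_women)
```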

A t-test is linear regression!
Divide vitamin D into two groups:
Insufficient vitamin D (<50 nmol/L)
Sufficient vitamin D (>=50 nmol/L), reference group
We can evaluate these data with a t-test or a linear regression…

As a linear regression…
Intercept represents the mean value in the sufficient group. Slope represents the difference in means between the groups. Difference is significant.

Variable    Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   40.07407             1.47511          27.17     <.0001
insuff      -7.53060             2.17493          -3.46     0.0008
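To see why the intercept equals the reference-group mean and the slope equals the difference in means, one can fit a least squares line to a 0/1 predictor. The eight DSST scores below are invented for illustration (they are not the study data), and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least squares (intercept, slope) for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# insuff = 1 for insufficient vitamin D, 0 for sufficient (reference group).
insuff = [0, 0, 0, 0, 1, 1, 1, 1]
dsst = [42, 38, 41, 39, 33, 31, 35, 29]   # made-up scores

intercept, slope = fit_line(insuff, dsst)

mean_sufficient = sum(d for d, g in zip(dsst, insuff) if g == 0) / 4
mean_insufficient = sum(d for d, g in zip(dsst, insuff) if g == 1) / 4

print(intercept, slope, mean_sufficient, mean_insufficient - mean_sufficient)
```

The intercept reproduces the sufficient-group mean and the slope reproduces the difference between the two group means, which is the quantity a two-sample t-test evaluates.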

ANOVA is linear regression!
Divide vitamin D into three groups:
Deficient (<25 nmol/L)
Insufficient (>=25 and <50 nmol/L)
Sufficient (>=50 nmol/L), reference group
DSST = α (= value for sufficient) + βinsufficient*(1 if insufficient) + βdeficient*(1 if deficient)
This is called “dummy coding”, where multiple binary variables are created to represent being in each category (or not) of a categorical variable.

The picture… [Figure: Sufficient vs. Insufficient; Sufficient vs. Deficient]

Results…
Parameter Estimates

Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1    40.07407             1.47817          27.11     <.0001
deficient      1    -9.87407             3.73950          -2.64     0.0096
insufficient   1    -6.87963             2.33719          -2.94     0.0041

Interpretation:
The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.88 points lower than the reference (sufficient) group.
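A short sketch of dummy coding with the cut-points and estimates reported above. Each subject gets two 0/1 indicators, with the sufficient group as the reference (both indicators 0), and the fitted equation then returns that subject's group mean. The function names are illustrative.

```python
def dummy_code(vitamin_d: float):
    """Return (deficient, insufficient) indicators; sufficient is the reference."""
    deficient = 1 if vitamin_d < 25 else 0
    insufficient = 1 if 25 <= vitamin_d < 50 else 0
    return deficient, insufficient


def expected_dsst(vitamin_d: float) -> float:
    """Group mean implied by the reported parameter estimates."""
    deficient, insufficient = dummy_code(vitamin_d)
    return 40.07407 - 9.87407 * deficient - 6.87963 * insufficient


for level in (10, 30, 80):   # one example value from each category
    print(level, dummy_code(level), round(expected_dsst(level), 2))
```

Only two indicators are needed for three categories, because the reference group is represented by both being zero.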

Functions of multivariate analysis:
Control for confounders
Test for interactions between predictors (effect modification)
Improve predictions

Other types of multivariate regression
Multiple linear regression is for normally distributed outcomes
Logistic regression is for binary outcomes
Cox proportional hazards regression is used when time-to-event is the outcome

Common multivariate regression models.

Outcome (dependent variable): Continuous
  Example outcome variable: Blood pressure
  Appropriate multivariate regression model: Linear regression
  Example equation: blood pressure (mmHg) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  What do the coefficients give you? Slopes: how much the outcome variable increases for every 1-unit increase in each predictor.

Outcome (dependent variable): Binary
  Example outcome variable: High blood pressure (yes/no)
  Appropriate multivariate regression model: Logistic regression
  Example equation: ln (odds of high blood pressure) = α + the same predictor terms
  What do the coefficients give you? Odds ratios (after exponentiating each coefficient): how much the odds of the outcome increase for every 1-unit increase in each predictor.

Outcome (dependent variable): Time-to-event
  Example outcome variable: Time-to-death
  Appropriate multivariate regression model: Cox regression
  Example equation: ln (rate of death) = ln (baseline rate) + the same predictor terms
  What do the coefficients give you? Hazard ratios (after exponentiating each coefficient): how much the rate of the outcome increases for every 1-unit increase in each predictor.
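Because logistic and Cox models are linear on the log scale, a coefficient becomes an odds ratio or hazard ratio only after exponentiation. The sketch below uses a made-up coefficient, chosen so that the odds ratio comes out at 1.5; it is not an estimate from any study in these slides.

```python
import math


def ratio_from_coefficient(coefficient: float, units: float = 1.0) -> float:
    """Odds ratio (logistic) or hazard ratio (Cox) for a `units`-unit increase."""
    return math.exp(coefficient * units)


beta_smoker = math.log(1.5)   # hypothetical log odds ratio for ever-smokers

one_unit = ratio_from_coefficient(beta_smoker)       # ratio for a 1-unit increase
two_units = ratio_from_coefficient(beta_smoker, 2)   # ratios multiply across units

print(round(one_unit, 2), round(two_units, 2))
```

A coefficient of 0 corresponds to a ratio of 1, that is, no association.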

Multivariate regression pitfalls
Multicollinearity
Residual confounding
Overfitting

Multicollinearity
Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model. Because the model cannot separate their effects, their coefficient estimates become unstable and their standard errors inflate; in effect they cancel each other out and can make the model uninterpretable. Model building and diagnostics are tricky business!

Residual confounding
You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless variables are measured with zero error (which is usually impossible). If measurement error is high, residual confounding can produce statistically significant effects of moderate size.

Residual confounding: example Hypothetical Example: In a case-control study of lung cancer, researchers identified a link between alcohol drinking and cancer in smokers only. The OR was 1.3 for 1-2 drinks per day (compared with none) and 1.5 for 3+ drinks per day. Though the authors adjusted for number of cigarettes smoked per day in multivariate (logistic) regression, we cannot rule out residual confounding by level of smoking (which may be tightly linked to alcohol drinking).

Overfitting In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.

Overfitting: class data example
I asked SAS to automatically find predictors of optimism in our class dataset. Here’s the resulting linear regression model:

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   11.80175             2.98341          11.96067     15.65     0.0019
exercise    -0.29106             0.09798          6.74569      8.83      0.0117
sleep       -1.91592             0.39494          17.98818     23.53     0.0004
obama       1.73993              0.24352          39.01944     51.05     <.0001
Clinton     -0.83128             0.17066          18.13489     23.73     0.0004
mathLove    0.45653              0.10668          13.99925     18.32     0.0011

Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly significant!) and high ratings for Obama and high love of math are positively related to optimism (highly significant!).

If something seems too good to be true…

Clinton, univariate:
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept   1    5.43688              2.13476          2.55      0.0188
Clinton     Clinton     1    0.24973              0.27111          0.92      0.3675

Sleep, univariate:
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept   1    8.30817              4.36984          1.90      0.0711
sleep       sleep       1    -0.14484             0.65451          -0.22     0.8270

Exercise, univariate:
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept   1    6.65189              0.89153          7.46      <.0001
exercise    exercise    1    0.19161              0.20709          0.93      0.3658

More univariate models…

Obama, univariate:
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept   1    0.82107              2.43137          0.34      0.7389
obama       obama       1    0.87276              0.31973          2.73      0.0126
Compare with multivariate result; p<.0001

Love of Math, univariate:
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept   1    3.70270              1.25302          2.96      0.0076
mathLove    mathLove    1    0.59459              0.19225          3.09      0.0055
Compare with multivariate result; p=.0011

Overfitting
Rule of thumb: You need at least 10 subjects for each additional predictor variable in the multivariate regression model.
Pure noise variables still produce good R² values if the model is overfitted. [Figure: the distribution of R² values from a series of simulated regression models containing only noise variables.] (Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).)

Overfitting example, class data…
PREDICTORS OF EXERCISE HOURS PER WEEK (multivariate model):

Variable      Beta        p-value
Intercept     -14.74660   0.0257
Coffee        0.23441     0.0004
wakeup        -0.51383    0.0715
engSAT        -0.01025    0.0168
mathSAT       0.03064     0.0005
writingLove   0.88753     <.0001
sleep         0.37459     0.0490

R-Square = 0.8192
N=20, 7 parameters in the model!

Univariate models…

Variable      Beta          p-value
Coffee        0.05916       0.3990
Wakeup        -0.06587      0.8648
MathSAT       -0.00021368   0.9731
EngSAT        -0.01019      0.1265
Sleep         -0.41185      0.4522
WritingLove   0.38961       0.0279
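To put N=20 with six predictors in perspective, the sketch below applies the 10-subjects-per-predictor rule of thumb and a standard result for ordinary least squares: when every predictor is pure noise and the errors are normal, the expected R² is p/(n-1) for p predictors and n subjects. Both function names are made up, and the result describes the average over many noise-only datasets, not this particular class dataset.

```python
def subjects_needed(n_predictors: int, per_predictor: int = 10) -> int:
    """Minimum sample size under the 10-subjects-per-predictor rule of thumb."""
    return per_predictor * n_predictors


def expected_noise_r2(n_subjects: int, n_predictors: int) -> float:
    """Expected R-squared when all predictors are unrelated to the outcome."""
    return n_predictors / (n_subjects - 1)


n_subjects, n_predictors = 20, 6   # intercept plus 6 predictors = 7 parameters

print(subjects_needed(n_predictors))
print(round(expected_noise_r2(n_subjects, n_predictors), 3))
```

With 20 subjects and six predictors, noise alone would be expected to "explain" roughly a third of the variance, so a high R² in such a model is weak evidence of real predictive ability.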