Regression analysis: linear and logistic

Presentation transcript:

Regression analysis: linear and logistic

Linear correlation and linear regression

Example: class data

New concept: Covariance. The sample covariance of X and Y: cov(X,Y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)

Interpreting Covariance Covariance between two random variables: cov(X,Y) > 0: X and Y tend to move in the same direction; cov(X,Y) < 0: X and Y tend to move in opposite directions; cov(X,Y) = 0: X and Y have no linear relationship (they are uncorrelated; note that zero covariance does not by itself imply independence)

Correlation coefficient Pearson’s correlation coefficient is standardized covariance (unitless): r = cov(X,Y) / (SD(X) * SD(Y))

Correlation Measures the relative strength of the linear relationship between two variables Unit-less Ranges between -1 and 1 The closer to -1, the stronger the negative linear relationship The closer to 1, the stronger the positive linear relationship The closer to 0, the weaker the linear relationship
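As a concrete check of these definitions, here is a minimal Python sketch (the paired numbers are made up) that computes the sample covariance by hand, standardizes it into Pearson's r, and confirms the result against numpy's built-ins:

```python
import numpy as np

# Hypothetical paired measurements of two continuous variables.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.5, 3.0, 4.5, 6.5, 8.0])

n = len(x)
# Sample covariance: average co-movement of X and Y around their means.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Pearson's r: covariance standardized by the two standard deviations,
# which makes it unitless and bounded between -1 and 1.
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, r)
# Cross-check against numpy's built-ins:
print(np.cov(x, y, ddof=1)[0, 1], np.corrcoef(x, y)[0, 1])
```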

Scatter Plots of Data with Various Correlation Coefficients [Six scatter plots of Y versus X, illustrating r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0.] ** Next 4 slides from “Statistics for Managers,” 4th Edition, Prentice-Hall, 2004

Linear Correlation [Scatter plots contrasting linear relationships with curvilinear relationships.]

Linear Correlation [Scatter plots contrasting strong relationships with weak relationships.]

Linear Correlation [Scatter plots showing no relationship.]

Review Problem 1 What’s a good guess for the Pearson’s correlation coefficient (r) for this scatter plot? [scatter plot not reproduced] -1.0, +1.0, -.5, -.1


Linear regression In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other is the dependent (= outcome) variable (Y).

What is “Linear”? Remember this: Y = mX + B, where m is the slope and B is the intercept.

What’s Slope? A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Simple linear regression The linear regression model: Hours of exercise/week = 2.0 + 0.39*(MS writing enjoyment score), where 2.0 is the intercept and 0.39 is the slope.

Simple linear regression Wake Time = 8.7 - 0.2*Hours of exercise/week Every additional hour of weekly exercise costs you about 12 minutes of sleep in the morning.
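A minimal sketch of how such a line could be fit in practice; the exercise and wake-time numbers below are hypothetical, chosen only so the fit lands near the slide's slope of about -0.2:

```python
import numpy as np

# Hypothetical data: weekly exercise hours and wake-up time (24h clock).
exercise = np.array([0, 1, 2, 3, 5, 6, 8, 10], dtype=float)
wake_time = np.array([8.8, 8.6, 8.2, 8.1, 7.7, 7.4, 7.1, 6.8])

# Least-squares fit of wake_time = b0 + b1 * exercise.
# np.polyfit returns coefficients highest power first: [slope, intercept].
b1, b0 = np.polyfit(exercise, wake_time, deg=1)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")

# A slope near -0.2 reproduces the slide's story: each extra hour of
# weekly exercise predicts waking about 12 minutes earlier.
print(f"predicted wake time at 4 h/wk: {b0 + b1 * 4:.2f}")
```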

EXAMPLE The distribution of baby weights at Stanford ~ N(3400, 360000), i.e., mean 3400 grams and standard deviation 600 grams. Your “best guess” at a random baby’s weight, given no information about the baby, is what? 3400 grams But, what if you have relevant information? Can you make a better guess?

Predictor variable X=gestation time Assume that babies that gestate for longer are born heavier, all other things being equal. Pretend (at least for the purposes of this example) that this relationship is linear. Example: suppose a one-week increase in gestation, on average, leads to a 100-gram increase in birth-weight

Y depends on X: Y = birth-weight (g), X = gestation time (weeks). The best-fit line is chosen such that the sum of the squared (why squared?) distances of the points (the Yi’s) from the line is minimized.
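Written out (a reconstruction of the slide's formula, in the usual notation), the least-squares criterion and its closed-form solution are:

```latex
% Least-squares criterion for the line \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X.
% Squaring makes the criterion differentiable and penalizes big misses,
% giving a unique closed-form minimum:
\mathrm{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl( Y_i - \beta_0 - \beta_1 X_i \bigr)^2
% Setting the partial derivatives to zero yields:
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
             = \frac{\widehat{\mathrm{cov}}(X, Y)}{\widehat{\mathrm{var}}(X)},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```

Note how the slope estimate ties back to the covariance slides: it is just the covariance of X and Y rescaled by the variance of X.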

Prediction A new baby is born that had gestated for just 30 weeks. What’s your best guess at the birth-weight? Are you still best off guessing 3400? NO!

At 30 weeks… [Scatter plot: Y = birth-weight (g) versus X = gestation time (weeks), highlighting X = 30 and Y = 3000.]

At 30 weeks… [The same plot, labeling the point (x, y) = (30, 3000).]

At 30 weeks… The babies that gestate for 30 weeks appear to center around a weight of 3000 grams. In math-speak: E(Y | X = 30 weeks) = 3000 grams. Note the conditional expectation.

But… Note that not every Y-value (Yi) sits on the line; there’s variability. Yi = 3000 + random errori. In fact, babies that gestate for 30 weeks have birth-weights that center at 3000 grams, but vary around 3000 with some variance σ². Approximately what distribution do birth-weights follow? Normal: Y | X = 30 weeks ~ N(3000, σ²)

And, if X=20, 30, or 40… [Plot: Y = birth-weight (g) versus X = gestation time (weeks).]

If X=20, 30, or 40… Y | X = 20 weeks ~ N(2000, σ²); Y | X = 30 weeks ~ N(3000, σ²); Y | X = 40 weeks ~ N(4000, σ²). [Plot: Y = baby weights (g) versus X = gestation times (weeks), showing the three conditional distributions centered on the regression line.]

The standard error of Y given X (Sy/x) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X. [Plot: Y = baby weights (g) versus X = gestation times (weeks), showing the same spread Sy/x around the line at 20, 30, and 40 weeks.]

Linear Regression Model Y’s are modeled… Yi = 100*Xi + random errori. The systematic part (100*Xi) is fixed, exactly on the line; the random error follows a normal distribution.
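A small simulation of this model (with an assumed error SD of 300 g, a number not given on the slide) shows the fitted slope landing near the true 100 g/week:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the slide's model: gestation weeks X, birth weight Y.
# Systematic part (fixed, on the line): 100 grams per week.
# Random part: normal error with assumed spread sigma = 300 g.
n, sigma = 200, 300.0
x = rng.uniform(20, 40, size=n)            # gestation times in weeks
y = 100.0 * x + rng.normal(0.0, sigma, size=n)

# Refit the line; the estimated slope should land near 100.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.1f} g/week, intercept = {intercept:.1f} g")
```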

Review Problem 2 Using the regression equation E(Y | X) = 100 grams/week * (X weeks), what is the expected weight of a baby born at 22 weeks? 2000g, 2100g, 2200g, 2300g, 2400g

Review Problem 2 (answer) E(Y | X = 22 weeks) = 100 grams/week * 22 weeks = 2200g

Review Problem 3 Our model predicts that: All babies born at 22 weeks will weigh 2200 grams. Babies born at 22 weeks will have a mean weight of 2200 grams with some variation. Both of the above. None of the above.

Review Problem 3 (answer) Our model predicts that babies born at 22 weeks will have a mean weight of 2200 grams, with some variation around that mean.

Assumptions (or the fine print) Linear regression assumes that:
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)

Non-homogeneous variance [Scatter plot where the spread of Y = birth-weight (100g) increases with X = gestation time (weeks).]

Residual Residual = observed value - predicted value. At 33.5 weeks gestation, the predicted baby weight is 3350 grams. This baby was actually 3380 grams, so his residual is +30 grams.

Review Problem 4 A medical journal article reported the following linear regression equation: Cholesterol = 150 + 2*(age past 40) Based on this model, what is the expected cholesterol for a 60 year old? 150 370 230 190 200

Review Problem 4 (answer) Cholesterol = 150 + 2*(60 - 40) = 150 + 40 = 190

Review Problem 5 If a particular 60 year old in your study sample had a cholesterol of 250, what is his/her residual? +50 -50 +60 -60

Review Problem 5 (answer) Residual = observed - predicted = 250 - 190 = +60

A t-test is linear regression! In our class, average drinking among the Democrats (politics 6-10, n=15) was 3.2 drinks/week; among Republicans (n=5), it was 0.4 drinks/week. We can evaluate these data with a t-test (assuming alcohol consumption is normally distributed):

As a linear regression… alcohol = 3.2 - 2.8*(1=Republican; 0=not). The intercept (3.2) is the mean for Democrats; the slope (-2.8) is the difference in group means.
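To see the equivalence numerically, here is a sketch using simulated drinks-per-week data (hypothetical values drawn to match the class means and an assumed SD of 1.0); the slope of the 0/1 regression reproduces the two-sample t-test:

```python
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical drinks/week: 15 Democrats (mean ~3.2), 5 Republicans (~0.4).
dem = rng.normal(3.2, 1.0, size=15)
rep = rng.normal(0.4, 1.0, size=5)

# Two-sample t-test, ordered (rep, dem) to match the regression's sign.
t, p = st.ttest_ind(rep, dem, equal_var=True)

# Same comparison as a regression of alcohol on a 0/1 Republican indicator.
republican = np.concatenate([np.zeros(15), np.ones(5)])
alcohol = np.concatenate([dem, rep])
fit = sm.OLS(alcohol, sm.add_constant(republican)).fit()

# The slope equals the difference in group means (rep - dem), and its
# t statistic and p-value match the two-sample t-test exactly.
print(t, p)
print(fit.params[1], fit.tvalues[1], fit.pvalues[1])
```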

ANOVA is linear regression! A categorical variable with more than two groups, e.g. groups 1, 2, and 3 (mutually exclusive): y = α (= mean of group 1) + β₁*(1 if in group 2) + β₂*(1 if in group 3). This is called “dummy coding,” where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
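A sketch of dummy coding for three groups (simulated data with assumed group effects of 0, 1.5, and 3.0); the intercept estimates the reference-group mean, and the overall F test for the two dummies reproduces one-way ANOVA:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical outcome for three mutually exclusive groups of 10.
group = np.repeat([1, 2, 3], 10)
y = (rng.normal(0, 1, size=30)
     + np.where(group == 2, 1.5, 0.0)
     + np.where(group == 3, 3.0, 0.0))

# Dummy coding: group 1 is the reference; two 0/1 indicators for groups 2, 3.
X = np.column_stack([
    (group == 2).astype(float),   # beta1: group 2 vs group 1
    (group == 3).astype(float),   # beta2: group 3 vs group 1
])
fit = sm.OLS(y, sm.add_constant(X)).fit()

# Intercept ~ group-1 mean; slopes ~ differences from group 1.
print(fit.params)
# The joint F test for the dummies is the one-way ANOVA F test.
print(fit.fvalue, fit.f_pvalue)
```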

Multiple Linear Regression More than one predictor: y = α + β₁*X + β₂*W + β₃*Z. Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.

Functions of multivariate analysis: control for confounders; test for interactions between predictors (effect modification); improve predictions.
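As an illustration of the first point, controlling for confounders, here is a simulation (all numbers assumed) in which age drives both the exposure and the outcome; the crude slope is badly biased, while the adjusted slope recovers the true null effect:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500

# Simulated confounding: age drives both the exposure and the outcome.
age = rng.normal(50, 10, size=n)
exposure = 0.5 * age + rng.normal(0, 5, size=n)
outcome = 2.0 * age + rng.normal(0, 5, size=n)   # true exposure effect = 0

# Unadjusted model: the exposure looks strongly "associated".
crude = sm.OLS(outcome, sm.add_constant(exposure)).fit()

# Adjusted model: holding age constant, the exposure effect vanishes.
adjusted = sm.OLS(outcome,
                  sm.add_constant(np.column_stack([exposure, age]))).fit()

print(f"crude slope:    {crude.params[1]:.2f}")
print(f"adjusted slope: {adjusted.params[1]:.2f} (near the true value, 0)")
```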

Review Problem 6 A medical journal article reported the following linear regression equation: Cholesterol = 150 + 2*(age past 40) + 10*(gender: 1=male, 0=female) Based on this model, what is the expected cholesterol for a 60 year-old man? 150 370 230 190 200

Review Problem 6 (answer) Cholesterol = 150 + 2*(60 - 40) + 10*(1) = 150 + 40 + 10 = 200

Table 3. Relationship of combinations of macronutrients to BP (SBP and DBP) for 11,342 men, years 1 through 6 of MRFIT: multiple linear regression analyses. Entries are linear regression coefficients (z scores).

Variable                               Model 1 SBP      Model 1 DBP      Model 2 SBP      Model 2 DBP
Total protein, % kcal                  -0.0346 (-1.10)  -0.0568 (-3.17)  -0.0344 (-1.10)  -0.0489 (-2.77)
Cholesterol, mg/1000 kcal               0.0039 (2.46)    0.0032 (3.51)    0.0034 (2.14)    0.0029 (3.19)
Saturated fatty acids, % kcal           0.0755 (1.45)    0.0848 (2.86)    0.0786 (1.73)    0.1051 (4.08)
Polyunsaturated fatty acids, % kcal     0.0100 (0.24)   -0.0284 (-1.22)   0.0029 (0.08)   -0.0230 (-1.07)
Starch, % kcal                          0.1366 (4.98)    0.0675 (4.34)    0.1149 (4.65)    0.0608 (4.35)
Other simple carbohydrates, % kcal      0.0327 (1.35)    0.0006 (0.04)    --               --

Models controlled for baseline age, race (black, nonblack), education, smoking, and serum cholesterol. Circulation. 1996 Nov 15;94(10):2417-23.

From the table, total protein (% kcal): SBP -0.0346 (z = -1.10); DBP -0.0568 (z = -3.17). In math terms: SBP = α - 0.0346*(% protein) + β_age*(Age) + … Translation: controlling for the other variables in the model (as well as baseline age, race, etc.), every 1% increase in the percent of calories coming from protein is associated with a 0.0346 mmHg decrease in systolic BP (not significant). Also (from a separate model), every 1% increase in the percent of calories coming from protein is associated with a 0.0568 mmHg decrease in diastolic BP (significant): DBP = α - 0.0568*(% protein) + β_age*(Age) + …

Other types of multivariate regression Multiple linear regression is for normally distributed outcomes Logistic regression is for binary outcomes Cox proportional hazards regression is used when time-to-event is the outcome

Overfitting In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.

Example (hypothetical): In a randomized trial of an intervention to speed bone healing after fracture, researchers built a multivariate regression model to predict time to recovery in a subset of women (n=12). An automatic selection procedure came up with a model containing age, weight, use of oral contraceptives, and treatment status; the predictors were all highly significant and the model had a nearly perfect R-square of 99.5%. This is likely an example of overfitting: the researchers have fit a model to exactly their particular sample of data, but it will likely have no predictive ability in a new sample.

Rule of thumb: you need at least 10 subjects for each predictor variable in the multivariate regression model.

Overfitting Pure noise variables still produce good R² values if the model is overfitted. [Figure: the distribution of R² values from a series of simulated regression models containing only noise variables.] (Figure 1 from: Babyak MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).)
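The phenomenon is easy to reproduce. Here is a minimal simulation in the spirit of Babyak's figure, using the same n = 20 and 7 parameters as the class example below:

```python
import numpy as np

rng = np.random.default_rng(4)

# Regress pure noise on pure noise, many times over.
# 6 noise predictors + intercept = 7 parameters for n = 20 subjects.
n, k, reps = 20, 6, 1000
r2 = []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    y = rng.normal(size=n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2.append(1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean())))

# Even though nothing is real, R^2 is routinely sizable (median ~0.3 here).
print(f"median R^2 from pure noise: {np.median(r2):.2f}")
```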

Overfitting example, class data… PREDICTORS OF EXERCISE HOURS PER WEEK (multivariate model):

Variable       Beta        p-value
Intercept      -14.74660   0.0257
Coffee           0.23441   0.0004
wakeup          -0.51383   0.0715
engSAT          -0.01025   0.0168
mathSAT          0.03064   0.0005
writingLove      0.88753   <.0001
sleep            0.37459   0.0490

R-Square = 0.8192. N=20, 7 parameters in the model!

Univariate models…

Variable      Beta          p-value
Coffee         0.05916      0.3990
Wakeup        -0.06587      0.8648
MathSAT       -0.00021368   0.9731
EngSAT        -0.01019      0.1265
Sleep         -0.41185      0.4522
WritingLove    0.38961      0.0279

Logistic Regression

Example: Political party and alcohol… This association could also be analyzed with logistic regression: Republican (yes/no) becomes the binary outcome. Alcohol (continuous) becomes the predictor.

Example: Political party and drinking…

The logistic model… ln(p/(1 - p)) = α + β₁*X. The logit function is the log odds of the outcome.

The Logit Model The logit function (log odds) for individual i: ln(pᵢ/(1 - pᵢ)) = α + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + …, where e^α gives the baseline odds and β₁x₁ + β₂x₂ + … is the linear function of risk factors for individual i.

Review question 7 If X=.50, what is the logit (=log odds) of X? .50 1.0 2.0 -.50

Review question 7 (answer) With p = .50, the odds are .50/(1 - .50) = 1.0, so the logit (log odds) is ln(1.0) = 0.

Example: political party and drinking… Model: Log odds of being a Republican (outcome)= Intercept+ Weekly drinks (predictor) Fit the data in logistic regression using a computer…

Fitted logistic model: “Log odds” of being a Republican = 1.2 - 1.9*(drinks/week). The slope for drinking can be directly translated into an odds ratio: OR = e^(-1.9) ≈ 0.15. Interpretation: each additional drink per week decreases the odds of being a Republican by about 85% (95% CI for the OR: 0.021 to 1.003).
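A sketch of how such a model could be fit; the data here are simulated from the slide's fitted equation rather than the actual class data, so the estimates should land near the slide's values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Simulate drinks/week, then draw party from the slide's fitted model:
# log odds of Republican = 1.2 - 1.9 * drinks.
drinks = rng.uniform(0, 6, size=100)
p = 1 / (1 + np.exp(-(1.2 - 1.9 * drinks)))
republican = rng.binomial(1, p)

fit = sm.Logit(republican, sm.add_constant(drinks)).fit(disp=0)
print(fit.params)                  # intercept and slope on the log-odds scale
print(np.exp(fit.params[1]))       # OR per additional drink (~e^-1.9 = 0.15)
print(np.exp(fit.conf_int()[1]))   # 95% CI for the OR
```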

The Logit Model (again) logit(pᵢ) = ln(pᵢ/(1 - pᵢ)) = α + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + …, where e^α gives the baseline odds and β₁x₁ + β₂x₂ + … is the linear function of risk factors for individual i.

To get back to OR’s… exponentiate the regression coefficient: OR = e^β for a one-unit increase in the predictor.

“Adjusted” Odds Ratio Interpretation

Adjusted odds ratio, continuous predictor

Practical Interpretation The odds of disease increase multiplicatively by e^β for every one-unit increase in the exposure, controlling for other variables in the model.
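Written out, this multiplicative interpretation follows directly from the logit model; the algebra below uses the same α and β notation as the earlier slides:

```latex
% For the model \ln\bigl(p/(1-p)\bigr) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots,
% compare the odds at x_1 + 1 with the odds at x_1, other covariates fixed:
\frac{\mathrm{odds}(x_1 + 1)}{\mathrm{odds}(x_1)}
  = \frac{e^{\alpha + \beta_1 (x_1 + 1) + \beta_2 x_2 + \cdots}}
         {e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots}}
  = e^{\beta_1}
% More generally, a change of \Delta units multiplies the odds by e^{\beta_1 \Delta}.
```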

Practice interpreting the table from a case-control study:

Practice interpreting the table from a paper:

Review problem 8 In a cross-sectional study of heart disease in middle-aged men and women, 10% of men in the sample had prevalent heart disease compared with only 5% of women. After adjusting for age in multivariate logistic regression, the odds ratio for heart disease comparing males to females was 1.1 (95% confidence interval: 0.80—1.42). What conclusion can you draw? Being male increases your risk of heart disease. Age is a confounder of the relationship between gender and heart disease. There is a statistically significant association between gender and heart disease. The study had insufficient power to detect an effect.

Review problem 8 (answer) Age is a confounder of the relationship between gender and heart disease. The crude comparison (10% vs. 5%) suggests an association, but after age adjustment the odds ratio falls to 1.1 with a confidence interval (0.80 to 1.42) that includes 1.0, so the apparent gender difference is largely explained by age.

Homework: Continue reading textbook; Problem Set 8; Journal article.