Chapter 15 Regression Modeling (Modeling interesting relationships between variables)

Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y versus X, with correlation coefficients r = -1, -0.6, and 0 in the top row and r = +1, +0.3, and 0 in the bottom row]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Linear and Non-Linear Relationships
[Figure: four scatter plots of Y versus X, two showing linear relationships and two showing curvilinear relationships]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Prediction If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? It's the same idea as a conditional probability.)

Regression equation… Expected value of Y at a given level of X: E(Yi | xi) = α + β·xi

Predicted value for an individual… Yi = α + β·xi + random errori, where the random errori follows a normal distribution and the α + β·xi part is fixed, exactly on the line.

Assumptions (or the fine print) Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The random errors are independent

The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

Population Linear Regression Model
[Figure: scatter plot with the population regression line, annotating one observed value Yi as the point on the line E(Yi) plus a random error εi]

Sample Linear Regression Model
[Figure: scatter plot with the fitted (sample) regression line, annotating an observed value as the fitted value Ŷi plus a residual, along with unsampled observations off the line]

Example: Cognitive function and vitamin D Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST). 1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

Distribution of vitamin D: mean = 63 nmol/L; standard deviation = 33 nmol/L

Distribution of DSST: normally distributed; mean = 28 points; standard deviation = 10 points

Four hypothetical datasets I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
0 points per 10 nmol/L
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L

The “Best fit” line Regression equation: E(Yi) = 28 + 0*vit Di (in 10 nmol/L)

The “Best fit” line Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is! Regression equation: E(Yi) = 26 + 0.5*vit Di (in 10 nmol/L)

The “Best fit” line Regression equation: E(Yi) = 22 + 1.0*vit Di (in 10 nmol/L)

The “Best fit” line Regression equation: E(Yi) = 20 + 1.5*vit Di (in 10 nmol/L) Note: all the lines go through the point (63, 28), the means of vitamin D and DSST!

How to estimate the linear relationship? That is, we want to determine the slope (m) and intercept (B) of the line Y = mX + B.
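As a concrete illustration, here is a minimal sketch of least-squares estimation in Python (numpy). The vitd and dsst arrays are simulated stand-ins for the hypothetical example data, not the actual course dataset:

```python
import numpy as np

# Simulated stand-ins: vitamin D in 10-nmol/L units (mean 63, SD 33 nmol/L)
# and DSST generated with a true slope of 0.5 points per 10 nmol/L
rng = np.random.default_rng(0)
vitd = rng.normal(6.3, 3.3, 100)
dsst = 26 + 0.5 * vitd + rng.normal(0, 10, 100)

# Closed-form least-squares estimates: slope = cov(X, Y) / var(X),
# intercept chosen so the line passes through (mean X, mean Y)
slope = np.cov(vitd, dsst, ddof=1)[0, 1] / np.var(vitd, ddof=1)
intercept = dsst.mean() - slope * vitd.mean()
print(f"E(Y) = {intercept:.1f} + {slope:.2f} * vitD")
```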

Hypothesis testing H0: the slope is zero (i.e., X has no linear relationship with Y).

Confidence intervals 95% CI for the slope: estimated slope ± t(0.975, n-2) × SE(slope)
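Continuing the simulated vitd and dsst arrays from the sketch above, scipy's linregress returns the slope, its standard error, and the p-value for H0: slope = 0, from which the 95% CI follows:

```python
from scipy import stats

# Fit; the result carries slope, intercept, rvalue, pvalue, and stderr
fit = stats.linregress(vitd, dsst)
print(f"slope = {fit.slope:.2f}, p-value for H0 (slope = 0): {fit.pvalue:.4f}")

# 95% CI: slope +/- t(0.975, n-2) * SE(slope)
t_crit = stats.t.ppf(0.975, len(vitd) - 2)
print(f"95% CI: ({fit.slope - t_crit * fit.stderr:.2f}, "
      f"{fit.slope + t_crit * fit.stderr:.2f})")
```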

Qualitative checks on plausibility of model assumptions Examining residuals, we can assess:
whether the relationship between the variables is linear
whether the variance of the error term is the same for all values of X (homoscedasticity)
whether the error term is normally distributed
whether the error terms are independent
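A quick sketch of the first two checks, continuing the simulated data and fit from the earlier sketches (matplotlib assumed available):

```python
import matplotlib.pyplot as plt

# Residual = observed - predicted, using the fitted line from above
predicted = fit.intercept + fit.slope * vitd
residuals = dsst - predicted

# Residuals vs. X: curvature suggests non-linearity; a funnel shape
# suggests non-constant variance (heteroscedasticity)
plt.scatter(vitd, residuals)
plt.axhline(0, color="gray")
plt.xlabel("vitamin D (10 nmol/L)")
plt.ylabel("residual")
plt.show()
```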

Residual = observed - predicted
[Figure: for the observation at X = 95 nmol/L, the residual is the vertical distance between the observed DSST score and the fitted line]

Residual Analysis for Linearity
[Figure: two data plots with their residual plots against x; a curved pattern in the residuals indicates the relationship is not linear, while patternless residuals indicate it is linear]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Residual Analysis for Homoscedasticity
[Figure: two data plots with their residual plots against x; a fan-shaped spread in the residuals indicates non-constant variance, while an even spread indicates constant variance]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Residual Analysis for Independence
[Figure: residuals plotted against time; systematic patterns indicate the errors are not independent, while a patternless plot indicates independence]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Residual plot, dataset 4

Multiple linear regression (i.e., multivariable linear regression) What if age is a confounder here? Older men have lower vitamin D; older men have poorer cognition. “Adjust or control” for age by putting age in the model: DSST score = intercept + slope1 * vitamin D + slope2 * age
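A hedged sketch of this adjustment with statsmodels; the data are simulated so that age drives both vitamin D and DSST, mimicking the confounding story above:

```python
import numpy as np
import statsmodels.api as sm

# Simulated confounding: older men have lower vitamin D and lower DSST,
# and DSST has no direct dependence on vitamin D at all
rng = np.random.default_rng(1)
age = rng.normal(60, 8, 100)
vitd = 12 - 0.1 * age + rng.normal(0, 2, 100)   # 10-nmol/L units
dsst = 55 - 0.46 * age + rng.normal(0, 8, 100)

# Fit the plane: DSST = intercept + slope1*vitD + slope2*age
X = sm.add_constant(np.column_stack([vitd, age]))
fit2 = sm.OLS(dsst, X).fit()
print(fit2.params)    # vitamin D slope near zero once age is in the model
print(fit2.pvalues)   # vitamin D non-significant; age highly significant
```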

2 predictors: age and vitamin D…

Fit a plane rather than a line… On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Equation of the “Best fit” plane… DSST score = 53 + 0.0039*vitamin D (in 10 nmol/L) - 0.46*age (in years) P-value for vitamin D = 0.75; P-value for age < 0.0001 Thus, the apparent relationship with vitamin D was due to confounding by age!

Controlling or adjusting for more variables… E(Y) = α + β1*X + β2*W + β3*Z + … Language: Y is the outcome; X, W, and Z are the predictors; X is the primary predictor; W and Z are covariates.

Multicollinearity: when two predictors are strongly correlated with each other (e.g., age and weight in a study of children), the model cannot cleanly separate their effects, and the coefficient estimates become unstable.

How would you model a non-linear relationship?
[Figure: four scatter plots of Y versus X, two showing linear relationships and two showing curvilinear relationships]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Modeling non-linear relationships Change the form of the model from a straight line to whatever you want.
Quadratic: Yi = β0 + β1xi + β2xi² + ϵi
Piecewise linear: Yi = β0 + β1xi + β2·1{xi > 1}·(xi - 1) + β3·1{xi > 2}·(xi - 2) + ϵi
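Both forms can still be fit by ordinary least squares once the extra columns are constructed; a brief sketch with simulated data (statsmodels assumed):

```python
import numpy as np
import statsmodels.api as sm

# Simulated predictor/outcome pair, just to show how the columns are built
rng = np.random.default_rng(2)
x = rng.uniform(0, 3, 100)
y = 1 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.5, 100)

# Quadratic: add a squared column, then fit by ordinary least squares as usual
quad_fit = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

# Piecewise linear with knots at 1 and 2: the slope is allowed to change at each knot
knot1 = np.where(x > 1, x - 1, 0.0)   # 1{x > 1} * (x - 1)
knot2 = np.where(x > 2, x - 2, 0.0)   # 1{x > 2} * (x - 2)
piece_fit = sm.OLS(y, sm.add_constant(np.column_stack([x, knot1, knot2]))).fit()
```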

What if I have a predictor that is… Binary? Pick a level to be zero, and the other to be one. Model as usual. Yi = β0 + β1xi + ϵi Categorical with 3 levels?

Categorical predictors Suppose I have a variable such as race (black, white, other). Create three binary variables:
Black (yes = 1, no = 0)
White (yes = 1, no = 0)
Other (yes = 1, no = 0)
This is called overparameterized: because if black and white are ‘no’ then ‘other’ must be ‘yes,’ it is impossible to separately estimate the coefficients (slopes) for each of these. For example, these two parameterizations give identical predictions:
Intercept = 0, Slopeblack = 1, Slopewhite = 3, Slopeother = 0
Intercept = -10, Slopeblack = 11, Slopewhite = 13, Slopeother = 10
Solution: Exclude ‘other.’ Model: Yi = intercept + slopeblack * blacki + slopewhite * whitei + ϵi
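A small sketch of this dummy coding in pandas, using the slide's hypothetical race variable; dropping 'other' makes it the reference category:

```python
import pandas as pd

# Hypothetical race variable with three levels
df = pd.DataFrame({"race": ["black", "white", "other", "white", "black"]})

# One 0/1 indicator per level, then drop 'other' as the reference category
dummies = pd.get_dummies(df["race"]).drop(columns="other")
print(dummies)   # columns 'black' and 'white'; 'other' is absorbed into the intercept
```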

What if my outcome is binary? !@#$ (actually, it’s not too bad) Previous model: Yi is normally distributed with E(Yi) = intercept + slope * xi. New model: Yi is Bernoulli distributed with f(E(Yi)) = intercept + slope * xi, where f(·) is a function whose domain includes (0, 1) and whose codomain is all real numbers. For ‘logistic regression,’ f(·) is the logit function, log(p/(1-p)).

How to interpret results from logistic regression The estimated ‘slopes’ don’t mean a whole lot. You can interpret them as the rate of increase in the log odds of [outcome] for each unit increase of the predictor. Not that helpful. Recall that log(p/(1-p)) = intercept + slope * xi, where p = E(Yi). If you exponentiate both sides you get p/(1-p) = exp(intercept + slope * xi). Note that the left-hand side is the ‘odds’ that Yi = 1. The odds ratio (for xi = x+1 vs. xi = x) is exp(intercept + slope * (x+1)) / exp(intercept + slope * x) = exp(slope). Thus, exponentiating the coefficient gives the odds ratio. If the outcome is mortality, the predictor is age in years, and exp(slope) = 1.2, then the odds of mortality are 20% greater (on average) for each extra year of age.
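A sketch of fitting a logistic regression and exponentiating the slope (statsmodels assumed; the mortality-by-age data are simulated, with a true log-odds slope of 0.18 per year):

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary outcome: log odds of mortality rise by 0.18 per year of age
rng = np.random.default_rng(3)
age = rng.normal(60, 8, 500)
died = rng.binomial(1, 1 / (1 + np.exp(-(-12 + 0.18 * age))))

X = sm.add_constant(age)
logit_fit = sm.Logit(died, X).fit()
print(np.exp(logit_fit.params[1]))   # exponentiated slope = odds ratio per extra year
```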

Odds and odds ratios are not intuitive; people almost always interpret them as if they were relative risks.
Odds ratio: [p1/(1-p1)] / [p2/(1-p2)]
Relative risk: p1 / p2
Wouldn’t it be nice if we could model the relationship in such a way that would provide relative risks instead? Use a different ‘link’ function: log(E(Yi)) = intercept + slope * xi. Then, the relative risk (for xi = x+1 vs. xi = x) is exp(intercept + slope * (x+1)) / exp(intercept + slope * x) = exp(slope).

What distribution? When using the logit link function, the Bernoulli distribution is typically used (logistic regression). When using the log link function, the Poisson distribution is typically used. But… the outcome variable is binary; it certainly isn’t Poisson. If I use a Poisson distribution (mean = variance), I will overestimate my variance… Unless… It has been shown that a Poisson model is very useful for a binary outcome as long as you estimate the variance separately from the mean, using a ‘robust estimator.’ The model is referred to as ‘modified Poisson regression’ or ‘Poisson regression with robust error estimates.’
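A sketch of modified Poisson regression on the same simulated mortality data as the logistic sketch above, using statsmodels' robust (sandwich) covariance option:

```python
import numpy as np
import statsmodels.api as sm

# Reuses the simulated age/died arrays from the logistic sketch above
pois_fit = sm.GLM(died, sm.add_constant(age),
                  family=sm.families.Poisson()).fit(cov_type="HC0")
print(np.exp(pois_fit.params[1]))   # exponentiated slope = relative risk per extra year
print(pois_fit.bse)                 # robust standard errors, not naive Poisson ones
```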

Another look
Simple linear regression: Yi ~ N(β0 + β1xi, σ²)
Logistic regression: Yi ~ BER( exp(β0 + β1xi) / [1 + exp(β0 + β1xi)] )
Poisson regression: Yi ~ POI( exp(β0 + β1xi) )
If outcomes are actually binary, be sure to estimate the mean and variance separately.