Scatter Plots of Data with Various Correlation Coefficients

Chapter 15 Regression Modeling (Modeling interesting relationships between variables)

Scatter Plots of Data with Various Correlation Coefficients
Y Y Y X X X r = -1 r = -.6 r = 0 Y Y Y X X X r = +1 r = +.3 r = 0 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Linear and Non-Linear Relationships
Curvilinear relationships Y Y X X Y Y X X Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Prediction If you know something about X, this knowledge helps you predict something about Y. (Sound familiar?…sound like conditional probabilities?)

Regression equation… Expected value of y at a given level of x=

Predicted value for an individual…
Yi=  + *xi random errori Follows a normal distribution Fixed – exactly on the line

Assumptions (or the fine print)
Linear regression assumes that… 1. The relationship between X and Y is linear 2. Y is distributed normally at each value of X 3. The variance of Y at every value of X is the same (homogeneity of variances) 4. The random errors are independent

The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

Population Linear Regression Model
Observedvalue i = Random error Observed value EPI 809/Spring 2008 35

Sample Linear Regression Model
i = Random error ^ Unsampled observation Observed value EPI 809/Spring 2008 36

Example: Cognitive function and vitamin D
Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST). 1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry Jul;80(7):722-9.

Distribution of vitamin D
Mean= 63 nmol/L Standard deviation = 33 nmol/L

Distribution of DSST Normally distributed Mean = 28 points
Standard deviation = 10 points

Four hypothetical datasets
I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST): 0.5 points per 10 nmol/L 1.0 points per 10 nmol/L 1.5 points per 10 nmol/L

The “Best fit” line Regression equation:
E(Yi) = *vit Di (in 10 nmol/L)

The “Best fit” line Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is! Regression equation: E(Yi) = *vit Di (in 10 nmol/L)

E(Yi) = *vit Di (in 10 nmol/L)

E(Yi) = *vit Di (in 10 nmol/L) Note: all the lines go through the point (63, 28)!

How to estimate the linear relationship?
That is, we want to determine the slope and intercept of the line. Y=mX+B m B

Hypothesis testing H0: slope is zero

Confidence intervals 95% CI for the slope

Qualitative checks on plausibility of model assumptions
Examining residuals we can assess If the relationship between variables is linear If the variance of the error term is the same for all values of X (homoscedasticity) If the error term is normally distributed If error terms are independent

Residual = observed - predicted
X=95 nmol/L 34

Residual Analysis for Linearity
x x x x residuals residuals  Not Linear Linear Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual Analysis for Homoscedasticity
x x x x residuals residuals  Constant variance Non-constant variance Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual Analysis for Independence
Not Independent  Independent residuals time time residuals residuals time Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual plot, dataset 4

Multiple linear regression i.e. Multivariable linear regression
What if age is a confounder here? Older men have lower vitamin D Older men have poorer cognition “Adjust or control” for age by putting age in the model: DSST score = intercept + slope1 * vitamin D + slope2 * age

2 predictors: age and vitamin D…

Fit a plane rather than a line…
On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Equation of the “Best fit” plane…
DSST score = *vitamin D (in 10 nmol/L) * age (in years) P-value for vitamin D = 0.75 P-value for age < Thus, relationship with vitamin D was due to confounding by age!

Controlling or adjusting for more variables
E(y)=  + 1*X + 2 *W + 3 *Z… Language Outcome Predictor(s) Primary predictor Covariates

Multicollinearity E.g. Age and weight (in a study of children)

How would you model a non-linear relationship?
Linear relationships Curvilinear relationships Y Y X X Y Y X X Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Modeling non-linear relationships
Change the form of the model from a straight line to whatever you want. Quadratic: Yi = β0 + β2xi2 + ϵi Quadratic: Yi = β0 + β1xi + β2xi2 + ϵi Piecewise linear: Yi = β0 + β1xi + β21{xi > 1}(xi-1) + β31{xi > 2}(xi-2) + ϵi

What if I have a predictor that is…
Binary? Pick a level to be zero, and the other to be one. Model as usual. Yi = β0 + β1xi + ϵi Categorical with 3 levels?

Categorical predictors
Suppose I have a variable such as race (black, white, other). Create three binary variables Black (yes = 1, no = 0) White (yes = 1, no = 0) Other (yes = 1, no = 0) This is called over parametrized because if black and white are ‘no’ then ‘other’ must be ‘yes.’ It is impossible to separately estimate the coefficients (slopes) for each of these. Intercept = 0, Slopeblack = 1, Slopewhite = 3, Slopeother = 0 Intercept = -10, Slopeblack = 11, Slopewhite = 13, Slopeother = 10 Solution: Exclude ‘other.’ Model: Yi = intercept + slopeblack * blacki + slopewhite * whitei + ϵi

What if my outcome is binary?
(actually, it’s not too bad) Previous model: Yi is normally distributed with: E(Yi) = intercept + slope * xi New model: Y is Bernoulli distributed with f(E(Yi)) = intercept + slope * xi f(.) is a function whose domain includes (0, 1) and whose codomain is all real numbers. For ‘logistic regression’ f(.) is the logit function, log(p/(1-p)

How to interpret results from logistic regression
The estimated ‘slopes’ don’t mean a whole lot. You can interpret them as the rate of increase in the log odds of [outcome] for each unit increase of the predictor. Not that helpful. Recall that log(p/(1-p)) = intercept + slope * xi where p = E(Yi) If you exponentiate both sides you get p/(1-p) = exp(intercept + slope * xi) Note that the left hand side is the ‘odds’ that Yi = 0. The odds ratio (for xi = x+1 vs. xi = x) is exp(intercept + slope * (xi +1)) / exp(intercept + slope * xi) = exp(slope) Thus exponentiating the coefficient gives the odds rato. If the outcome is mortality, predictor is age in years, and exp(slope) = 1.2, then the odds of mortality are 20% greater (on average) for each extra year of age.

Odds and odds ratios are not intuitive
People almost always interpret them as if they were relative risks. Odds ratio: [p1/(1-p1)] / [p2/(1-p2)] Relative risk: p1 / p2 Wouldn’t it be nice if we could model the relationship in such a way that would provide relative risks instead… Use a different ‘link’ function. Log(E(Yi)) = intercept + slope * xi Then, the relative risk (for xi = x+1 vs. xi = x) is exp(intercept + slope * (xi +1)) / exp(intercept + slope * xi) = exp(slope)

What distribution? When using the logit link function, the Bernoulli distribution is typically used (logistic regression). When using the log link function, the Poisson distribution is typically used. But… the outcome variable is binary. It certainly isn’t Poisson. If I use a Poisson distribution (mean = variance), I will overestimate my variance… Unless… It has been shown that a Poisson model is very useful for a binary outcome as long as you separately estimate mean and variance for using a ‘robust estimator.’ The model is referred to as ‘modified Poisson regression’ or ‘Poisson regression with robust error estimates.’

Another look Simple linear regression Logistic regression
Yi ~ N(β0 + β1xi , σ2) Logistic regression Yi ~ BER( exp(β0 + β1xi) / [1+exp(β0 + β1xi)] ) Poisson regression Yi ~ POI( exp(β0 + β1xi) ) If outcomes are actually binary, be sure to estimate mean and variance separately.

Scatter Plots of Data with Various Correlation Coefficients

Similar presentations

Presentation on theme: "Scatter Plots of Data with Various Correlation Coefficients"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Scatter Plots of Data with Various Correlation Coefficients

Similar presentations

Presentation on theme: "Scatter Plots of Data with Various Correlation Coefficients"— Presentation transcript:

Similar presentations

About project

Feedback