Published by Gilbert Kelly. Modified over 9 years ago.
1
Université d’Ottawa / University of Ottawa 2001 Bio 4118 Applied Biostatistics L11.1
Simple linear regression
What regression analysis does
The simple regression model
Hypothesis testing in regression
Residual analysis
Inverse prediction, replicated regression and weighted regression
Regression caveats
Power considerations in simple linear regression
2
What regression does
Fits a straight line through a cloud of data.
Tests and quantifies the effect of an independent variable X on a dependent variable Y.
The intensity of the effect is given by the slope ($b$) of the regression.
The importance of the effect is given by the coefficient of determination ($r^2$).
[Figure: Y plotted against X with the fitted line; the slope is $b = \Delta Y / \Delta X$.]
3
Regression and correlation coefficients
The slope $b$ is estimated as: $b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
The correlation $r$ is: $r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
So, $b = r$ if X and Y have the same variance, and $b = 0$ if and only if $r = 0$.
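The link between the two coefficients can be checked numerically. The sketch below assumes the usual sums-of-products definitions of the slope and the correlation; the data are hypothetical and serve only to show that b equals r scaled by the ratio of the standard deviations of Y and X.

```python
import math

def slope_and_correlation(x, y):
    """Return (b, r, s_x, s_y) for paired samples x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    return b, r, s_x, s_y

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, r, s_x, s_y = slope_and_correlation(x, y)
print(f"b = {b:.3f}, r = {r:.4f}, r * s_y / s_x = {r * s_y / s_x:.3f}")
```

When s_x equals s_y the scaling factor is 1, which is the "same variance" case on the slide.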
4
How it does it
By the method of least squares, which minimizes the sum of squared deviations between the observations and the regression line, i.e. the sum of squared residuals.
The squared deviation of an observation is given by: $\varepsilon_i^2 = (Y_i - \hat{Y}_i)^2$
[Figure: Y plotted against X, showing the residual $\varepsilon_i$ as the vertical distance between an observation and the line.]
5
Regression or correlation?
Correlation: degree of association between two variables X and Y; no causal relationship assumed!
Regression: to predict the value of the dependent variable if the independent variable were changed; causal relationship assumed!
6
When do we use regression?
Don’t use it to determine the strength of association between two variables.
Do use it if you want to predict the value of Y given X.
[Figure: two panels, "Regression" (Y against X) and "Correlation" (X2 against X1).]
7
The simple regression model
The regression model is: $Y_i = \alpha + bX_i + \varepsilon_i$
So, all simple regression models are described by 2 parameters, the intercept ($\alpha$) and the slope ($b$).
[Figure: Y plotted against X, showing the intercept $\alpha$, the slope $b = \Delta Y / \Delta X$, and the residual $\varepsilon_i$ as the difference between the observed and expected $Y_i$ at $X_i$.]
8
Assumptions
Residuals are independent and normally distributed.
The variance of the residuals is equal for all X (homoscedasticity).
The relationship between Y and X is linear.
There is no measurement error on X (Model I regression).
9
Measurement error
The assumption of no error on X can be examined beforehand, and is almost invariably violated.
It is only of concern when measurement error is large relative to the magnitude of X (say, > 10%).
If the assumption is invalid, then Model II regression is required.
10
Residual analysis I: independence
Plot residuals against estimates and look for patterns.
Do an autocorrelation function (ACF) plot.
[Figure: residuals plotted against estimates.]
11
Residual analysis II: Normality
Plot residuals against estimates; look for patterns.
Do a normal probability plot.
Check with the Lilliefors test.
[Figure: normal probability plots of residuals against normal equivalent deviates (NEDs) for normal and non-normal residuals, and residuals plotted against estimates.]
12
Residual analysis III: Homoscedasticity
Plot residuals against estimates; look for patterns.
Check with Levene’s test by grouping Y’s into several classes.
[Figure: residuals plotted against estimates, with the estimates divided into Groups 1, 2 and 3.]
13
Residual analysis IV: Linearity
Plot residuals against estimates; look for patterns.
[Figure: Y plotted against X, and residuals plotted against estimates.]
14
Robustness of regression with respect to violation of assumptions
15
What to do when assumptions aren’t met
Try transforming the data, but remember: (1) for some data, no transformation will work; (2) finding an appropriate transformation may not be easy.
Use non-linear regression.
16
Transformations in regression
17
Transformations in regression
[Figure: chirp rate (chirps/min) as a function of temperature (°C) in males of the cricket Oecanthus fultoni, plotted on a linear scale and on a log scale.]
18
Transformations in regression
[Figure: electrical resistance as a function of illumination in cephalopod eyes; millivolts plotted against relative brightness (times) on a linear scale and on a log scale.]
19
Hypothesis testing I: partitioning the total sums of squares
Total SS = Model (Explained) SS + Unexplained (Error) SS
$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$
20
Hypothesis testing I: partitioning the total sums of squares
So, $MS_{regression} = s_Y^2$ and $MS_{error} = 0$ if observed = expected.
Calculate $F = MS_R / MS_e$ and compare with the F distribution with 1 and N - 2 df.
$H_0$: $b = 0$
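The partition and the F ratio can be sketched in a few lines. This is an illustration under the standard definitions (model df = 1, error df = N - 2), with hypothetical data; the function name and the returned keys are assumptions for the example. The p-value is not computed here because the Python standard library has no F distribution; the ratio would be compared with a tabulated critical value.

```python
def regression_anova(x, y):
    """Partition the total SS of y into model and error SS and return the F ratio."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # least-squares slope
    a = mean_y - b * mean_x            # least-squares intercept
    fitted = [a + b * xi for xi in x]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_model = sum((fi - mean_y) ** 2 for fi in fitted)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ms_model = ss_model / 1            # model df = 1
    ms_error = ss_error / (n - 2)      # error df = N - 2
    return {
        "ss_total": ss_total,
        "ss_model": ss_model,
        "ss_error": ss_error,
        "F": ms_model / ms_error,
        "r2": ss_model / ss_total,
    }

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
table = regression_anova(x, y)
print(table)
```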
21
Standard error of the slope
The standard error $s_b$ and 100(1 - $\alpha$)% CIs of the slope are: $s_b = \sqrt{\frac{MS_e}{\sum (X_i - \bar{X})^2}}$ and $b \pm t_{\alpha(2),\,N-2}\, s_b$
So, for fixed N, we can decrease $s_b$ by expanding the range of X values sampled.
[Figure: two plots of Y against X; $s_b$ is smaller when the X values span a wide range and larger when they span a narrow range.]
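The effect of the X range on the standard error of the slope can be illustrated with a small sketch. It assumes the usual formula, the square root of the error mean square divided by the sum of squares of X, and uses two hypothetical designs with the same N and the same residuals but different X ranges.

```python
import math

def slope_standard_error(x, y):
    """Return (b, s_b): the least-squares slope and its standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    ms_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return b, math.sqrt(ms_error / sxx)

# Hypothetical designs: same N, same noise, different spread of X
noise = [0.4, 0.4, 0.4, -0.4, -0.4, -0.4]
x_narrow = [4, 5, 6, 4, 5, 6]
x_wide = [1, 5, 9, 1, 5, 9]
y_narrow = [2 + 0.5 * xi + e for xi, e in zip(x_narrow, noise)]
y_wide = [2 + 0.5 * xi + e for xi, e in zip(x_wide, noise)]

b_narrow, se_narrow = slope_standard_error(x_narrow, y_narrow)
b_wide, se_wide = slope_standard_error(x_wide, y_wide)
print(f"narrow range: s_b = {se_narrow:.4f}; wide range: s_b = {se_wide:.4f}")
```

In this constructed case the wide design spreads X four times as far from its mean, so the standard error of the slope shrinks by the same factor.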
22
Standard error of the intercept
The standard error $s_\alpha$ of the intercept is: $s_\alpha = \sqrt{MS_e \left( \frac{1}{N} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)}$
So, for fixed N, we can decrease $s_\alpha$ by expanding the range of X values sampled.
[Figure: two plots of Y against X; $s_\alpha$ is smaller when the X values span a wide range and larger when they span a narrow range.]
23
Hypothesis testing II: testing model parameters
Test each hypothesis by a t-test: $t = \alpha / s_\alpha$ for $H_{01}$: $\alpha = 0$, and $t = b / s_b$ for $H_{02}$: $b = 0$, each with N - 2 df.
Note: these are 2-tailed hypotheses!
[Figure: observed and expected regression lines under $H_{01}$: $\alpha = 0$ (line through Y = 0 at X = 0) and under $H_{02}$: $b = 0$ (horizontal line).]
24
Hypothesis testing III: one-tailed hypotheses
Biological theory predicts that Y should increase with X.
So, $H_0$: $b \le 0$ (one-tailed).
Calculate: $t_b = b / s_b$
Reject if $t_b > 0$ and p (one-tailed) < $\alpha$.
[Figure: two plots of Y against X, one where $H_0$ is accepted and one where $H_0$ is rejected.]
25
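Confidence intervals in regression
100(1 - $\alpha$)% CI for estimated values: $\hat{Y} \pm t_{\alpha(2),\,N-2} \sqrt{MS_e \left( \frac{1}{N} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)}$
100(1 - $\alpha$)% CI for observations: $\hat{Y} \pm t_{\alpha(2),\,N-2} \sqrt{MS_e \left( 1 + \frac{1}{N} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)}$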
26
Confidence intervals in regression
The CI for observations is larger than the CI for estimated values.
CIs for both estimated values and observations increase with increasing distance between the X value and the mean of the sample.
[Figure: Y plotted against X with the fitted line and the confidence bands for estimates and for observations.]
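Both properties follow from the standard errors that sit inside the two intervals. The sketch below assumes the usual textbook expressions for the standard error of an estimated value and of a single new observation; the function names and data are hypothetical. Multiplying each standard error by the appropriate t critical value (not computed here) would give the half-width of the interval.

```python
import math

def fit_line(x, y):
    """Return intercept, slope, mean of x, sum of squares of x, and error mean square."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    ms_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return a, b, mean_x, sxx, ms_error

def standard_errors_at(x, y, x0):
    """Standard errors of the estimated value and of a new observation at x0."""
    _, _, mean_x, sxx, ms_error = fit_line(x, y)
    n = len(x)
    distance_term = (x0 - mean_x) ** 2 / sxx
    se_estimate = math.sqrt(ms_error * (1 / n + distance_term))
    se_observation = math.sqrt(ms_error * (1 + 1 / n + distance_term))
    return se_estimate, se_observation

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
at_mean = standard_errors_at(x, y, 3.0)   # x0 at the sample mean of X
far_out = standard_errors_at(x, y, 8.0)   # x0 well beyond the sampled range
print(at_mean, far_out)
```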
27
Outliers
Points that appear to lie well off the fitted line.
Issue 1: are “apparent” outliers really outliers?
Issue 2: do they significantly affect the statistical conclusions?
[Figure: Y plotted against X with one point, marked "Outlier?", lying well off the fitted line.]
28
Outlier analysis I: Studentized residuals
Plot Studentized residuals against estimated values.
“Large” residuals are those with absolute value > 3.0.
Such cases make large contributions to the residual mean square of the regression.
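One common definition of a Studentized residual is the deleted (externally Studentized) form, in which each residual is scaled by an error estimate computed without that case. Whether the course software used this form or the internally Studentized one is not stated, so the choice here is an assumption; the data are hypothetical, with one deliberately aberrant point.

```python
import math

def studentized_deleted_residuals(x, y):
    """Return the externally Studentized (deleted) residual of each case."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)
    result = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - mean_x) ** 2 / sxx     # leverage of this case
        sse_without = sse - e * e / (1 - h)      # error SS of the fit with the case removed
        ms_without = sse_without / (n - 3)       # one fewer case, two parameters
        result.append(e / math.sqrt(ms_without * (1 - h)))
    return result

# Hypothetical data: a line with small noise and one aberrant point at index 5
x = [float(i) for i in range(1, 11)]
noise = [0.2, -0.1, 0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.2, 0.0]
y = [2 * xi + e for xi, e in zip(x, noise)]
y[5] += 8.0

t_values = studentized_deleted_residuals(x, y)
flagged = [i for i, t in enumerate(t_values) if abs(t) > 3.0]
print(flagged)
```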
29
Outlier analysis II: Leverage
Leverage measures the potential influence of the case on the regression line.
It is determined by the X value only, so that points far from the mean have higher leverage.
“Large” = anything greater than 4/N.
[Figure: Y plotted against X, contrasting a point with small leverage (X near the mean) and a point with large leverage (X far from the mean).]
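For simple regression, leverage can be computed directly from the X values. A minimal sketch, assuming the standard expression 1/N plus the squared deviation from the mean of X divided by the sum of squares of X; the data are hypothetical, with one X value far from the rest.

```python
def leverages(x):
    """Leverage of each case in a simple linear regression; depends on X only."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical X values: nine clustered points and one far from the mean
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
h = leverages(x)
threshold = 4 / len(x)                      # the "large" cutoff of 4/N
high_leverage = [i for i, hi in enumerate(h) if hi > threshold]
print(high_leverage)
```

The leverages always sum to the number of fitted parameters (2 here), which is a quick sanity check on the calculation.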
30
Outlier analysis III: Cook’s distance
Cook’s distance measures both leverage and contribution to the residual mean square, i.e. the actual influence of a point.
“Large” = anything greater than 1.
[Figure: Y plotted against X, contrasting points with smaller and larger Cook’s distance.]
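Cook’s distance combines the residual and the leverage of each case. The sketch below assumes the standard formula for a model with p = 2 parameters, D = e² / (p · MSe) · h / (1 - h)²; the data are hypothetical, with the last point both far from the mean of X and off the line followed by the others.

```python
def cooks_distances(x, y):
    """Cook's distance of each case in a simple linear regression (p = 2 parameters)."""
    n = len(x)
    p = 2
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ms_error = sum(e * e for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - mean_x) ** 2 / sxx    # leverage
        distances.append(e * e / (p * ms_error) * h / (1 - h) ** 2)
    return distances

# Hypothetical data: nine points on y = 2x, plus a high-leverage point off that line
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
y = [2 * xi for xi in x]
y[-1] = 30.0

d = cooks_distances(x, y)
influential = [i for i, di in enumerate(d) if di > 1.0]
print(influential)
```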
31
Resolving outlier problems
Do they have a significant effect on regression results?
To determine this, delete them, rerun the analyses and compare results.
Are the slope and intercept estimates significantly affected, i.e. do they still lie within the 95% CIs of the original estimates?
[Figure: regression lines fitted with outliers in and with outliers out, for a case with no significant effect and a case with a significant effect.]
32
The effects of outlier deletion
Reduces sample size (N), thereby reducing power.
Decreases $MS_e$, so $s_b$ decreases, and power increases.
If N is small, the former effect will probably outweigh the latter unless outliers are very aberrant.
[Figure: power (1 - $\beta$), from 0 to 1, as a function of N with $s_b$ fixed and as a function of $s_b$ with N fixed.]
33
Inverse prediction
Regression of Y on X, but we want to predict X, given Y.
Regression of X on Y is not possible due to error in Y.
E.g. calibration curves: we want to predict concentration from a reading, based on the regression of reading on known solute concentrations.
[Figure: reading plotted against concentration, and concentration plotted against reading showing the error in “X”.]
34
Inverse prediction
Regress Y on X.
Generate the predicted value of X given Y.
Calculate 95% confidence limits for the “X” estimate based on the 95% confidence limits for the “Y” estimate from the standard regression.
[Figure: fitted line with its confidence band, showing the predicted “X” for a given Y and its lower and upper 95% limits.]
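The first two steps amount to fitting Y on X and then solving the fitted line for X. The sketch below covers only that point estimate, using hypothetical calibration data; the confidence limits described on the slide require inverting the confidence band and are not computed here.

```python
def inverse_predict(x, y, y_observed):
    """Fit y on x by least squares, then solve the fitted line for x at y_observed."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    return (y_observed - a) / b

# Hypothetical calibration curve: readings at known solute concentrations
concentration = [0.0, 1.0, 2.0, 3.0, 4.0]
reading = [0.1, 2.1, 4.1, 6.1, 8.1]

x_hat = inverse_predict(concentration, reading, 5.1)
print(f"predicted concentration: {x_hat:.2f}")
```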
35
Regression with replication
When several Y’s are measured for each X.
In this case, we can test the linearity assumption directly by testing the MS due to deviations from linearity over the MS within groups.
[Figure: partitioning of the sums of squares; labels: Regression SS, SS due to nonlinearity, Group SS, Within-group SS, Error SS.]
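The linearity test can be sketched as follows. This assumes the usual partition for replicated designs, in which the among-groups SS splits into the regression SS and the SS due to deviations from linearity, with k - 2 and N - k degrees of freedom for the F ratio (k being the number of distinct X values). The function name and data are hypothetical.

```python
def lack_of_fit(groups):
    """groups maps each X value to the list of Y values measured at that X."""
    xs = [x for x, ys in groups.items() for _ in ys]
    ys_all = [y for ys in groups.values() for y in ys]
    n = len(ys_all)
    k = len(groups)
    mean_x = sum(xs) / n
    mean_y = sum(ys_all) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys_all))
    ss_total = sum((y - mean_y) ** 2 for y in ys_all)
    ss_regression = sxy ** 2 / sxx
    # Variation of replicates around their own group mean
    ss_within = sum((y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys)
    ss_groups = ss_total - ss_within
    ss_nonlinear = ss_groups - ss_regression
    f_ratio = (ss_nonlinear / (k - 2)) / (ss_within / (n - k))
    return {
        "ss_total": ss_total,
        "ss_regression": ss_regression,
        "ss_nonlinear": ss_nonlinear,
        "ss_within": ss_within,
        "F": f_ratio,
        "df": (k - 2, n - k),
    }

# Hypothetical replicated data: two Y values at each of four X values
groups = {1.0: [1.0, 1.2], 2.0: [2.1, 1.9], 3.0: [2.9, 3.2], 4.0: [4.1, 3.9]}
result = lack_of_fit(groups)
print(result)
```

A small F ratio relative to the tabulated critical value gives no evidence against linearity.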
36
Weighted regression
Used when our confidence in the values of individual observations varies, e.g. different measurement error or precision.
In replicated designs, the variance of Y for a given X may vary among X’s, as may the sample size (N).
So, weight by N or by the inverse of the sample variance.
[Figure: Y plotted against X.]
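A weighted fit replaces the ordinary means and sums of squares with weighted ones. The following is a minimal sketch of weighted least squares for a single predictor; the weights would be the group sample sizes or inverse sample variances mentioned above, and the data and weights shown are hypothetical.

```python
def weighted_fit(x, y, w):
    """Weighted least-squares intercept and slope for a single predictor."""
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: the last observation is imprecise and gets a small weight
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 3.0, 10.0]
a_equal, b_equal = weighted_fit(x, y, [1.0, 1.0, 1.0, 1.0])       # same as ordinary least squares
a_weighted, b_weighted = weighted_fit(x, y, [1.0, 1.0, 1.0, 0.01])
print(f"equal weights: b = {b_equal:.2f}; down-weighted last point: b = {b_weighted:.2f}")
```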
37
Regression caveats I: causation
A statistically significant regression of Y on X need not imply a causal relationship between the two.
A non-significant linear regression need not imply the lack of a causal relationship if the causal relationship is non-linear.
[Figure: a diagram linking X and Y through a third variable Z, and a plot of a non-linear relationship between Y and X for which the linear $H_0$ is accepted.]
38
Regression caveats II: small samples
Significant regressions can be obtained by chance, i.e. even when no (linear) causal relationship exists.
This is especially true if sample sizes are small.
So when doing multiple simple regressions, control the experiment-wise error rate $\alpha_e$.
[Figure: Y plotted against X, contrasting the true regression ($H_0$ accepted) with a sample regression ($H_0$ rejected).]
39
Regression caveats III: large samples
When N is large, only very small regression coefficients are required to reject $H_0$ (power is large).
So, be careful of “overinterpreting” the observed relationship if $R^2$ is small.
[Figure: Y plotted against X for a large sample; the true regression is detected ($H_0$ rejected) but $R^2$ is small.]
40
Regression caveats IV: extrapolation and interpolation
Be careful when (1) predictions lie outside the range of the sample; (2) predictions are for values where data are sparse.
[Figure: two plots of Y against X showing the observations, the estimated relation and the true relation, and the gap between the predicted value and the true value.]
41
The final word on extrapolation
In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-six miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian period, just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing rod. And by the same token, any person can see that seven hundred and forty-two years from now, the lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen.
Mark Twain, Life on the Mississippi
42
Power and sample size in simple linear regression
Because the correlation coefficient $r$ and the regression coefficient $b$ are closely related, i.e. $r = b\, s_X / s_Y$, we can transform $b$ to $r$ and evaluate power using $r$.
[Figure: Y plotted against X.]
43
Power and sample size in regression
If we test $H_0$: $b = 0$ with sample size $n$, we can determine $1 - \beta$ by calculating the z-transformed values for the critical value of the corresponding $r$ at the specified $\alpha$ ($z_\alpha$) and for the sample regression coefficient $b$ ($z_r$), and then the one-tailed probability of the normal deviate: $Z_{\beta(1)} = (z_\alpha - z_r)\sqrt{n - 3}$
[Figure: Y plotted against X.]
44
Power and sample size in regression
Once $Z_{\beta(1)}$ is determined, we can calculate the probability of obtaining a Z-value of this size or greater, i.e. $\beta$.
Power is then $1 - \beta$.
[Figure: normal curve showing $Z_{\beta(1)}$ and the tail probability p.]
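The calculation can be sketched with Fisher's z transformation (the inverse hyperbolic tangent). The version below is an approximation under stated assumptions: the critical r is supplied by the caller, for example from a table, because the Python standard library has no t distribution; the normal deviate is taken as the difference of the two z-transformed values times the square root of n - 3; and beta is read as the lower tail at that deviate, which equals the upper tail beyond its absolute value when the deviate is negative. The input values are illustrative and are not taken from the lecture.

```python
import math
from statistics import NormalDist

def regression_power(r_sample, r_critical, n):
    """Approximate power of the test of H0: b = 0, evaluated through r."""
    z_r = math.atanh(r_sample)          # Fisher z of the sample correlation
    z_alpha = math.atanh(r_critical)    # Fisher z of the critical correlation
    z_beta = (z_alpha - z_r) * math.sqrt(n - 3)
    beta = NormalDist().cdf(z_beta)     # probability of a Type II error
    return 1.0 - beta

# Illustrative values: r = 0.870 from n = 12 cases, with a tabulated
# two-tailed 5% critical r of 0.576 for 10 degrees of freedom
power = regression_power(0.870, 0.576, 12)
print(f"power = {power:.3f}")
```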
45
Power and sample size in regression: an example
Changes in wing length with age in a sample of 13 birds.
So $1 - \beta$ = 1.00.
46
Minimal sample size in regression
Given desired power $1 - \beta$, how large a sample is required to reject $H_0$: $b = 0$ if it is false and the true regression coefficient is at least $b_0$?
To do so, first calculate the correlation coefficient $\rho_0$ corresponding to $b_0$.
[Figure: Y plotted against X, showing the observed line, the line expected under $H_0$: $b = 0$ and the true regression, with the question "Reject $H_0$?".]
47
Minimal sample size in regression (cont’d)
…then calculate: $n = \left( \frac{Z_{\beta(1)} + Z_{\alpha}}{\zeta_0} \right)^2 + 3$, where $\zeta_0$ is the z-transformed value of $\rho_0$.
[Figure: Y plotted against X, showing the observed line, the line expected under $H_0$: $b = 0$ and the true regression, with the question "Reject $H_0$?".]
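A sample-size calculation of this kind can be sketched with the Fisher z approximation found in standard biostatistics texts: n is the square of the sum of the two normal deviates divided by the z-transformed correlation, plus 3. Whether this is exactly the expression shown on the original slide cannot be confirmed from the transcript, so treat it as an assumption. The function name and the input values are hypothetical and work in terms of the correlation that corresponds to the minimal slope of interest.

```python
import math
from statistics import NormalDist

def minimal_sample_size(rho_0, alpha_two_tailed, power):
    """Approximate n needed to reject H0 with the given power when the true correlation is rho_0."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha_two_tailed / 2)   # two-tailed critical deviate
    z_beta = normal.inv_cdf(power)                       # one-tailed deviate for beta = 1 - power
    zeta_0 = math.atanh(rho_0)                           # Fisher z of the target correlation
    n = ((z_beta + z_alpha) / zeta_0) ** 2 + 3
    return math.ceil(n)                                  # round up to a whole case

# Illustrative values: detect a correlation of 0.5 with 99% power at a two-tailed alpha of 0.05
n_required = minimal_sample_size(0.5, 0.05, 0.99)
print(n_required)
```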
48
Minimal sample size: an example
We want to reject $H_0$: $b = 0$ 99% of the time when $b_0 > 0.2$ and $\alpha(2) = .05$.
So $\beta(1) = .01$ and…
For $b = .20$, we have...
49
Minimal sample size (cont’d)
So… …and…
So, a sample size of at least 8 should be used.