Lecture 6: Linear Regression and Correlation
Nigel Rozario, MS; Jie Zhou, MS; H. James Norton, PhD. October 17, 2013
Introduction
Example: the line y = 2 + 5x passes through the points (0, 2), (1, 7), and (2, 12).
Linear Regression
Linear regression is an approach to modeling the relationship between a scalar response variable y and one or more explanatory variables denoted X. Linear regression has many practical uses:
- Prediction and forecasting
- Quantifying the strength of the relationship between y and the Xj
Assumptions (LINE-X)
L: Linear (in the parameters)
I: Independent errors
N: Normally distributed errors
E: Equal variances of the errors (homoscedasticity)
X: The regressors xi are assumed to be error-free, that is, not contaminated with measurement error.
Correlation
Pearson's product-moment correlation coefficient.
Assumptions: the x and y values follow a bivariate normal distribution.
For ordinal data, or data that are not normally distributed, use Spearman's correlation.
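The two coefficients above can be computed directly from their definitions. The sketch below is not from the lecture; it is a minimal standard-library illustration, applied to the six age/blood-pressure pairs that appear in the later simple-regression example (slide 16). The tie-handling in the rank helper is the simplest possible (no tie correction), which is adequate here because neither variable has ties.

```python
# Pearson's r from its definition, and Spearman's rho as Pearson's r on ranks.
# Illustrative sketch only; the data are the age/sbp pairs from a later slide.
from math import sqrt

def pearson_r(x, y):
    """r = Sxy / sqrt(Sxx * Syy), the product-moment correlation."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson's r applied to the ranks (no correction for ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))

ages = [18, 33, 27, 58, 20, 30]
sbp  = [120, 130, 134, 148, 110, 137]
print(round(pearson_r(ages, sbp), 3))   # ≈ 0.846
print(round(spearman_rho(ages, sbp), 3))
```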
Caution!
Method of Least Squares
Let's use an example: the revised SAT ranking. UNC-Chapel Hill journalism professor Phil Meyer used statistical techniques (least-squares regression) to adjust for the different SAT participation rates of the 50 states and the District of Columbia. In essence, the technique adjusts the data to reflect what the SAT scores would likely be if the same percentage of students in every state took the test.
Data spreadsheet (— marks values that did not survive from the original slide):

State            Raw_Score  Taking_Test  Orig_Rank  Adjusted_Rank  Adjusted_Score
New Hampshire    921        0.75         28         1              993
Iowa             1093       0.05         —          2              990
North Dakota     1073       0.06         —          3              981
Kansas           1039       0.10         —          4              —
Illinois         1006       0.16         10         5              978
Minnesota        1023       0.12         7          6              976
Montana          982        0.22         19         —              974
Connecticut      897        0.81         33         8              —
North Carolina   844        0.57         48         49             898
South Carolina   832        0.58         51         50             887
Oregon           922        0.54         27         9              972
Massachusetts    896        0.79         35         —              971
Wisconsin        —          0.11         —          11             —
Colorado         859        0.29         23         12             969
Tennessee        1015       —            —          13             968
Nebraska         1024       —            —          14             966
Maryland         904        0.64         32         15             965
Washington       913        0.49         31         16             957
New Jersey       886        0.74         39         17             —
Vermont          890        0.68         37         18             955
Reading the regression output:
- Outcome variable (Y): Raw_Score; predictor variable (X): Taking_Test
- R2 shows the proportion of the variance of Y explained by X
- The model p-value tests whether R2 is different from 0
- Root mean squared error is the SD of the regression residuals; the closer to zero, the better the fit
- The two-tailed p-value tests the hypothesis that each coefficient is different from 0
- Fitted model: Expected Score = 1020.61 - 220.51 x Taking_Test
Expected Score (North Carolina) = 1020.61 - 220.51 x (Taking_Test)
= 1020.61 - 220.51 x 0.57 ≈ 894.9
Residual (or error) = Raw Score - Expected Score = 844 - 894.9 ≈ -51
The percentage of students who took the test only partly explains each state's SAT score.
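The North Carolina calculation above can be reproduced in a few lines. This sketch takes the fitted intercept and slope from the regression output (1020.61 and -220.51) and North Carolina's values from the data spreadsheet (Taking_Test = 0.57, Raw_Score = 844); nothing here is new estimation, only the plug-in arithmetic.

```python
# Plug North Carolina into the fitted line from the regression output:
# Expected Score = 1020.61 - 220.51 * Taking_Test
b0, b1 = 1020.61, -220.51   # intercept and slope from the SAT regression
taking_test = 0.57          # fraction of NC students taking the SAT
raw_score = 844             # NC raw SAT score

expected = b0 + b1 * taking_test
residual = raw_score - expected      # observed minus fitted
print(round(expected, 1), round(residual, 1))  # ≈ 894.9 and -50.9
```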
Another Example

Sbp   Age
120   18
130   33
134   27
148   58
110   20
137   30
Expected sbp = 105.6 + 0.782 x (age)
When age = 30, expected sbp = 105.6 + 0.782 x (30) ≈ 129
Residual = observed sbp - expected sbp = 137 - 129 = 8
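The fitted line on this slide can be recovered from the six data points with the closed-form least-squares formulas b1 = Sxy/Sxx and b0 = ybar - b1*xbar. The sketch below is not from the lecture; it simply refits the slide's own data and checks the stated fitted value (≈ 129 at age 30) and residual (8).

```python
# Least-squares fit of sbp on age for the six observations on this slide.
ages = [18, 33, 27, 58, 20, 30]
sbp  = [120, 130, 134, 148, 110, 137]

n = len(ages)
xbar = sum(ages) / n
ybar = sum(sbp) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, sbp))
sxx = sum((x - xbar) ** 2 for x in ages)

b1 = sxy / sxx                    # slope ≈ 0.782
b0 = ybar - b1 * xbar             # intercept ≈ 105.6

expected_30 = b0 + b1 * 30        # fitted sbp at age 30
residual_30 = 137 - expected_30   # observed minus fitted
print(round(expected_30), round(residual_30))  # 129 and 8
```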
Multiple Linear Regression
Data (first 10 observations; — marks a value that did not survive from the original slide):

age   bmi     sbp
28    24.33   111
26    25.09   101
31    26.61   120
18    32.26   158
50    22.71   125
42    36.48   166
20    25.18   114
29    21.91   143
35    29.41   —
47    27.28   133

R2 shows the proportion of the variance of SBP explained by Age and BMI. The two-tailed p-value tests the hypothesis that each coefficient is different from 0.
Reference: Biostatistics: A Guide to Design, Analysis and Discovery, 2nd ed. (Forthofer, Lee, Hernandez)
Pearson Correlation Coefficients, N = 50; Prob > |r| under H0: Rho = 0
(Correlation matrix over bmi, age, and sbp; the values were not preserved from the slide.)
Predicted SBP = … x Age + 1.3 x BMI
When Age = 28 and BMI = 24.33, Predicted SBP = 116.95
Residual = Predicted SBP - Observed SBP = 116.95 - 111 = 5.95
(With the convention used on the earlier slides, residual = observed - predicted = -5.95.)
Conclusion
Simple linear regression: one covariate x.
Multiple linear regression: multiple covariates X.
For the first example, other factors might also influence the SAT scores:
- The percentage of parents with a college education
- The cost of education per student in each state
Adding more covariates always increases R2. This brings up another statistics topic: the goodness-of-fit (GOF) test.
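Because R2 never decreases as covariates are added, adjusted R2 is often reported alongside it. The sketch below is not from the lecture; it computes R2 for the earlier age/sbp simple regression (where R2 equals the squared Pearson r) and applies the standard adjustment 1 - (1 - R2)(n - 1)/(n - p - 1), which penalizes the number of covariates p.

```python
# R² and adjusted R² for the simple regression of sbp on age (earlier slide).
ages = [18, 33, 27, 58, 20, 30]
sbp  = [120, 130, 134, 148, 110, 137]

n, p = len(ages), 1               # n observations, p covariates
xbar = sum(ages) / n
ybar = sum(sbp) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, sbp))
sxx = sum((x - xbar) ** 2 for x in ages)
syy = sum((y - ybar) ** 2 for y in sbp)

r2 = sxy ** 2 / (sxx * syy)       # for simple regression, R² = r²
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(adj_r2, 3))  # 0.715 0.644
```

Unlike R2, adjusted R2 can fall when a covariate adds little explanatory power, which is why it is the more honest summary when comparing models with different numbers of covariates.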
Questions or Comments?