Regression and Correlation
GTECH 201, Lecture 18
ANOVA (Analysis of Variance)
- Continuation of the matched-pair difference-of-means tests, but now for 3+ groups
- We still check whether the samples come from one population or from several distinct populations
- Variance is a descriptive parameter
- ANOVA compares group means and asks whether they differ enough to reject H0
ANOVA H0 and HA
- H0: μ1 = μ2 = … = μk (all group means are equal; the samples come from a single population)
- HA: at least one group mean differs from the others
ANOVA Test Statistic
$F = \frac{MSB}{MSW}$
- MSB = between-group mean squares
- MSW = within-group mean squares
Between-group variability is calculated in three steps:
1. Calculate the overall mean as the weighted average of the sample means
2. Calculate the between-group sum of squares (SSB)
3. Calculate the between-group mean squares (MSB)
Between-group Variability
- Total (overall) mean: $\bar{X} = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N}$
- Between-group sum of squares: $SSB = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2$
- Between-group mean squares: $MSB = \frac{SSB}{k - 1}$
Within-group Variability
- Within-group sum of squares: $SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2$
- Within-group mean squares: $MSW = \frac{SSW}{N - k}$
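To make the mechanics concrete, here is a minimal sketch in plain Python that carries out these calculations: SSB, MSB, SSW, MSW, and the F ratio from grouped data. The group labels and values are hypothetical, chosen only for illustration.

```python
# Minimal sketch: between- and within-group mean squares, then F.
# Group labels and values are made up for illustration.
groups = {
    "g1": [4.0, 5.5, 6.1],
    "g2": [7.2, 6.8, 8.0, 7.5],
    "g3": [5.0, 4.4, 5.9],
}

k = len(groups)                                   # number of groups
N = sum(len(v) for v in groups.values())          # total observations
grand_mean = sum(x for v in groups.values() for x in v) / N

# Between-group sum of squares: each group's squared deviation from the
# overall mean, weighted by its sample size n_i
ssb = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())
msb = ssb / (k - 1)

# Within-group sum of squares: squared deviations from each group's own mean
ssw = sum((x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v)
msw = ssw / (N - k)

F = msb / msw
print(f"SSB={ssb:.2f}  MSB={msb:.2f}  SSW={ssw:.2f}  MSW={msw:.2f}  F={F:.3f}")
```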
Kruskal-Wallis Test
- Nonparametric equivalent of ANOVA
- Extension of the Wilcoxon rank sum W test to 3+ groups
- The average rank in sample i is $R_i / n_i$
- The Kruskal-Wallis H test statistic is
  $H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$
  where $N = n_1 + n_2 + \dots + n_k$ is the total number of observations and $R_i$ is the sum of the ranks in sample i
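The H statistic can be computed directly from the pooled ranks. Below is a short sketch, assuming no tied observations (ties would require averaged ranks and a correction factor); the three samples are invented for illustration.

```python
# Sketch of the Kruskal-Wallis H statistic (no tie correction; assumes
# all pooled values are distinct). Sample data are made up.
def kruskal_wallis_h(samples):
    # pool all observations, remembering which sample each came from
    pooled = sorted((x, i) for i, s in enumerate(samples) for x in s)
    N = len(pooled)
    # assign ranks 1..N in the pooled ordering and accumulate R_i per sample
    rank_sums = [0.0] * len(samples)
    for rank, (_, i) in enumerate(pooled, start=1):
        rank_sums[i] += rank
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
    return 12.0 / (N * (N + 1)) * sum(
        r ** 2 / len(s) for r, s in zip(rank_sums, samples)
    ) - 3 * (N + 1)

print(kruskal_wallis_h([[68, 72, 77], [81, 85, 90, 88], [60, 64, 70]]))
```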
ANOVA Example
House prices by neighborhood, in $1,000s:

A: 175  147  138  156  184  148
B: 151  183  174  181  193  205  196
C: 127  142  124  150  180
D: 174  182  210  191
ANOVA Example, continued
Sample statistics:

Neighborhood   n    X̄        s
A              6    158.00   17.83
B              7    183.29   17.61
C              5    144.60   22.49
D              4    189.25   15.48
Total         22    168.68   24.85

Now fill in the six steps of the ANOVA calculation
The Six Steps
1. Overall mean: the weighted average of the four sample means
2. Between-group sum of squares, SSB
3. Between-group mean squares, MSB = SSB / (k − 1)
4. Within-group sum of squares, SSW
5. Within-group mean squares, MSW = SSW / (N − k)
6. Test statistic F = MSB / MSW, compared against the critical F value with (k − 1, N − k) degrees of freedom
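As a check on the hand calculation, this sketch runs the six steps on the house-price data from the example (a hypothetical script, but the group lists mirror the table above).

```python
# The six ANOVA steps applied to the house-price example (values in $1,000s).
prices = {
    "A": [175, 147, 138, 156, 184, 148],
    "B": [151, 183, 174, 181, 193, 205, 196],
    "C": [127, 142, 124, 150, 180],
    "D": [174, 182, 210, 191],
}

k = len(prices)                                               # 4 neighborhoods
N = sum(len(v) for v in prices.values())                      # 22 observations
grand_mean = sum(x for v in prices.values() for x in v) / N   # step 1: 168.68

ssb = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
          for v in prices.values())                           # step 2: SSB
msb = ssb / (k - 1)                                           # step 3: MSB
ssw = sum((x - sum(v) / len(v)) ** 2
          for v in prices.values() for x in v)                # step 4: SSW
msw = ssw / (N - k)                                           # step 5: MSW
F = msb / msw                                                 # step 6: F ratio
print(f"grand mean={grand_mean:.2f}  MSB={msb:.2f}  MSW={msw:.2f}  F={F:.2f}")
```

Run as written, this gives F ≈ 6.56 on (3, 18) degrees of freedom, well above the 5% critical value of roughly 3.16, so H0 is rejected: at least one neighborhood's mean price differs.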
Correlation
- Co-relatedness between 2+ variables: as the values of one variable go up, those of the other change proportionally
- Two-step approach:
  - Graphically: scatterplot
  - Numerically: correlation coefficients
Is There a Correlation?
Scatterplots
Exploratory analysis
Pearson’s Correlation Index
Based on the concept of covariance:
- $S_{xy} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$ = covariation between X and Y
- $(X_i - \bar{X})$ = deviation of X from its mean
- $(Y_i - \bar{Y})$ = deviation of Y from its mean
Pearson’s correlation coefficient:
$r = \frac{S_{xy}}{\sqrt{S_x^2 \, S_y^2}}$, where $S_x^2 = \sum (X_i - \bar{X})^2$ and $S_y^2 = \sum (Y_i - \bar{Y})^2$
Sample and Population
- r is the sample correlation coefficient
- Applying the t distribution, we can infer the correlation for the whole population
- Test statistic for Pearson’s r:
  $t = r \sqrt{\frac{n - 2}{1 - r^2}}$, with $n - 2$ degrees of freedom
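A short sketch of both formulas, using made-up paired observations:

```python
import math

# Sketch: Pearson's r and its t test statistic (hypothetical paired data).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # covariation S_xy
    sxx = sum((a - mx) ** 2 for a in x)                   # variation S_x^2
    syy = sum((b - my) ** 2 for b in y)                   # variation S_y^2
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
r = pearson_r(x, y)
n = len(x)
t = r * math.sqrt((n - 2) / (1 - r ** 2))  # compare to t distribution, n-2 df
print(f"r={r:.3f}  t={t:.2f}  df={n - 2}")
```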
Correlation Example
Lake-effect snow
Spearman’s Rank Correlation
- Nonparametric alternative to Pearson’s r
- Logic similar to the Kruskal-Wallis and Wilcoxon tests
- Spearman’s rank correlation coefficient:
  $r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
  where $d_i$ is the difference between the two ranks of observation i
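A minimal sketch of the rank-based computation, assuming no tied ranks (ties would be averaged in practice); the data are invented:

```python
# Sketch: Spearman's r_s via the d^2 formula (assumes no tied ranks).
def spearman_rs(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# perfectly monotone except one swapped pair -> r_s = 0.9
print(spearman_rs([10, 20, 30, 40, 50], [12, 25, 22, 48, 60]))
```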
Regression
- In correlation we observe degrees of association, but no causal or functional relationship
- In regression analysis, we distinguish an independent from a dependent variable
- Many forms of functional relationships:
  - bivariate linear
  - multivariate
  - non-linear (curvilinear)
Graphical Representation
- In correlation analysis, either variable could be depicted on either axis
- In regression analysis, the independent variable is always on the X axis
- The bivariate relationship is described by a best-fitting line through the scatterplot
Least-Squares Regression
Objective: minimize the sum of squared residuals,
$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
Regression Equation
$Y = a + bX$
- Slope: $b = \frac{S_{xy}}{S_x^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
- Intercept: $a = \bar{Y} - b\bar{X}$ (the fitted line passes through the point of means)
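The slope and intercept fall out of the least-squares objective directly. A sketch with hypothetical data:

```python
# Sketch: least-squares slope and intercept (hypothetical paired data;
# X is the independent variable).
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))  # covariation S_xy
    sxx = sum((a - mx) ** 2 for a in x)                   # variation S_x^2
    b = sxy / sxx                  # slope b = S_xy / S_x^2
    a = my - b * mx                # intercept: line passes through the means
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2.0, 4.1, 5.9, 8.2, 9.9])
print(f"Y = {a:.2f} + {b:.2f} X")
```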
Strength of Relationship
How much of the variation in Y is explained by the regression equation?
Coefficient of Determination
- Total variation of Y (all the bucket water): $\sum y^2 = \sum y_e^2 + \sum y_u^2$
- Capital Y = dependent variable; small y = deviation of each value of Y from its mean, $y = Y - \bar{Y}$
- Subscript e = explained; u = unexplained
Explained Variation
The explained variation is the ratio of the square of the covariation between X and Y to the variation in X:
$\frac{S_{xy}^2}{S_x^2}$
where $S_{xy}$ = covariation between X and Y and $S_x^2$ = total variation of X
Dividing the explained variation by the total variation of Y gives the coefficient of determination:
$r^2 = \frac{S_{xy}^2}{S_x^2 \, S_y^2}$
Error Analysis
- $r^2$ tells us what percentage of the variation in Y is accounted for by the independent variable
- This then allows us to infer the standard error of the estimate,
  $s_e = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}}$
  which tells us, on average, how far off our prediction would be, in measurement units
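Both quantities follow from the fitted line's residuals. A sketch reusing the hypothetical data from the regression sketch above:

```python
import math

# Sketch: coefficient of determination and standard error of the estimate
# for a line fitted by least squares (hypothetical data).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.9]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
a0 = my - b * mx                                   # intercept
resid = [c - (a0 + b * a) for a, c in zip(x, y)]   # unexplained deviations
ss_res = sum(e ** 2 for e in resid)                # unexplained variation
ss_tot = sum((c - my) ** 2 for c in y)             # total variation of Y
r2 = 1 - ss_res / ss_tot                           # share explained by X
se = math.sqrt(ss_res / (n - 2))                   # average prediction error
print(f"r^2={r2:.3f}  standard error={se:.3f} (in Y's units)")
```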